From owner-freebsd-net@FreeBSD.ORG Sun Jun 15 14:39:33 2014
Message-ID: <539DB018.5020702@gmail.com>
Date: Sun, 15 Jun 2014 19:09:20 +0430
From: Hooman Fazaeli
To: Mark van der Meulen
Cc: freebsd-net@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: FreeBSD 9 w/ MPD5 crashes as LNS with 300+ tunnels. Netgraph issue?

On 6/15/2014 3:39 PM, Mark van der Meulen wrote:
> Hi List,
>
> I'm wondering if anyone can help me with this problem, or at least
> point me in the direction of where to start looking. I have FreeBSD
> 9 based servers which are crashing every 4-10 days and producing
> crash dumps similar to this one: http://pastebin.com/F82Jc08C
>
> All crash dumps seem to involve the netgraph code, and the current
> process is always ng_queueX.
>
> In summary, we have 4 x FreeBSD servers running as LNS (MPD5) for
> around 2000 subscribers. Three of the servers run a modified version
> of BSDRP; the fourth runs a FreeBSD 9 install with what I thought
> was the latest stable kernel source, because I fetched it from
> stable/9, although it shows up as 9.3-BETA in uname. The linked
> crash dump is from that server.
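A side note before going further: mpd5 builds at least one netgraph
node per PPP session (plus nodes for the L2TP tunnels themselves), so
with ~2000 subscribers the graph on these boxes is large. A quick,
rough way to watch its size on a live server, using ngctl(8) from the
base system (exact node counts per session depend on your mpd5
configuration):

  # ngctl list | wc -l   # roughly one line per netgraph node
  # ngctl list | head    # sample of node names and types

Watching how the node count tracks the session count may show whether
nodes are leaking as sessions come and go.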
> 3 x LNS running modified BSDRP: Dell PowerEdge 2950, 2 x Xeon E5320,
> 4GB RAM, igb quad-port NIC in LAGG, Quagga, MPD5, IPFW for host
> access control, NTPD, BSNMPD
> 1 x LNS running the latest FreeBSD 9 code: HP ProLiant DL380, 2 x
> Xeon X5465, 36GB RAM, em quad-port NIC in LAGG, BIRD, MPD5, IPFW for
> host access control, NTPD, BSNMPD
>
> The reason I built the fresh server on FreeBSD 9 is that I cannot
> easily save crash dumps for BSDRP. In short, the problem is this:
> servers with 10-50 clients will run indefinitely (as long as we have
> had them, which is about 1.5 years) without errors and serve clients
> fine, but any with over 300 clients stay online for only 4-10 days
> at most before crashing and rebooting. I have attached the crash
> file from the latest crash on the LNS running the latest FreeBSD 9
> code, but I am unsure what to do with it and where to look.
>
> When these devices crash they are often doing in excess of 200Mbps
> (anywhere between 200Mbps and 450Mbps), with very little load (3-4.5
> on the first 3, less than 2 on the fourth).
>
> Things I've done to attempt resolution:
>
> - Replaced bce network cards with em network cards. This produced
>   far fewer errors on the interfaces (many before, none now) and I
>   think made the machines stay up longer between reboots; before, it
>   would happen up to once a day.
> - Replaced em network cards with igb network cards. All this did was
>   lower the load and give us a little more time between reboots.
> - Tried an implementation using FreeBSD 10 (this lasted less than 4
>   hours under load before rebooting).
> - Replaced memory.
> - Increased memory on LNS4 to 36GB.
> - Various kernel rebuilds.
> - Tweaked various kernel settings. This appears to have helped a
>   little and given us more time between reboots.
> - Disabled IPv6.
> - Disabled IPFW.
> - Disabled BSNMPD.
> - Disabled Netflow.
> - Versions 5.6 and 5.7 of MPD5.
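Also worth doing before the next crash: since the panic is always in an
ng_queue thread, the netgraph item zones are a natural first suspect.
A minimal check, assuming a stock 9.x kernel where the UMA zones are
named "NetGraph items" and "NetGraph data items":

  # vmstat -z | grep -i netgraph
  # sysctl net.graph.maxalloc net.graph.maxdata

A non-zero FAIL column for those zones under load would point directly
at the limits raised by workaround 1 below.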
> Anyone able to help me work out what the crash dump means? It only
> happens on servers running MPD5 (e.g. the exact same boxes, running
> the exact same code and pushing 800Mbps+ of routing without MPD5, do
> not crash), and I can see the crash relates to netgraph, but I am
> unsure where to go from there...
>
> Thanks,
>
> Mark
>
> Relevant current settings:
>
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.bpf.zerocopy_enable=1
> net.inet.raw.maxdgram=16384
> net.inet.raw.recvspace=16384
> hw.intr_storm_threshold=64000
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.inet.ip.intr_queue_maxlen=10240
> net.inet.ip.redirect=0
> net.inet.ip.sourceroute=0
> net.inet.ip.rtexpire=2
> net.inet.ip.rtminexpire=2
> net.inet.ip.rtmaxcache=256
> net.inet.ip.accept_sourceroute=0
> net.inet.ip.process_options=0
> net.inet.icmp.log_redirect=0
> net.inet.icmp.drop_redirect=1
> net.inet.tcp.drop_synfin=1
> net.inet.tcp.blackhole=2
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.sendbuf_auto=1
> net.inet.tcp.recvbuf_auto=1
> net.inet.udp.recvspace=262144
> net.inet.udp.blackhole=0
> net.inet.udp.maxdgram=57344
> net.route.netisr_maxqlen=4096
> net.local.stream.recvspace=65536
> net.local.stream.sendspace=65536
> net.graph.maxdata=65536
> net.graph.maxalloc=65536
> net.graph.maxdgram=2096000
> net.graph.recvspace=2096000
> kern.ipc.somaxconn=32768
> kern.ipc.nmbclusters=524288
> kern.ipc.maxsockbuf=26214400
> kern.ipc.shmmax="2147483648"
> kern.ipc.nmbjumbop="53200"
> kern.ipc.maxpipekva="536870912"
> kern.random.sys.harvest.ethernet="0"
> kern.random.sys.harvest.interrupt="0"
> vm.kmem_size="4096M"      # Only on the box with over 12GB RAM; otherwise 2G.
> vm.kmem_size_max="8192M"  # Only on the box with over 12GB RAM.
> hw.igb.rxd="4096"
> hw.igb.txd="4096"
> hw.em.rxd="4096"
> hw.em.txd="4096"
> hw.igb.max_interrupt_rate="32000"
> hw.igb.rx_process_limit="4096"
> hw.em.rx_process_limit="500"
> net.link.ifqmaxlen="20480"
> net.isr.dispatch="direct"
> net.isr.direct_force="1"
> net.isr.direct="1"
> net.isr.maxthreads="8"
> net.isr.numthreads="4"
> net.isr.bindthreads="1"
> net.isr.maxqlimit="20480"
> net.isr.defaultqlimit="8192"

The following workarounds have worked for some people. They may not
solve your problem, but they are worth a try:

1. Increase the netgraph limits:

   net.graph.maxdata=262140    # /boot/loader.conf
   net.graph.maxalloc=262140   # /boot/loader.conf

2. Remove the FLOWTABLE kernel option.

It would also help if you put your kernel and core dump somewhere for
download so we can have a closer look at the panic trace.

-- 
Best regards.
Hooman Fazaeli
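P.S. If you want a first look at the panic yourself while you arrange
the upload, kgdb(1) can open the dump directly. A minimal sketch,
assuming the dump was saved as /var/crash/vmcore.0 and the matching
kernel (or its kernel.symbols file from the same build) is under
/boot/kernel:

  # kgdb /boot/kernel/kernel /var/crash/vmcore.0
  (kgdb) bt            # backtrace of the thread that panicked
  (kgdb) info threads  # the other threads, e.g. the ng_queue workers

The backtrace of the panicking thread is usually the most useful thing
to post alongside the dump itself.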