From owner-freebsd-net@FreeBSD.ORG Sun Jun 15 14:39:33 2014
Message-ID: <539DB018.5020702@gmail.com>
Date: Sun, 15 Jun 2014 19:09:20 +0430
From: Hooman Fazaeli
To: Mark van der Meulen
Cc: freebsd-net@freebsd.org, freebsd-bugs@freebsd.org
Subject: Re: FreeBSD 9 w/ MPD5 crashes as LNS with 300+ tunnels. Netgraph issue?

On 6/15/2014 3:39 PM, Mark van der Meulen wrote:
> Hi List,
>
> I'm wondering if anyone can help me with this problem, or at least
> point me in the direction of where to start looking. I have FreeBSD
> 9 based servers which are crashing every 4-10 days and producing
> crash dumps similar to this one: http://pastebin.com/F82Jc08C
>
> All crash dumps seem to involve the netgraph code, and the current
> process is always ng_queueX.
>
> In summary, we have 4 x FreeBSD servers running as LNS (MPD5) for
> around 2000 subscribers. Three of the servers run a modified version
> of BSDRP; the fourth runs a FreeBSD 9 install with what I thought
> was the latest stable kernel source, because I fetched it from
> stable/9, although it shows up as 9.3-BETA in uname. The linked
> crash dump is from that server.
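A side note before going further: mpd5 builds at least one netgraph
node per PPP session (plus nodes for the L2TP tunnels themselves), so
with ~2000 subscribers the graph on these boxes is large. A quick,
rough way to watch its size on a live server, using ngctl(8) from the
base system (exact node counts per session depend on your mpd5
configuration):

  # ngctl list | wc -l   # roughly one line per netgraph node
  # ngctl list | head    # sample of node names and types

Watching how the node count tracks the session count may show whether
nodes are leaking as sessions come and go.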
> 3 x LNS running modified BSDRP: Dell PowerEdge 2950, 2 x Xeon E5320,
> 4GB RAM, igb quad-port NIC in LAGG, Quagga, MPD5, IPFW for host
> access control, NTPD, BSNMPD
> 1 x LNS running the latest FreeBSD 9 code: HP ProLiant DL380, 2 x
> Xeon X5465, 36GB RAM, em quad-port NIC in LAGG, BIRD, MPD5, IPFW for
> host access control, NTPD, BSNMPD
>
> The reason I built the fresh server on FreeBSD 9 is that I cannot
> easily save crash dumps for BSDRP. In short, the problem is this:
> servers with 10-50 clients will run indefinitely (as long as we have
> had them, which is about 1.5 years) without errors and serve clients
> fine, but any with over 300 clients stay online for only 4-10 days
> at most before crashing and rebooting. I have attached the crash
> file from the latest crash on the LNS running the latest FreeBSD 9
> code, but I am unsure what to do with it and where to look.
>
> When these devices crash they are often doing in excess of 200Mbps
> (anywhere between 200Mbps and 450Mbps), with very little load (3-4.5
> on the first 3, less than 2 on the fourth).
>
> Things I've done to attempt resolution:
>
> - Replaced bce network cards with em network cards. This produced
>   far fewer errors on the interfaces (many before, none now) and I
>   think made the machines stay up longer between reboots; before, it
>   would happen up to once a day.
> - Replaced em network cards with igb network cards. All this did was
>   lower the load and give us a little more time between reboots.
> - Tried an implementation using FreeBSD 10 (this lasted less than 4
>   hours under load before rebooting).
> - Replaced memory.
> - Increased memory on LNS4 to 36GB.
> - Various kernel rebuilds.
> - Tweaked various kernel settings. This appears to have helped a
>   little and given us more time between reboots.
> - Disabled IPv6.
> - Disabled IPFW.
> - Disabled BSNMPD.
> - Disabled Netflow.
> - Versions 5.6 and 5.7 of MPD5.
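Also worth doing before the next crash: since the panic is always in an
ng_queue thread, the netgraph item zones are a natural first suspect.
A minimal check, assuming a stock 9.x kernel where the UMA zones are
named "NetGraph items" and "NetGraph data items":

  # vmstat -z | grep -i netgraph
  # sysctl net.graph.maxalloc net.graph.maxdata

A non-zero FAIL column for those zones under load would point directly
at the limits raised by workaround 1 below.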
> Anyone able to help me work out what the crash dump means? It only
> happens on servers running MPD5 (e.g. the exact same boxes, running
> the exact same code and pushing 800Mbps+ of routing without MPD5, do
> not crash), and I can see the crash relates to netgraph, but I am
> unsure where to go from there...
>
> Thanks,
>
> Mark
>
> Relevant current settings:
>
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.bpf.zerocopy_enable=1
> net.inet.raw.maxdgram=16384
> net.inet.raw.recvspace=16384
> hw.intr_storm_threshold=64000
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.inet.ip.intr_queue_maxlen=10240
> net.inet.ip.redirect=0
> net.inet.ip.sourceroute=0
> net.inet.ip.rtexpire=2
> net.inet.ip.rtminexpire=2
> net.inet.ip.rtmaxcache=256
> net.inet.ip.accept_sourceroute=0
> net.inet.ip.process_options=0
> net.inet.icmp.log_redirect=0
> net.inet.icmp.drop_redirect=1
> net.inet.tcp.drop_synfin=1
> net.inet.tcp.blackhole=2
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.sendbuf_auto=1
> net.inet.tcp.recvbuf_auto=1
> net.inet.udp.recvspace=262144
> net.inet.udp.blackhole=0
> net.inet.udp.maxdgram=57344
> net.route.netisr_maxqlen=4096
> net.local.stream.recvspace=65536
> net.local.stream.sendspace=65536
> net.graph.maxdata=65536
> net.graph.maxalloc=65536
> net.graph.maxdgram=2096000
> net.graph.recvspace=2096000
> kern.ipc.somaxconn=32768
> kern.ipc.nmbclusters=524288
> kern.ipc.maxsockbuf=26214400
> kern.ipc.shmmax="2147483648"
> kern.ipc.nmbjumbop="53200"
> kern.ipc.maxpipekva="536870912"
> kern.random.sys.harvest.ethernet="0"
> kern.random.sys.harvest.interrupt="0"
> vm.kmem_size="4096M"      # Only on the box with over 12GB RAM; otherwise 2G.
> vm.kmem_size_max="8192M"  # Only on the box with over 12GB RAM.
> hw.igb.rxd="4096"
> hw.igb.txd="4096"
> hw.em.rxd="4096"
> hw.em.txd="4096"
> hw.igb.max_interrupt_rate="32000"
> hw.igb.rx_process_limit="4096"
> hw.em.rx_process_limit="500"
> net.link.ifqmaxlen="20480"
> net.isr.dispatch="direct"
> net.isr.direct_force="1"
> net.isr.direct="1"
> net.isr.maxthreads="8"
> net.isr.numthreads="4"
> net.isr.bindthreads="1"
> net.isr.maxqlimit="20480"
> net.isr.defaultqlimit="8192"

The following workarounds have worked for some people. They may not
solve your problem, but they are worth a try:

1. Increase the netgraph limits:

   net.graph.maxdata=262140    # /boot/loader.conf
   net.graph.maxalloc=262140   # /boot/loader.conf

2. Remove the FLOWTABLE kernel option.

It would also help if you put your kernel and core dump somewhere for
download so we can have a closer look at the panic trace.

-- 
Best regards.
Hooman Fazaeli
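P.S. If you want a first look at the panic yourself while you arrange
the upload, kgdb(1) can open the dump directly. A minimal sketch,
assuming the dump was saved as /var/crash/vmcore.0 and the matching
kernel (or its kernel.symbols file from the same build) is under
/boot/kernel:

  # kgdb /boot/kernel/kernel /var/crash/vmcore.0
  (kgdb) bt            # backtrace of the thread that panicked
  (kgdb) info threads  # the other threads, e.g. the ng_queue workers

The backtrace of the panicking thread is usually the most useful thing
to post alongside the dump itself.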