Date:      Mon, 16 Jun 2014 00:01:46 +0800
From:      Julian Elischer <julian@freebsd.org>
To:        freebsd-net@freebsd.org
Subject:   Re: FreeBSD 9 w/ MPD5 crashes as LNS with 300+ tunnels. Netgraph issue?
Message-ID:  <539DC36A.4030309@freebsd.org>
In-Reply-To: <CFC3BC24.1CB4A%mark@fivenynes.com>
References:  <CFC3BC24.1CB4A%mark@fivenynes.com>

On 6/15/14, 7:09 PM, Mark van der Meulen wrote:
> Hi List,
>
> I'm wondering if anyone can help me with this problem or at least help
> point me in the direction of where to start looking? I have FreeBSD 9
> based servers which are crashing every 4-10 days and producing crash dumps
> similar to this one: http://pastebin.com/F82Jc08C
>
> All crash dumps seem to involve the net graph code and the current process
> is always ng_queueX.
I looked at your trace. I see that you have good access to gdb.
Can you show the exact C statements having the problems, and the
values of the variables concerned?
  Since I don't have your sources I don't know if line 3587 on your
system matches line 3587 on fxr.watson.org:
http://fxr.watson.org/fxr/source/netgraph/ng_base.c?v=FREEBSD9#L3587
  If it does, then can you look to see why it got into that clause?
There are six different subconditions that could have made it go there.
It would be instructive to know which one triggered.


That line in my sources is a TRAP_ERROR(), which is defined to nothing,
so it would be nice to see exactly where your gdb says it is.
If it IS there and you have a remote (serial) gdb set up, you could try
doing what the comment in the sources says:


/* Set this to kdb_enter("X") to catch all errors as they occur */
#ifndef TRAP_ERROR
#define TRAP_ERROR()
#endif
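
To make the mechanism concrete, here is a minimal userland sketch of how
that #ifndef guard behaves once TRAP_ERROR is defined before it. It is
only an illustration: fake_kdb_enter(), trap_count and check_item() are
hypothetical stand-ins I made up so the pattern compiles outside the
kernel. In a real kernel build you would point the macro at kdb_enter()
instead, and you should check the prototype in your own sys/kdb.h first,
since it may not match the one-argument form shown in the comment.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel debugger entry point: it counts
 * calls and logs, rather than dropping into a debugger. */
static int trap_count = 0;

static void
fake_kdb_enter(const char *msg)
{
	trap_count++;
	fprintf(stderr, "trap: %s\n", msg);
}

/* Defining the macro first means the empty default below is skipped. */
#define TRAP_ERROR() fake_kdb_enter("X")

/* Same guard as in the sources quoted above. */
#ifndef TRAP_ERROR
#define TRAP_ERROR()
#endif

/* Hypothetical error path: a failed check fires the trap hook and then
 * returns an errno-style value, as an error clause would. */
static int
check_item(const void *item)
{
	if (item == NULL) {
		TRAP_ERROR();
		return (22);	/* EINVAL */
	}
	return (0);
}
```

With the hook in place every error clause that uses TRAP_ERROR() stops at
the point of failure, so you would see which subcondition fired rather
than only the later crash.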

If you do NOT have serial set up, you could run the server on a bhyve
instance on a FreeBSD 10 system and set that up for serial debugging.
But that may be quite a learning curve (I've never fully done that
myself yet).

You don't say how similar the traces are, or how reproducible. Have
you seen exactly the same trace more than once?


Julian





> In summary, we have 4 x FreeBSD server running as LNS(MPD5) for around
> 2000 subscribers with 3 of the servers running a modified version of
> BSDRP, the fourth running a FreeBSD 9 install with what I thought was the
> latest stable source for the kernel because I fetched it from stable/9
> however it shows up as 9.3-BETA in uname(the linked crash dump is from
> that server).
>
> 3 x LNS running modified BSDRP: DELL PowerEdge 2950, 2 x Xeon E5320, 4GB
> RAM, igb Quad Port NIC in LAGG, Quagga, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
> 1 x LNS running latest FreeBSD 9 code: HP ProLiant DL380, 2 x Xeon X5465,
> 36GB RAM, em Quad Port NIC in LAGG, BIRD, MPD5, IPFW for Host Access
> Control, NTPD, BSNMPD
>
> The reason I built the fresh server on FreeBSD 9 is because I cannot save
> crash dumps for BSDRP easily. In short the problem is this - servers with
> 10-50 clients will run indefinitely(as long as we have had them, which is
> probably about 1.5 years) without errors and serve clients fine, however
> any with over 300 clients appear to only stay online for 4-10 days maximum
> before crashing and rebooting. I have attached the crash file from the
> latest crash on the LNS running the latest FreeBSD 9 code however unsure
> what to do with it and where to look?

>
> When these devices crash they are often doing in excess of
> 200Mbps(anywhere between 200Mbps and 450Mbps), very little load(3-4.5 on
> the first 3, less than 2 on the fourth).
>
> Things I've done to attempt resolution:
>
> - Replaced bce network cards with em network cards. This produced far less
> errors on the interfaces(was many before, now none) and I think caused the
> machines to stay up longer between reboots as before it would happen up to
> once a day.
> - Replaced em network cards with igb network cards. All this did was lower
> load and give us a little more time between reboots.
> - Tried an implementation using FreeBSD 10(this lasted less than 4 hours
> before reboots when under load)
> - Replaced memory
> - Increased memory on LNS4 to 36GB.
> - Various kernel rebuilds
> - Tweaked various kernel settings. This appears to have helped a little
> and given us more time between reboots.
> - Disabled IPv6
> - Disabled IPFW
> - Disabled BSNMPD
> - Disabled Netflow
> - Versions 5.6 and 5.7 of MPD5
>
> Anyone able to help me work out what the crash dump means? It only happens
> on servers running MPD5 (eg. Exact same boxes, exact same code pushing
> 800Mbps+ of routing and no crashes) and I can see the crash relates to net
> graph, however unsure where to go from there :(
>
> Thanks,
>
> Mark
>
>
> Relevant Current Settings:
>
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.bpf.zerocopy_enable=1
> net.inet.raw.maxdgram=16384
> net.inet.raw.recvspace=16384
> hw.intr_storm_threshold=64000
> net.inet.ip.fastforwarding=1
> net.inet.ip.fw.default_to_accept=1
> net.inet.ip.intr_queue_maxlen=10240
> net.inet.ip.redirect=0
> net.inet.ip.sourceroute=0
> net.inet.ip.rtexpire=2
> net.inet.ip.rtminexpire=2
> net.inet.ip.rtmaxcache=256
> net.inet.ip.accept_sourceroute=0
> net.inet.ip.process_options=0
> net.inet.icmp.log_redirect=0
> net.inet.icmp.drop_redirect=1
> net.inet.tcp.drop_synfin=1
> net.inet.tcp.blackhole=2
> net.inet.tcp.sendbuf_max=16777216
> net.inet.tcp.recvbuf_max=16777216
> net.inet.tcp.sendbuf_auto=1
> net.inet.tcp.recvbuf_auto=1
> net.inet.udp.recvspace=262144
> net.inet.udp.blackhole=0
> net.inet.udp.maxdgram=57344
> net.route.netisr_maxqlen=4096
> net.local.stream.recvspace=65536
> net.local.stream.sendspace=65536
> net.graph.maxdata=65536
> net.graph.maxalloc=65536
> net.graph.maxdgram=2096000
> net.graph.recvspace=2096000
> kern.ipc.somaxconn=32768
> kern.ipc.nmbclusters=524288
> kern.ipc.maxsockbuf=26214400
> kern.ipc.shmmax="2147483648"
> kern.ipc.nmbjumbop="53200"
> kern.ipc.maxpipekva="536870912"
> kern.random.sys.harvest.ethernet="0"
> kern.random.sys.harvest.interrupt="0"
> vm.kmem_size="4096M" # Only on box with over 12G RAM. Otherwise 2G.
>
>
> vm.kmem_size_max="8192M" # Only on box with over 12G RAM.
> hw.igb.rxd="4096"
> hw.igb.txd="4096"
> hw.em.rxd="4096"
> hw.em.txd="4096"
> hw.igb.max_interrupt_rate="32000"
>
> hw.igb.rx_process_limit="4096"
> hw.em.rx_process_limit="500"
> net.link.ifqmaxlen="20480"
> net.isr.dispatch="direct"
> net.isr.direct_force="1"
> net.isr.direct="1"
> net.isr.maxthreads="8"
> net.isr.numthreads="4"
> net.isr.bindthreads="1"
> net.isr.maxqlimit="20480"
> net.isr.defaultqlimit="8192"
>
>
>
>
>
>   
>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
>



