Date: Sun, 15 Jun 2014 21:09:56 +1000
From: Mark van der Meulen <mark@fivenynes.com>
To: <freebsd-net@freebsd.org>
Cc: freebsd-bugs@freebsd.org
Subject: FreeBSD 9 w/ MPD5 crashes as LNS with 300+ tunnels. Netgraph issue?
Message-ID: <CFC3BC24.1CB4A%mark@fivenynes.com>
Hi List,

I'm wondering if anyone can help me with this problem, or at least point me in the direction of where to start looking. I have FreeBSD 9 based servers which are crashing every 4-10 days and producing crash dumps similar to this one: http://pastebin.com/F82Jc08C
All crash dumps seem to involve the netgraph code, and the current process is always ng_queueX.

In summary, we have 4 x FreeBSD servers running as LNS (MPD5) for around 2000 subscribers. Three of the servers run a modified version of BSDRP; the fourth runs a FreeBSD 9 install with what I thought was the latest stable kernel source, because I fetched it from stable/9, although it shows up as 9.3-BETA in uname. The linked crash dump is from that server.

3 x LNS running modified BSDRP: Dell PowerEdge 2950, 2 x Xeon E5320, 4GB RAM, igb quad-port NIC in LAGG, Quagga, MPD5, IPFW for host access control, NTPD, BSNMPD

1 x LNS running latest FreeBSD 9 code: HP ProLiant DL380, 2 x Xeon X5465, 36GB RAM, em quad-port NIC in LAGG, BIRD, MPD5, IPFW for host access control, NTPD, BSNMPD

The reason I built the fresh server on FreeBSD 9 is that I cannot easily save crash dumps with BSDRP.

In short, the problem is this: servers with 10-50 clients run indefinitely (as long as we have had them, which is about 1.5 years) without errors and serve clients fine, but any server with over 300 clients stays online for only 4-10 days at most before crashing and rebooting. I have attached the crash file from the latest crash on the LNS running the latest FreeBSD 9 code, but I am unsure what to do with it and where to look. When these devices crash they are often pushing in excess of 200Mbps (anywhere between 200Mbps and 450Mbps) with very little load (3-4.5 on the first three, less than 2 on the fourth).

Things I've done to attempt resolution:

- Replaced bce network cards with em network cards. This produced far fewer errors on the interfaces (many before, none now) and I think let the machines stay up longer between reboots; before, crashes happened up to once a day.
- Replaced em network cards with igb network cards. All this did was lower load and give us a little more time between reboots.
- Tried an implementation using FreeBSD 10 (this lasted less than 4 hours under load before rebooting).
- Replaced memory.
- Increased memory on LNS4 to 36GB.
- Various kernel rebuilds.
- Tweaked various kernel settings. This appears to have helped a little and given us more time between reboots.
- Disabled IPv6.
- Disabled IPFW.
- Disabled BSNMPD.
- Disabled Netflow.
- Tried versions 5.6 and 5.7 of MPD5.

Is anyone able to help me work out what the crash dump means? It only happens on servers running MPD5 (e.g. the exact same boxes with the exact same code, pushing 800Mbps+ of routing, do not crash), and I can see the crash relates to netgraph, but I am unsure where to go from there.
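If it helps, my rough plan for digging further into the dump is something along these lines (a sketch based on the kernel-debugging chapter of the developers' handbook; it assumes the dump was saved to /var/crash and that the matching kernel and symbol files are still in /boot/kernel):

    # generate a plain-text summary of the panic (backtrace, msgbuf, ps)
    crashinfo /var/crash/vmcore.0

    # or open the core interactively against the matching kernel
    kgdb /boot/kernel/kernel /var/crash/vmcore.0
    (kgdb) bt              # stack of the thread that panicked
    (kgdb) info threads    # look for the ng_queue worker threads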
Thanks,
Mark

Relevant Current Settings:

net.inet.ip.fastforwarding=1
net.inet.ip.fw.default_to_accept=1
net.bpf.zerocopy_enable=1
net.inet.raw.maxdgram=16384
net.inet.raw.recvspace=16384
hw.intr_storm_threshold=64000
net.inet.ip.intr_queue_maxlen=10240
net.inet.ip.redirect=0
net.inet.ip.sourceroute=0
net.inet.ip.rtexpire=2
net.inet.ip.rtminexpire=2
net.inet.ip.rtmaxcache=256
net.inet.ip.accept_sourceroute=0
net.inet.ip.process_options=0
net.inet.icmp.log_redirect=0
net.inet.icmp.drop_redirect=1
net.inet.tcp.drop_synfin=1
net.inet.tcp.blackhole=2
net.inet.tcp.sendbuf_max=16777216
net.inet.tcp.recvbuf_max=16777216
net.inet.tcp.sendbuf_auto=1
net.inet.tcp.recvbuf_auto=1
net.inet.udp.recvspace=262144
net.inet.udp.blackhole=0
net.inet.udp.maxdgram=57344
net.route.netisr_maxqlen=4096
net.local.stream.recvspace=65536
net.local.stream.sendspace=65536
net.graph.maxdata=65536
net.graph.maxalloc=65536
net.graph.maxdgram=2096000
net.graph.recvspace=2096000
kern.ipc.somaxconn=32768
kern.ipc.nmbclusters=524288
kern.ipc.maxsockbuf=26214400
kern.ipc.shmmax="2147483648"
kern.ipc.nmbjumbop="53200"
kern.ipc.maxpipekva="536870912"
kern.random.sys.harvest.ethernet="0"
kern.random.sys.harvest.interrupt="0"
vm.kmem_size="4096M"       # Only on box with over 12G RAM. Otherwise 2G.
vm.kmem_size_max="8192M"   # Only on box with over 12G RAM.
hw.igb.rxd="4096"
hw.igb.txd="4096"
hw.em.rxd="4096"
hw.em.txd="4096"
hw.igb.max_interrupt_rate="32000"
hw.igb.rx_process_limit="4096"
hw.em.rx_process_limit="500"
net.link.ifqmaxlen="20480"
net.isr.dispatch="direct"
net.isr.direct_force="1"
net.isr.direct="1"
net.isr.maxthreads="8"
net.isr.numthreads="4"
net.isr.bindthreads="1"
net.isr.maxqlimit="20480"
net.isr.defaultqlimit="8192"
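Since the panics always land in ng_queueX and I have already raised net.graph.maxalloc/maxdata, my plan between now and the next crash is to keep an eye on the netgraph allocations. A rough sketch of what I intend to run (the zone names are as vmstat -z prints them on 9.x; my understanding is that net.graph.maxalloc and net.graph.maxdata are loader tunables, so they belong in /boot/loader.conf rather than sysctl.conf):

    # NetGraph item / data item UMA zones: usage, limits and failures
    vmstat -z | grep -i netgraph

    # mbuf and cluster usage, in case the queues are backing up
    netstat -m

    # number of netgraph nodes mpd5 currently has (grows with tunnel count)
    ngctl list | wc -l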