Date: Sun, 2 Aug 2009 09:54:40 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: freebsd-net@freebsd.org, alexpalias-bsdnet@yahoo.com
Subject: Re: em driver input errors
Message-ID: <210006.36085.qm@web63904.mail.re1.yahoo.com>
In-Reply-To: <11420.28890.qm@web56404.mail.re3.yahoo.com>

--- On Sat, 8/1/09, alexpalias-bsdnet@yahoo.com <alexpalias-bsdnet@yahoo.com> wrote:
> From: alexpalias-bsdnet@yahoo.com <alexpalias-bsdnet@yahoo.com>
> Subject: em driver input errors
> To: freebsd-net@freebsd.org
> Date: Saturday, August 1, 2009, 9:05 AM
> Good day
>
> I'm running a FreeBSD 7.2 router and I am seeing a lot of
> input errors on one of the em interfaces (em0), coupled with
> (at approximately the same times) much fewer errors on em1
> and em2. Monitoring is done with SNMP from another
> machine, and the CPU load as reported via SNMP is mostly
> below 30%, with a couple of spikes up to 35%.
>
> Software description:
>
> - FreeBSD 7.2-RELEASE-p2, amd64
> - bsnmpd with modules: hostres and (from ports) snmp_ucd
> - quagga 0.99.12 (running only zebra and bgpd)
> - netgraph (ng_ether and ng_netflow)
>
> Hardware description:
>
> - Dell machine, dual Xeon 3.20 GHz, 4 GB RAM
> - 2 x built-in gigabit interfaces (em0, em1)
> - 1 x dual-port gigabit interface, PCI-X (em2, em3) [see
> pciconf near the end]
>
>
> The machine receives the global routing table ("netstat -nr
> | wc -l" gives 289115 currently).
>
> All of the em interfaces are just configured "up", with
> various vlan interfaces on them. Note that I use "kpps" to
> mean "thousands of packets per second", sorry if that's the
> wrong shorthand.
>
> - em0 sees a traffic of 10...22 kpps in, and 15...35 kpps
> out. In bits, it's 30...120Mbits/s in, and
> 100...210Mbits/s out. Vlans configured are vlan100 and
> vlan200, and most of the traffic is on vlan100 (vlan200 sees
> 4kpps in / 0.5kpps out maximum, with the average at about
> one third of this). em0 is the external interface, and its
> traffic corresponds to the sum of traffic through em1 and
> em2
>
> - em1 has 5 vlans, and sees about 22kpps in / 11kpps out
> (maximum)
>
> - em2 has a single VLAN, and sees about 4...13kpps both in
> and out (almost equal in/out during most of the day)
>
> - em3 is a backup interface, with 2 VLANS, and is the only
> one which has seen no errors.
>
> Only the vlans on em0 are analyzed by ng_netflow, and the
> errors I'm seeing have started appearing days before
> netgraph was even loaded in the kernel.
>
> Tuning done:
>
> /boot/loader.conf:
> hw.em.rxd=4096
> hw.em.txd=4096
>
> Without the above we were seeing way more errors; now they
> are reduced, but still come in bursts of over 1000 errors on
> em0.
>
> /etc/sysctl.conf:
> net.inet.ip.fastforwarding=1
> dev.em.0.rx_processing_limit=300
> dev.em.1.rx_processing_limit=300
> dev.em.2.rx_processing_limit=300
> dev.em.3.rx_processing_limit=300
>
> Still seeing errors. After some searching of the mailing lists,
> we also added:
>
> # the four lines below are repeated for em1, em2, em3
> dev.em.0.rx_int_delay=0
> dev.em.0.rx_abs_int_delay=0
> dev.em.0.tx_int_delay=0
> dev.em.0.tx_abs_int_delay=0
>
> Still getting errors, so I also added:
>
> net.inet.ip.intr_queue_maxlen=4096
> net.route.netisr_maxqlen=1024
>
> and
>
> kern.ipc.nmbclusters=655360
>
>
> Also tried with rx_processing_limit set to -1 on all em
> interfaces, still getting errors.
>
> Looking at the shape of the error and packet graphs, there
> seems to be a correlation between the number of packets per
> second on em0 and the height of the error "spikes" on the
> error graph. These spikes are spread throughout the day,
> with spaces (zones with no errors) of various lengths (10
> minutes ... 2 hours spaces within the last 24 hours), but
> sometimes there are errors even in the lowest kpps times of
> the day.
>
> em0 and em1 error times are correlated, with all errors on
> the graph for em0 having a smaller corresponding error spike
> on em1 at the same time, and sometimes an error spike on
> em2.
>
> The old router was seeing about the same traffic, and had
> em0, em1, re0 and re1 network cards, and was only seeing
> errors on the em cards. It was running
> 7.2-PRERELEASE/i386
>
>
> Any suggestions would be greatly appreciated. Please note
> that this is a live router, and I can't reboot it (unless
> absolutely necessary). Tuning that can be applied without
> rebooting will be tried first.
>
> Here are some more details:
>
> Trimmed output of netstat -ni (sorry if there are line
> breaks):
> Name  Mtu   Network   Address            Ipkts        Ierrs     Opkts        Oerrs  Coll
> em0   1500  <Link#1>  00:14:22:xx:xx:xx  19744458839  15494721  24284439443  0      0
> em1   1500  <Link#2>  00:14:22:xx:xx:xx  12832245469  123181    10105031790  0      0
> em2   1500  <Link#3>  00:04:23:xx:xx:xx  12082552403  10964     10339416865  0      0
> em3   1500  <Link#4>  00:04:23:xx:xx:xx  79912337     0         48178737     0      0
>
> Relevant part of pciconf -vl:
>
> em0@pci0:6:7:0: class=0x020000 card=0x016d1028 chip=0x10768086 rev=0x05 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82541EI Gigabit Ethernet Controller'
>     class      = network
>     subclass   = ethernet
> em1@pci0:7:8:0: class=0x020000 card=0x016d1028 chip=0x10768086 rev=0x05 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82541EI Gigabit Ethernet Controller'
>     class      = network
>     subclass   = ethernet
> em2@pci0:9:4:0: class=0x020000 card=0x10128086 chip=0x10108086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82546EB Dual Port Gigabit Ethernet Controller (Copper)'
>     class      = network
>     subclass   = ethernet
> em3@pci0:9:4:1: class=0x020000 card=0x10128086 chip=0x10108086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82546EB Dual Port Gigabit Ethernet Controller (Copper)'
>     class      = network
>     subclass   = ethernet
>
> Kernel messages after sysctl dev.em.0.stats=1:
> (note that I've removed the lines which only showed zeros
> in the second and third outputs)
>
> em0: Excessive collisions = 0
> em0: Sequence errors = 0
> em0: Defer count = 0
> em0: Missed Packets = 15435312
> em0: Receive No Buffers = 16446113
> em0: Receive Length Errors = 0
> em0: Receive errors = 1
> em0: Crc errors = 2
> em0: Alignment errors = 0
> em0: Collision/Carrier extension errors = 0
> em0: RX overruns = 96826
> em0: watchdog timeouts = 0
> em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> em0: XON Rcvd = 0
> em0: XON Xmtd = 0
> em0: XOFF Rcvd = 0
> em0: XOFF Xmtd = 0
> em0: Good Packets Rcvd = 19002068797
> em0: Good Packets Xmtd = 23168462599
> em0: TSO Contexts Xmtd = 0
> em0: TSO Contexts Failed = 0
>
> [later]
> em0: Excessive collisions = 0
> em0: Missed Packets = 15459111
> em0: Receive No Buffers = 16447082
> em0: Receive errors = 1
> em0: Crc errors = 2
> em0: RX overruns = 96835
> em0: Good Packets Rcvd = 19165047284
> em0: Good Packets Xmtd = 23386976960
>
> [later]
> em0: Excessive collisions = 0
> em0: Missed Packets = 15470583
> em0: Receive No Buffers = 16447686
> em0: Receive errors = 1
> em0: Crc errors = 2
> em0: RX overruns = 96840
> em0: Good Packets Rcvd = 19255466068
> em0: Good Packets Xmtd = 23519004546
>
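For scale, the three stats dumps quoted above can be turned into
per-interval drop ratios. This is only a rough sketch in Python: the
numbers are copied from the quoted output, the dict keys and the
function name are made up for illustration, and it assumes that the
packets offered to the NIC equal "Good Packets Rcvd" plus
"Missed Packets".

```python
# Counters copied from the three "dev.em.0.stats=1" dumps quoted above.
# The keys and the function name are illustrative, not driver identifiers.
SNAPSHOTS = [
    {"missed": 15435312, "good_rcvd": 19002068797},
    {"missed": 15459111, "good_rcvd": 19165047284},
    {"missed": 15470583, "good_rcvd": 19255466068},
]


def interval_loss(prev, curr):
    """Return (missed, received, drop_fraction) between two snapshots."""
    missed = curr["missed"] - prev["missed"]
    received = curr["good_rcvd"] - prev["good_rcvd"]
    # Assumption: offered packets = good packets + missed packets.
    offered = received + missed
    fraction = missed / offered if offered else 0.0
    return missed, received, fraction


# Compare each dump with the one that follows it.
for prev, curr in zip(SNAPSHOTS, SNAPSHOTS[1:]):
    missed, received, fraction = interval_loss(prev, curr)
    print(f"missed={missed} received={received} drop={fraction:.4%}")
```

The counters are cumulative, so only the differences between dumps
describe current behaviour. The dumps carry no timestamps, so this
gives a ratio per interval, not a packets-per-second rate.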
Note that "most" PCI-X motherboards wire the onboard NICs at 32 bits and
33 MHz, mainly because it's apparently easier to do so. It's likely that
your add-on card is running at 64 bits and 133 MHz.

32 bits/33 MHz isn't really fast enough to manage gigabit traffic flows,
as its max burst rate is only about 1 Gb/s (32 bits x 33 MHz), so you
really can't use those ports for any sort of primary traffic flow. Check
with your motherboard manufacturer, as they usually don't advertise it.

Barney
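
The bus figures in the reply above come down to simple arithmetic: bus
width times clock rate. A minimal sketch in Python, using the nominal 33
and 133 MHz values from the message; the function name is hypothetical,
and the result is a theoretical ceiling that ignores arbitration and
other protocol overhead.

```python
def peak_bus_mbps(width_bits, clock_mhz):
    """Theoretical peak of a parallel PCI/PCI-X bus in Mbit/s.

    Assumes one transfer of `width_bits` per clock cycle and no protocol
    overhead, so real sustained throughput is lower than this figure.
    """
    return width_bits * clock_mhz


# Nominal figures taken from the message above.
onboard = peak_bus_mbps(32, 33)    # 32-bit / 33 MHz wiring
addon = peak_bus_mbps(64, 133)     # 64-bit / 133 MHz PCI-X slot

print(f"32-bit/33 MHz : {onboard} Mbit/s")
print(f"64-bit/133 MHz: {addon} Mbit/s")
print(f"ratio         : {addon / onboard:.1f}x")
```

Receive and transmit DMA cross the same bus, so an interface's inbound
and outbound rates both count against this one ceiling.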
