From: Artis Caune
To: alexpalias-bsdnet@yahoo.com
Cc: freebsd-net@freebsd.org
Date: Fri, 4 Sep 2009 16:01:10 +0300
Subject: Re: em driver input errors
Message-ID: <9e20d71e0909040601s100688c2m7d7f73eb187f4809@mail.gmail.com>
In-Reply-To: <11420.28890.qm@web56404.mail.re3.yahoo.com>

2009/8/1 :
> Good day
>
> I'm running a FreeBSD 7.2 router and I am seeing a lot of input errors on one of the em interfaces (em0), coupled with (at approximately the same times) much fewer errors on em1 and em2. Monitoring is done with SNMP from another machine, and the CPU load as reported via SNMP is mostly below 30%, with a couple of spikes up to 35%.
>
> Software description:
>
> - FreeBSD 7.2-RELEASE-p2, amd64
> - bsnmpd with modules: hostres and (from ports) snmp_ucd
> - quagga 0.99.12 (running only zebra and bgpd)
> - netgraph (ng_ether and ng_netflow)
>
> Hardware description:
>
> - Dell machine, dual Xeon 3.20 GHz, 4 GB RAM
> - 2 x built-in gigabit interfaces (em0, em1)
> - 1 x dual-port gigabit interface, PCI-X (em2, em3) [see pciconf near the end]
>
>
> The machine receives the global routing table ("netstat -nr | wc -l" gives 289115 currently).
>
> All of the em interfaces are just configured "up", with various vlan interfaces on them. Note that I use "kpps" to mean "thousands of packets per second", sorry if that's the wrong shorthand.
>
> - em0 sees traffic of 10...22 kpps in, and 15...35 kpps out. In bits, it's 30...120 Mbit/s in, and 100...210 Mbit/s out. Vlans configured are vlan100 and vlan200, and most of the traffic is on vlan100 (vlan200 sees 4 kpps in / 0.5 kpps out maximum, with the average at about one third of this). em0 is the external interface, and its traffic corresponds to the sum of traffic through em1 and em2
>
> - em1 has 5 vlans, and sees about 22 kpps in / 11 kpps out (maximum)
>
> - em2 has a single VLAN, and sees about 4...13 kpps both in and out (almost equal in/out during most of the day)
>
> - em3 is a backup interface, with 2 VLANs, and is the only one which has seen no errors.
>
> Only the vlans on em0 are analyzed by ng_netflow, and the errors I'm seeing started appearing days before netgraph was even loaded in the kernel.
>
> Tuning done:
>
> /boot/loader.conf:
> hw.em.rxd=4096
> hw.em.txd=4096
>
> Without the above we were seeing way more errors, now they are reduced, but still come in bursts of over 1000 errors on em0.
>
> /etc/sysctl.conf:
> net.inet.ip.fastforwarding=1
> dev.em.0.rx_processing_limit=300
> dev.em.1.rx_processing_limit=300
> dev.em.2.rx_processing_limit=300
> dev.em.3.rx_processing_limit=300
>
> Still seeing errors, after some searching the mailing lists we also added:
>
> # the four lines below are repeated for em1, em2, em3
> dev.em.0.rx_int_delay=0
> dev.em.0.rx_abs_int_delay=0
> dev.em.0.tx_int_delay=0
> dev.em.0.tx_abs_int_delay=0
>
> Still getting errors, so I also added:
>
> net.inet.ip.intr_queue_maxlen=4096
> net.route.netisr_maxqlen=1024
>
> and
>
> kern.ipc.nmbclusters=655360
>
>
> Also tried with rx_processing_limit set to -1 on all em interfaces, still getting errors.
>
> Looking at the shape of the error and packet graphs, there seems to be a correlation between the number of packets per second on em0 and the height of the error "spikes" on the error graph. These spikes are spread throughout the day, with spaces (zones with no errors) of various lengths (10 minutes ... 2 hours spaces within the last 24 hours), but sometimes there are errors even in the lowest kpps times of the day.
>
> em0 and em1 error times are correlated, with all errors on the graph for em0 having a smaller corresponding error spike on em1 at the same time, and sometimes an error spike on em2.
>
> The old router was seeing about the same traffic, and had em0, em1, re0 and re1 network cards, and was only seeing errors on the em cards. It was running 7.2-PRERELEASE/i386
>
>
> Any suggestions would be greatly appreciated. Please note that this is a live router, and I can't reboot it (unless absolutely necessary). Tuning that can be applied without rebooting will be tried first.

Is this still a problem?

You didn't mention whether you are using pf or another firewall.
I had a similar problem with two boxes replicating zfs pools, where I noticed input errors. After some investigation it turned out to be pf overhead, even though I was skipping pf on the interfaces that carry the zfs send/recv traffic.

With pf enabled (and skip set) I can copy at 50-80 MB/s with 50-80 Kpps and 0-100+ input drops per second.
With pf disabled I can copy at a steady 102 or 93 MB/s with 110-131 Kpps and only a few drops (because one CPU is almost fully used).

-- 
Artis Caune

Everything should be made as simple as possible, but not simpler.
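
P.S. If you want to compare "firewall on" against "firewall off" on your router, one rough way is to sample the cumulative input packet and input error counters a few times and turn the deltas into per-second rates. Below is a minimal sketch of that arithmetic in Python. It assumes you collect the counters yourself (from netstat or SNMP, for example); the Sample type, the function names and all numbers in it are made up for illustration and are not measurements from my boxes or yours.

```python
from typing import List, NamedTuple, Tuple


class Sample(NamedTuple):
    """One reading of an interface's cumulative counters (hypothetical type)."""
    t: float     # time of the reading, in seconds
    ipkts: int   # cumulative input packets
    ierrs: int   # cumulative input errors


def interval_rates(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (packets/s, errors/s) for each pair of consecutive samples."""
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip duplicate or out-of-order readings
        rates.append(((cur.ipkts - prev.ipkts) / dt,
                      (cur.ierrs - prev.ierrs) / dt))
    return rates


def summarize(samples: List[Sample]) -> Tuple[float, float]:
    """Return the mean (packets/s, errors/s) over all intervals."""
    rates = interval_rates(samples)
    if not rates:
        return (0.0, 0.0)
    n = len(rates)
    return (sum(p for p, _ in rates) / n, sum(e for _, e in rates) / n)


# Invented readings, 10 seconds apart, for two test runs.
pf_on = [Sample(0, 0, 0), Sample(10, 650_000, 400), Sample(20, 1_300_000, 900)]
pf_off = [Sample(0, 0, 0), Sample(10, 1_200_000, 20), Sample(20, 2_400_000, 30)]

if __name__ == "__main__":
    for label, run in (("pf on", pf_on), ("pf off", pf_off)):
        pps, eps = summarize(run)
        print(f"{label}: {pps:.0f} packets/s, {eps:.1f} input errors/s")
```

Looking at errors per second next to packets per second, rather than at raw counters, makes it easier to see whether the drops follow the packet rate or the firewall setting.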