From owner-freebsd-net@FreeBSD.ORG Tue Feb 23 17:24:52 2010
Date: Tue, 23 Feb 2010 10:24:50 -0700
Message-ID: <529374128DC1B04D9D037911B8E8F05301C17A5D@Exchange26.EDU.epsb.ca>
In-Reply-To: <4B83D021.7020201@killfile.pl>
From: "Kirk Davis"
To: "Maciej Wierzbicki"
Cc: freebsd-net@freebsd.org, jfvogel@gmail.com
Subject: RE: Intel em0: watchdog timeout
List-Id: Networking and TCP/IP with FreeBSD

> From: owner-freebsd-net@freebsd.org
>
> Jack Vogel wrote on 2010-02-22 20:13:
>
> > 7.2 seems to be a stable base OS and driver, 8 is better in some
> > respects, but has not been without its reported problems. I leave
> > the choice to you.
>
> Let me sneak into this thread as I am also suffering from em watchdog
> timeouts. In my case there is a 7.2-release doing HAProxy LB for several
> webservers. But as far as I can tell, the watchdogs are not related to
> traffic rate: I can have low traffic rate near 50Mbps having timeouts
> every minute and I can have 200-300Mbps with long periods of time
> without timeouts, there is no visible regularity in that. em is built
> into kernel. Typical watchdog timeout log:

This doesn't sound good. I was just about to upgrade the box to 7.2 and
see if the problem goes away with the newer driver. :-/

I have come to the same conclusion about traffic rate. Since the watchdog
timeouts started, I have seen the problem at peak times but also in the
middle of the night when our traffic is very low.
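
One rough way to check whether the resets line up with time of day is to count them per hour from the messages log. The sketch below is only an illustration, assuming the log lines look like the "watchdog timeout -- resetting" lines quoted below; the function name and the sample input are made up for the example and are not part of the driver or any FreeBSD tool.

```python
import re
from collections import Counter

# Matches syslog lines of the form:
#   Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
# The trailing "-- resetting" keeps the pattern from also matching the
# "watchdog timeouts = N" line printed by the driver's stats dump.
WATCHDOG_RE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"(\w+):\s+watchdog timeout -- resetting"
)


def watchdog_resets_per_hour(lines):
    """Count watchdog resets per (interface, month, day, hour)."""
    counts = Counter()
    for line in lines:
        m = WATCHDOG_RE.match(line.strip())
        if m:
            month, day, hour, ifname = m.groups()
            counts[(ifname, month, int(day), int(hour))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample input; in practice this would be read from the
    # system log file.
    sample = [
        "Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting",
        "Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN",
        "Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP",
        "Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting",
    ]
    for (ifname, month, day, hour), n in sorted(watchdog_resets_per_hour(sample).items()):
        print(f"{ifname} {month} {day:2d} {hour:02d}:00  {n} reset(s)")
```

Comparing these hourly counts against a traffic graph for the same period would show whether there is any correlation at all.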

> Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
> Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
> Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
> Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
> Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
> Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP
>
> OK, here is some data:
> FreeBSD 7.2-RELEASE-p5 #2: Thu Dec 10 14:21:26 CET 2009
> kern.ipc.nmbclusters="262144"
>
> I never saw anything close to resource exhaustion via netstat -m
> 5999/28441/34440 mbufs in use (current/cache/total)
> 3240/18468/21708/262144 mbuf clusters in use (current/cache/total/max)
> 3239/17881 mbuf+clusters out of packet secondary zone in use (current/cache)
> 2673/10297/12970/204800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 18796K/85234K/104030K bytes allocated to network (current/cache/total)
>
>
> em0: port 0xa000-0xa01f mem 0xe9080000-0xe909ffff,0xe9000000-0xe907ffff,0xe90a0000-0xe90a3fff irq 16 at device 0.0 on pci2
> em0: Using MSIX interrupts
> em1: port 0xb000-0xb01f mem 0xeb020000-0xeb03ffff,0xeb000000-0xeb01ffff irq 16 at device 0.0 on pci3
> em1: Using MSI interrupt
>
> Feb 23 13:20:43 CSBP kernel: em0: Excessive collisions = 0
> Feb 23 13:20:43 CSBP kernel: em0: Sequence errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Defer count = 0
> Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
> Feb 23 13:20:43 CSBP kernel: em0: Receive No Buffers = 257
> Feb 23 13:20:43 CSBP kernel: em0: Receive Length Errors = 1
> Feb 23 13:20:43 CSBP kernel: em0: Receive errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Crc errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Alignment errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Collision/Carrier extension errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: RX overruns = 416328
> Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210
> Feb 23 13:20:43 CSBP kernel: em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> Feb 23 13:20:43 CSBP kernel: em0: XON Rcvd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XON Xmtd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XOFF Rcvd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XOFF Xmtd = 0
> Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245
> Feb 23 13:20:43 CSBP kernel: em0: Good Packets Xmtd = 12866598217
> Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Xmtd = 3515091251
> Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Failed = 0
>
> Feb 23 13:21:14 CSBP kernel: em1: Excessive collisions = 0
> Feb 23 13:21:14 CSBP kernel: em1: Sequence errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Defer count = 0
> Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171
> Feb 23 13:21:14 CSBP kernel: em1: Receive No Buffers = 1112
> Feb 23 13:21:14 CSBP kernel: em1: Receive Length Errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Receive errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Crc errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Alignment errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Collision/Carrier extension errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: RX overruns = 5
> Feb 23 13:21:14 CSBP kernel: em1: watchdog timeouts = 0
> Feb 23 13:21:14 CSBP kernel: em1: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> Feb 23 13:21:14 CSBP kernel: em1: XON Rcvd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XON Xmtd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XOFF Rcvd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XOFF Xmtd = 0
> Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360
> Feb 23 13:21:14 CSBP kernel: em1: Good Packets Xmtd = 9594728760
> Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Xmtd = 30554321
> Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Failed = 0
>
> This is neither em0-hardware problem nor em0-type problem, because I
> tested both cases - I've used different em0 (the same model as my em1
> above) with the same result.
>
> There is one additional thing I should write here: with current em0 card
> watchdog timeouts result in 1-2 minutes of non-responsive network, I
> mean when the watchdog occurred, the box was not reachable for 1 to 2
> minutes. I managed to lower 1-2 minutes of nonresponsive state to
> "acceptable" 2-3 seconds by this: kern.ipc.nmbjumbop=204800

I have not tried this. Our outages are very quick, so quick in fact that
the BGP instance running on the box doesn't notice. Once or twice I have
seen an outage last up to a minute, but most of the time it is very fast.

>
> When I put NIC of the same type as em1, the watchdogs still occur, but
> the box is non-responsive for 2-3 seconds only "by default", without
> modifying kern.ipc.nmbjumbop.
>
> What else can I do (or report) to narrow the problem, or are there any
> patches I should try? :-)
>
> Thanks & regards
> --

---- Kirk