Date:      Tue, 23 Feb 2010 10:24:50 -0700
From:      "Kirk Davis" <kirk.davis@epsb.ca>
To:        "Maciej Wierzbicki" <voovoos-fnet@killfile.pl>
Cc:        freebsd-net@freebsd.org, jfvogel@gmail.com
Subject:   RE: Intel em0: watchdog timeout
Message-ID:  <529374128DC1B04D9D037911B8E8F05301C17A5D@Exchange26.EDU.epsb.ca>
In-Reply-To: <4B83D021.7020201@killfile.pl>
References:  <529374128DC1B04D9D037911B8E8F05301C17A51@Exchange26.EDU.epsb.ca>	<43416_1266864062_4B82CFBE_43416_81_1_2a41acea1002221043k1b8742c9m8fb484a8e8a4fdda@mail.gmail.com>	<529374128DC1B04D9D037911B8E8F05301C17A54@Exchange26.EDU.epsb.ca><2a41acea1002221113v26804200q4f3971c3359dffab@mail.gmail.com> <4B83D021.7020201@killfile.pl>


> From: owner-freebsd-net@freebsd.org
>
> Jack Vogel wrote on 2010-02-22 20:13:
>
> > 7.2 seems to be a stable base OS and driver, 8 is better in some
> > respects, but has not been without its reported problems. I leave
> > the choice to you.
>
> Let me sneak into this thread as I am also suffering from em watchdog
> timeouts. In my case there is a 7.2-release doing HAProxy LB for several
> webservers. But as far as I can tell, the watchdogs are not related to
> traffic rate: I can have low traffic rate near 50Mbps having timeouts
> every minute and I can have 200-300Mbps with long periods of time
> without timeouts, there is no visible regularity in that. em is built
> into kernel. Typical watchdog timeout log:

This doesn't sound good.  I was just about to upgrade the box to 7.2 and
see if the problem goes away with the newer driver. :-/

I have come to the same conclusion about traffic rate. Since the
watchdog timeouts started, I have seen the problem at peak times but
also in the middle of the night when our traffic is very low.
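
To check the "no relation to traffic rate" impression against the logs,
the timeouts can be bucketed by hour of day. This is only a rough
sketch: it assumes syslog lines in the same format as the ones quoted
below, and the function name timeouts_per_hour is made up for
illustration.

```python
import re
from collections import Counter

# Matches only the reset message, e.g.
#   Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
# The cumulative "watchdog timeouts = N" statistics line is deliberately excluded.
WATCHDOG_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"(\w+):\s+watchdog timeout -- resetting\s*$"
)


def timeouts_per_hour(lines):
    """Return a Counter keyed by (interface, hour of day)."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.match(line.strip())
        if match:
            hour, iface = match.groups()
            counts[(iface, int(hour))] += 1
    return counts


# Sample lines copied from the log excerpts in this thread.
sample = [
    "Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting",
    "Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN",
    "Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP",
    "Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting",
    "Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210",
]

for (iface, hour), count in sorted(timeouts_per_hour(sample).items()):
    print(f"{iface} {hour:02d}:00 {count}")
```

Fed a full log instead of the sample, a roughly flat distribution across
the day would support the idea that load is not the trigger.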

> Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting
> Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN
> Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP
> Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting
> Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN
> Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP
>
> OK, here is some data:
> FreeBSD 7.2-RELEASE-p5 #2: Thu Dec 10 14:21:26 CET 2009
> kern.ipc.nmbclusters="262144"
>
> I never saw anything close to resource exhausting via netstat -m
> 5999/28441/34440 mbufs in use (current/cache/total)
> 3240/18468/21708/262144 mbuf clusters in use (current/cache/total/max)
> 3239/17881 mbuf+clusters out of packet secondary zone in use (current/cache)
> 2673/10297/12970/204800 4k (page size) jumbo clusters in use (current/cache/total/max)
> 18796K/85234K/104030K bytes allocated to network (current/cache/total)
>
> em0: <Intel(R) PRO/1000 Network Connection 6.9.6> port 0xa000-0xa01f mem
> 0xe9080000-0xe909ffff,0xe9000000-0xe907ffff,0xe90a0000-0xe90a3fff irq 16
> at device 0.0 on pci2
> em0: Using MSIX interrupts
> em1: <Intel(R) PRO/1000 Network Connection 6.9.6> port 0xb000-0xb01f mem
> 0xeb020000-0xeb03ffff,0xeb000000-0xeb01ffff irq 16 at device 0.0 on pci3
> em1: Using MSI interrupt
>
> Feb 23 13:20:43 CSBP kernel: em0: Excessive collisions = 0
> Feb 23 13:20:43 CSBP kernel: em0: Sequence errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Defer count = 0
> Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
> Feb 23 13:20:43 CSBP kernel: em0: Receive No Buffers = 257
> Feb 23 13:20:43 CSBP kernel: em0: Receive Length Errors = 1
> Feb 23 13:20:43 CSBP kernel: em0: Receive errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Crc errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Alignment errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: Collision/Carrier extension errors = 0
> Feb 23 13:20:43 CSBP kernel: em0: RX overruns = 416328
> Feb 23 13:20:43 CSBP kernel: em0: watchdog timeouts = 1210
> Feb 23 13:20:43 CSBP kernel: em0: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> Feb 23 13:20:43 CSBP kernel: em0: XON Rcvd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XON Xmtd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XOFF Rcvd = 0
> Feb 23 13:20:43 CSBP kernel: em0: XOFF Xmtd = 0
> Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245
> Feb 23 13:20:43 CSBP kernel: em0: Good Packets Xmtd = 12866598217
> Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Xmtd = 3515091251
> Feb 23 13:20:43 CSBP kernel: em0: TSO Contexts Failed = 0
>
> Feb 23 13:21:14 CSBP kernel: em1: Excessive collisions = 0
> Feb 23 13:21:14 CSBP kernel: em1: Sequence errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Defer count = 0
> Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171
> Feb 23 13:21:14 CSBP kernel: em1: Receive No Buffers = 1112
> Feb 23 13:21:14 CSBP kernel: em1: Receive Length Errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Receive errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Crc errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Alignment errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: Collision/Carrier extension errors = 0
> Feb 23 13:21:14 CSBP kernel: em1: RX overruns = 5
> Feb 23 13:21:14 CSBP kernel: em1: watchdog timeouts = 0
> Feb 23 13:21:14 CSBP kernel: em1: RX MSIX IRQ = 0 TX MSIX IRQ = 0 LINK MSIX IRQ = 0
> Feb 23 13:21:14 CSBP kernel: em1: XON Rcvd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XON Xmtd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XOFF Rcvd = 0
> Feb 23 13:21:14 CSBP kernel: em1: XOFF Xmtd = 0
> Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360
> Feb 23 13:21:14 CSBP kernel: em1: Good Packets Xmtd = 9594728760
> Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Xmtd = 30554321
> Feb 23 13:21:14 CSBP kernel: em1: TSO Contexts Failed = 0
>
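
To put the two counter dumps above in proportion, the missed-packet
counts can be compared with the good-packet counts per interface. A
small sketch, assuming the "name = value" layout shown above;
parse_stats and missed_per_good are names invented here, and the ratio
is only a rough indicator since it says nothing about when the drops
happened.

```python
import re

# Captures interface, counter name and value from lines such as:
#   Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167
STAT_RE = re.compile(r"kernel:\s+(\w+):\s+(.+?)\s+=\s+(\d+)\s*$")


def parse_stats(lines):
    """Return {interface: {counter name: integer value}}."""
    stats = {}
    for line in lines:
        match = STAT_RE.search(line)
        if match:
            iface, name, value = match.groups()
            stats.setdefault(iface, {})[name] = int(value)
    return stats


def missed_per_good(counters):
    """Missed packets per good packet received."""
    return counters["Missed Packets"] / counters["Good Packets Rcvd"]


# Values copied from the dumps quoted above.
sample = [
    "Feb 23 13:20:43 CSBP kernel: em0: Missed Packets = 3371167",
    "Feb 23 13:20:43 CSBP kernel: em0: Good Packets Rcvd = 9534885245",
    "Feb 23 13:21:14 CSBP kernel: em1: Missed Packets = 171",
    "Feb 23 13:21:14 CSBP kernel: em1: Good Packets Rcvd = 11350337360",
]

for iface, counters in sorted(parse_stats(sample).items()):
    print(f"{iface}: {missed_per_good(counters):.2e} missed per good packet")
```
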
> This is neither em0-hardware problem nor em0-type problem, because I
> tested both cases - I've used different em0 (the same model as my em1
> above) with the same result.
>
> There is one additional thing I should write here: with current em0 card
> watchdog timeouts results in 1-2 minutes of non-responsive network, I
> mean when the watchdog occured, the box was not reachable for 1 to 2
> minutes. I managed to lower 1-2 minutes of nonresponsive state to
> "acceptable" 2-3 seconds by this: kern.ipc.nmbjumbop=204800

I have not tried this. Our outages are very quick, so quick in fact that
the BGP instance running on the box doesn't notice.  Once or twice I
have seen an outage last up to a minute, but most of the time it is very
fast.
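
One way to put numbers on "very quick" is to pair each "link state
changed to DOWN" message with the following UP for the same interface.
The sketch below assumes the syslog format quoted earlier in this
message; outage_seconds is a made-up name, and the year has to be
supplied because syslog timestamps do not carry one. Since the
timestamps have one-second resolution, very short flaps will show up as
0 seconds.

```python
import re
from datetime import datetime

# Matches e.g. "Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN"
LINK_RE = re.compile(
    r"^(\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(\w+):\s+link state changed to (DOWN|UP)\s*$"
)


def outage_seconds(lines, year=2010):
    """Return a list of (interface, time link went down, seconds down)."""
    down_at = {}
    outages = []
    for line in lines:
        match = LINK_RE.match(line.strip())
        if not match:
            continue
        stamp, iface, state = match.groups()
        # %b expects English month abbreviations (C locale assumed).
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        if state == "DOWN":
            down_at[iface] = when
        elif iface in down_at:
            start = down_at.pop(iface)
            outages.append((iface, start, (when - start).total_seconds()))
    return outages


# Sample lines copied from the log excerpt quoted in this message.
sample = [
    "Feb 22 21:21:31 CSBP kernel: em0: watchdog timeout -- resetting",
    "Feb 22 21:21:31 CSBP kernel: em0: link state changed to DOWN",
    "Feb 22 21:21:34 CSBP kernel: em0: link state changed to UP",
    "Feb 22 21:43:33 CSBP kernel: em0: watchdog timeout -- resetting",
    "Feb 22 21:43:33 CSBP kernel: em0: link state changed to DOWN",
    "Feb 22 21:43:36 CSBP kernel: em0: link state changed to UP",
]

for iface, start, seconds in outage_seconds(sample):
    print(f"{iface} down at {start:%b %d %H:%M:%S} for {seconds:.0f}s")
```

This only measures link-state flaps as the kernel logs them; it does not
show how long the box was actually unreachable from the network.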

>
> When I put NIC of the same type as em1, the watchdogs still occurs, but
> the box is non-responsive for 2-3 seconds only "by default", without
> modifying kern.ipc.nmbjumbop.
>
> What else can I do (or report) to narrow the problem, or are there any
> patches I should try? :-)
>
> Thanks & regards
> --


---- Kirk


