Date: Wed, 27 Sep 2006 08:17:42 -0500
From: Stephen Montgomery-Smith <stephen@math.missouri.edu>
To: Scott Long <scottl@samsco.org>
Cc: freebsd-stable@freebsd.org, Oliver Brandmueller <ob@e-Gitt.NET>
Subject: Re: 6.2 SHOWSTOPPER - em completely unusable on 6.2
Message-ID: <451A79F6.5000501@math.missouri.edu>
In-Reply-To: <451A4189.5020906@samsco.org>
References: <451A1375.5080202@gneto.com> <20060927071538.GF22229@e-Gitt.NET> <451A4189.5020906@samsco.org>

Scott Long wrote:
> Oliver Brandmueller wrote:
>
>> Hi,
>>
>> On Wed, Sep 27, 2006 at 08:00:21AM +0200, Martin Nilsson wrote:
>>
>>> I get tons of these:
>>> em0: watchdog timeout -- resetting
>>> em0: link state changed to DOWN
>>> em0: link state changed to UP
>>>
>>> mailbox# pciconf -lv
>>> em0@pci13:0:0: class=0x020000 card=0x108c15d9 chip=0x108c8086
>>> rev=0x03 hdr=0x00
>>> vendor = 'Intel Corporation'
>>> device = 'PRO/1000 PM'
>>> class = network
>>> subclass = ethernet
>>> em1@pci14:0:0: class=0x020000 card=0x109a15d9 chip=0x109a8086
>>> rev=0x00 hdr=0x00
>>> vendor = 'Intel Corporation'
>>> class = network
>>> subclass = ethernet
>>>
>>
>> [...]
>>
>>> I have only seen them on em0. Yesterday I tried sysutils/cpuburn on
>>> similar boxes that are netbooted with NFS mounted drives and
>>> every time I loaded the two CPU cores the network went down.
>>
>> I see the same.
>>
>> Very much on this one, where I work around the problem by using polling,
>> it's a UP machine.
>>
>> FreeBSD nessie 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #3: Fri Sep 15
>> 09:48:36 CEST 2006 root@nessie:/usr/obj/usr/src/sys/NESSIE i386
>>
>> em0@pci2:1:0: class=0x020000 card=0x10198086 chip=0x10198086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller (LOM)'
>> class = network
>> subclass = ethernet
>>
>> irq18: em0 uhci2 3319 0
>>
>> Another machine, also UP, but with two interfaces. The problem is not
>> as apparent as on the first machine, but it's there. This machine is
>> not as loaded usually (CPU wise) as the first machine.
>> The problem is ONLY on em1:
>>
>> FreeBSD hudson 6.2-PRERELEASE FreeBSD 6.2-PRERELEASE #48: Thu Sep 14
>> 10:19:46 CEST 2006 root@hudson:/usr/obj/usr/src/sys/NFS-32-FBSD6
>> i386
>>
>> em0@pci1:1:0: class=0x020000 card=0x10758086 chip=0x10758086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller'
>> class = network
>> subclass = ethernet
>>
>> em1@pci3:2:0: class=0x020000 card=0x10768086 chip=0x10768086
>> rev=0x00 hdr=0x00
>> vendor = 'Intel Corporation'
>> device = '82547EI Gigabit Ethernet Controller'
>> class = network
>> subclass = ethernet
>>
>> irq17: em1 ichsmb0 950121879 855
>> irq18: em0 71437344 64
>>
>> The problem appeared after the em updates during the last weeks in the
>> kernel and has not been observed before this. em is always loaded as a
>> module in my kernels. The problem seems to occur more often if the
>> machine's CPU is busy.
>>
>> I have several SMP machines with the following em interfaces, which
>> DON'T show the problem, but they also have a different chipset on the em
>> interface. Most of the kernels were built between Sep 7 and Sep 19.
>>
>> 3 times this:
>> em0@pci4:5:0: class=0x020000 card=0x34248086 chip=0x10108086
>> rev=0x01 hdr=0x00
>> em1@pci4:5:1: class=0x020000 card=0x34248086 chip=0x10108086
>> rev=0x01 hdr=0x00
>> irq23: em0 970303432 750
>>
>> 3 times this:
>> em0@pci4:5:0: class=0x020000 card=0x34258086 chip=0x100e8086
>> rev=0x02 hdr=0x00
>> irq23: em0 292477376 435
>>
>> So I can observe at least 3 interesting differences:
>>
>> - the interface showing the problems shares the interrupt
>> - for me it happens on UP machines only
>> - the chips are different
>>
>> What I can't do: moving the interfaces between machines, these are
>> onboard interfaces.
>>
>> What I could do: I could try to unload the USB driver or the ichsmb
>> driver on the machines where the interrupts are shared.
>> Anyway, the USB is not used currently (I have it enabled to be prepared
>> to hook up a USB Mass Storage device, which never happened since the
>> problem occurred). The ichsmb also is usually not queried.
>>
>> Any suggestions on how I could help?
>>
>> - Olli
>>
>
> Well, the best I can say at the moment is, "Wow." =-( I guess the
> thing to do here is to figure out if the problem lies with the em
> interrupt handler not getting run, or the taskqueue not getting run.
> Since you've stated that it seems to be related to shared interrupts,
> the first possibility is more likely. However, I'm not sure why the
> symptom would only be showing up now. The Intel docs say that the
> 82547EI are a bit interesting, and I wonder if assumptions that we
> make about PCI ordering aren't true (or if there are bugs that make
> our assumptions invalid).
>
> Does this happen after there has been a lot of disk activity, like a
> large tar extraction? Are you using the SMBus interface at all, or is
> it sitting completely idle?

I have experienced this problem also. It happens when the system is
definitely not idle: I am simultaneously doing large Internet transfers
(via em), using the graphics card with OpenGL, and building the KDE port.

I have actually had this problem for a month or so, so if it is a
software fault, it was introduced into the OS quite recently. (I tend to
rebuild RELENG_6 about twice a month.)

Stephen
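Olli's observation that the failing interfaces are the ones sharing an interrupt can be checked mechanically across machines. The following is a minimal Python sketch, not a FreeBSD tool: it assumes the input is per-IRQ lines in the format quoted above (as printed by `vmstat -i`: IRQ name, attached devices, total count, rate), and the helper names `parse_irq_line` and `shared_em_irqs` are hypothetical. It only flags em interfaces that share their IRQ with another device; it says nothing about whether the interrupt handler or the taskqueue is the part that stops running.

```python
import re

# Hypothetical helpers; assumes lines shaped like the `vmstat -i` output
# quoted in this thread: "irqN: dev [dev ...] total rate".
EM_DEVICE = re.compile(r"em\d+")


def parse_irq_line(line):
    """Split one interrupt line into (irq, devices, total, rate)."""
    fields = line.split()
    irq = fields[0].rstrip(":")
    # The last two fields are the interrupt count and per-second rate;
    # everything between the IRQ name and those counters is a device.
    devices = fields[1:-2]
    return irq, devices, int(fields[-2]), int(fields[-1])


def shared_em_irqs(lines):
    """Return (irq, devices) pairs where an em interface shares its IRQ."""
    shared = []
    for line in lines:
        if not line.startswith("irq"):
            continue  # skip the column header and the "Total" line
        irq, devices, _total, _rate = parse_irq_line(line)
        if len(devices) > 1 and any(EM_DEVICE.fullmatch(d) for d in devices):
            shared.append((irq, devices))
    return shared


if __name__ == "__main__":
    # IRQ lines copied from the reports above (several machines mixed).
    sample = [
        "irq18: em0 uhci2 3319 0",           # nessie: problem seen
        "irq17: em1 ichsmb0 950121879 855",  # hudson em1: problem seen
        "irq18: em0 71437344 64",            # hudson em0: no problem
        "irq23: em0 970303432 750",          # SMP machines: no problem
    ]
    for irq, devices in shared_em_irqs(sample):
        print(irq, " ".join(devices))
```

On the sample lines, only nessie's em0 (with uhci2) and hudson's em1 (with ichsmb0) are reported, which is the same correlation Olli describes; a shared IRQ is a lead, not a diagnosis.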