Date: Fri, 12 Sep 2003 13:51:56 -0700 (PDT)
From: John Polstra <jdp@polstra.com>
To: Mike Tancsa <mike@sentex.net>
Cc: freebsd-stable@freebsd.org
Subject: Re: recent stability problems with fxp driver
Message-ID: <XFMail.20030912135156.jdp@polstra.com>
In-Reply-To: <6.0.0.22.0.20030912134112.05891060@209.112.4.2>
On 12-Sep-2003 Mike Tancsa wrote:
> At 12:26 PM 12/09/2003, Info Account wrote:
>>I've spent the past four days or so updating machines here to 4.8/9-stable via
>>cvsup, and have done a complete make buildworld/kernel on each machine (some
>>SMP, some single processor). It seems something is broken with the latest fxp
>>driver, on each machine (different mobos and hardware configs) heavy network
>>traffic with fxp NICs causes timeouts and random kernel panics.
>
> I have a few boxes pushing over 50Mb with fxp cards and havent seen this
> problem. What type of fxp cards do you have ? What does
> pciconf -v -l
> show for the Intel types ?
>
> Also, I have found in the past that I would see this behavior if I changed
> NICs and didnt do a PCIconfig reset in the MB BIOS. There is something
> about Intel nics and Adaptec and 3ware cards that particularly require
> this. Also, make sure that you dont have some duplex mismatches on the
> nics. I have seen where excessive errors combined with high traffic will
> cause panics.
>
> Also, please post the actual error messages on each of the machines.

The problem is real, at least on some hardware. I had to give up on
using the two integrated fxp devices on my Dell 1550 -- which is a
real bummer, since it's a 1U box that only has two PCI slots. With
the latest -stable driver, I couldn't fetch a 560 MB file from another
machine on the LAN using FTP without killing the fxp device.
The messages vary in detail, but this will give you the general idea:

Sep 12 10:18:22 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x90 0x0
Sep 12 10:18:31 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:18:32 thin su: jdp to root on /dev/ttyp1
Sep 12 10:18:39 thin /kernel: fxp0: DMA timeout
Sep 12 10:18:39 thin last message repeated 2 times
Sep 12 10:18:49 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:51 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:18:54 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:56 thin /kernel: fxp0: device timeout
Sep 12 10:18:56 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:10 thin last message repeated 5 times
Sep 12 10:19:10 thin /kernel: fxp0: SCB timeout: 0x1 0x20 0x80 0x0
Sep 12 10:19:13 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:14 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:15 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:16 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:36 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:38 thin /kernel: fxp0: device timeout
Sep 12 10:19:38 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:38 thin last message repeated 2 times
Sep 12 10:19:52 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:54 thin /kernel: fxp0: device timeout
Sep 12 10:19:54 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:54 thin last message repeated 2 times
Sep 12 10:20:00 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:21 thin /kernel: fxp0: device timeout
Sep 12 10:20:21 thin /kernel: fxp0: DMA timeout
Sep 12 10:20:21 thin last message repeated 2 times
Sep 12 10:20:29 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:21:04 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:21:09 thin /kernel: fxp0: device timeout
Sep 12 10:21:09 thin /kernel: fxp0: DMA timeout
Sep 12 10:21:09 thin last message repeated 2 times
Sep 12 10:21:09 thin /kernel: fxp0: command queue timeout
Sep 12 10:21:12 thin shutdown: reboot by jdp:

This morning I tried regressing the driver to earlier versions in an
attempt to find the commit that broke it. Not good news:

    RELENG_4_8_0_RELEASE    bad
    RELENG_4_7_0_RELEASE    bad
    RELENG_4_6_0_RELEASE    bad
    RELENG_4_4_0_RELEASE    bad
    RELENG_4_2_0_RELEASE    bad
    RELENG_4_1_0_RELEASE    bad

The problem is easier to reproduce in recent versions of the driver
than in older versions. With the current -stable driver, I can almost
always kill the chips with a single transfer of that 560 MB file.
With the 4.7.0 driver, it takes about 5 transfers before it fails.
With the 4.2.0 driver, it took 15+ transfers.

The devices are Intel 82559 chips. Here's their pciconf output:

none0@pci0:1:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet
none1@pci0:2:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet

Maybe the problem really is in the Dell 1550. I have various flavors
of fxp card in several other machines, and I never have trouble with
them. I did check my firmware and BIOS versions, though, and they're
fully up-to-date. I have a suspicion that our driver may not be
dealing properly with Dell's power management or IPMI stuff, but it's
just a vague suspicion without any real evidence.

John
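
For anyone comparing logs across driver versions, tallying the fxp
messages by type may make the failure pattern easier to see. The
following is only a minimal sketch, assuming syslog lines in the format
quoted above. The function name tally_fxp_messages and both regular
expressions are illustrative and not part of any FreeBSD tool. It also
assumes that a "last message repeated N times" line stands for N more
copies of the line immediately before it.

```python
import re
from collections import Counter

# Matches kernel lines such as
#   "... /kernel: fxp0: SCB timeout: 0x80 0x0 0x90 0x0"
# and captures the device name and the message type ("SCB timeout").
FXP_RE = re.compile(r"/kernel: (fxp\d+): ([A-Za-z ]+?)(?::|$)")

# Matches syslog's repeat-compression line (assumed format).
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")


def tally_fxp_messages(lines):
    """Count fxp kernel messages per (device, message type)."""
    counts = Counter()
    last_key = None  # key of the previous line if it was an fxp message
    for line in lines:
        line = line.rstrip()
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Credit the repeats to the preceding fxp message, if any.
            if last_key is not None:
                counts[last_key] += int(repeat.group(1))
            continue
        match = FXP_RE.search(line)
        if match:
            last_key = (match.group(1), match.group(2).strip())
            counts[last_key] += 1
        else:
            # A non-fxp line (e.g. su, shutdown) breaks the repeat chain.
            last_key = None
    return counts
```

Feeding it the lines of a saved log excerpt returns a Counter keyed by
pairs such as ("fxp0", "DMA timeout"), which can then be compared
between runs with different driver versions.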
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?XFMail.20030912135156.jdp>