Date: Fri, 12 Sep 2003 13:51:56 -0700 (PDT)
From: John Polstra
To: Mike Tancsa
cc: Info Account
cc: freebsd-stable@freebsd.org
Subject: Re: recent stability problems with fxp driver
In-Reply-To: <6.0.0.22.0.20030912134112.05891060@209.112.4.2>

On 12-Sep-2003 Mike Tancsa wrote:
> At 12:26 PM 12/09/2003, Info Account wrote:
>>I've spent the past four days or so updating machines here to 4.8/9-stable via
>>cvsup, and have done a complete make buildworld/kernel on each machine (some
>>SMP, some single processor).  It seems something is broken with the latest fxp
>>driver, on each machine (different mobos and hardware configs) heavy network
>>traffic with fxp NICs causes timeouts and random kernel panics.
>
> I have a few boxes pushing over 50Mb with fxp cards and havent seen this
> problem.
> What type of fxp cards do you have ?  What does
> pciconf -v -l
> show for the Intel types ?
>
> Also, I have found in the past that I would see this behavior if I changed
> NICs and didnt do a PCIconfig reset in the MB BIOS.  There is something
> about Intel nics and Adaptec and 3ware cards that particularly require
> this.  Also, make sure that you dont have some duplex mismatches on the
> nics.  I have seen where excessive errors combined with high traffic will
> cause panics.
>
> Also, please post the actual error messages on each of the machines.

The problem is real, at least on some hardware.  I had to give up on
using the two integrated fxp devices on my Dell 1550 -- which is a
real bummer, since it's a 1U box that only has two PCI slots.  With
the latest -stable driver, I couldn't fetch a 560 MB file from another
machine on the LAN using FTP without killing the fxp device.  The
messages vary in detail, but this will give you the general idea:

Sep 12 10:18:22 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x90 0x0
Sep 12 10:18:31 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:18:32 thin su: jdp to root on /dev/ttyp1
Sep 12 10:18:39 thin /kernel: fxp0: DMA timeout
Sep 12 10:18:39 thin last message repeated 2 times
Sep 12 10:18:49 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:51 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:18:54 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:18:56 thin /kernel: fxp0: device timeout
Sep 12 10:18:56 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:10 thin last message repeated 5 times
Sep 12 10:19:10 thin /kernel: fxp0: SCB timeout: 0x1 0x20 0x80 0x0
Sep 12 10:19:13 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:14 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:15 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:16 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:19:36 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:38 thin /kernel: fxp0: device timeout
Sep 12 10:19:38 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:38 thin last message repeated 2 times
Sep 12 10:19:52 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:19:54 thin /kernel: fxp0: device timeout
Sep 12 10:19:54 thin /kernel: fxp0: DMA timeout
Sep 12 10:19:54 thin last message repeated 2 times
Sep 12 10:20:00 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:21 thin /kernel: fxp0: device timeout
Sep 12 10:20:21 thin /kernel: fxp0: DMA timeout
Sep 12 10:20:21 thin last message repeated 2 times
Sep 12 10:20:29 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:20:35 thin /kernel: fxp0: SCB timeout: 0x80 0x0 0x50 0x0
Sep 12 10:21:04 thin /kernel: fxp0: SCB timeout: 0x70 0x0 0x90 0x0
Sep 12 10:21:09 thin /kernel: fxp0: device timeout
Sep 12 10:21:09 thin /kernel: fxp0: DMA timeout
Sep 12 10:21:09 thin last message repeated 2 times
Sep 12 10:21:09 thin /kernel: fxp0: command queue timeout
Sep 12 10:21:12 thin shutdown: reboot by jdp:

This morning I tried regressing the driver to earlier versions in an
attempt to find the commit that broke it.  Not good news:

    RELENG_4_8_0_RELEASE    bad
    RELENG_4_7_0_RELEASE    bad
    RELENG_4_6_0_RELEASE    bad
    RELENG_4_4_0_RELEASE    bad
    RELENG_4_2_0_RELEASE    bad
    RELENG_4_1_0_RELEASE    bad

The problem is easier to reproduce in recent versions of the driver
than in older versions.  With the current -stable driver, I can almost
always kill the chips with a single transfer of that 560 MB file.
With the 4.7.0 driver, it takes about 5 transfers before it fails.
With the 4.2.0 driver, it took 15+ transfers.

The devices are Intel 82559 chips.
Here's their pciconf output:

none0@pci0:1:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet
none1@pci0:2:0: class=0x020000 card=0x00da1028 chip=0x12298086 rev=0x08 hdr=0x00
    vendor   = 'Intel Corporation'
    device   = '82557/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class    = network
    subclass = ethernet

Maybe the problem really is in the Dell 1550.  I have various flavors
of fxp card in several other machines, and I never have trouble with
them.  I did check my firmware and BIOS versions, though, and they're
fully up-to-date.  I have a suspicion that our driver may not be
dealing properly with Dell's power management or IPMI stuff, but it's
just a vague suspicion without any real evidence.

John