From owner-freebsd-bugs Mon Jun 12 14:40: 9 2000
Delivered-To: freebsd-bugs@freebsd.org
Received: from freefall.freebsd.org (freefall.FreeBSD.ORG [204.216.27.21])
    by hub.freebsd.org (Postfix) with ESMTP id 75C8037BC6A
    for ; Mon, 12 Jun 2000 14:40:01 -0700 (PDT)
    (envelope-from gnats@FreeBSD.org)
Received: (from gnats@localhost)
    by freefall.freebsd.org (8.9.3/8.9.2) id OAA19365;
    Mon, 12 Jun 2000 14:40:01 -0700 (PDT)
    (envelope-from gnats@FreeBSD.org)
Received: from devnull.ussc.alltheweb.com (devnull.ussc.alltheweb.com [216.35.112.83])
    by hub.freebsd.org (Postfix) with ESMTP id 8E67C37BB6F
    for ; Mon, 12 Jun 2000 14:37:18 -0700 (PDT)
    (envelope-from gij@ussc.alltheweb.com)
Received: (from gij@localhost)
    by devnull.ussc.alltheweb.com (8.9.3/8.9.3) id VAA04550;
    Mon, 12 Jun 2000 21:37:17 GMT
    (envelope-from gij)
Message-Id: <200006122137.VAA04550@devnull.ussc.alltheweb.com>
Date: Mon, 12 Jun 2000 21:37:17 GMT
From: gij@jk.priv.no
Reply-To: gij@jk.priv.no
To: FreeBSD-gnats-submit@freebsd.org
X-Send-Pr-Version: 3.2
Subject: i386/19226: SCSI timeouts during heavy load
Sender: owner-freebsd-bugs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

>Number:         19226
>Category:       i386
>Synopsis:       SCSI timeouts during heavy load
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Jun 12 14:40:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator:     Geir Inge Jensen
>Release:        FreeBSD 4.0-STABLE i386
>Organization:
None, only personal opinions expressed.
>Environment:

Dell PowerEdge 2450 Dual 600MHz.
Dell PowerVault 200S.
Two AHA29160 SCSI cards, both connected to the PowerVault.
3 internal IBM DMVS 18GB disks.
8 external disks in the PowerVault (same disks).

Relevant dmesg output:

CPU: Pentium III/Pentium III Xeon (598.11-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0x681  Stepping = 1
  Features=0x383fbff
real memory  = 1073741824 (1048576K bytes)
avail memory = 1039880192 (1015508K bytes)
Programming 16 pins in IOAPIC #0
Programming 16 pins in IOAPIC #1
IOAPIC #1 intpin 0 -> irq 2
IOAPIC #1 intpin 1 -> irq 11
IOAPIC #1 intpin 2 -> irq 13
IOAPIC #1 intpin 4 -> irq 16
IOAPIC #1 intpin 5 -> irq 17
IOAPIC #1 intpin 6 -> irq 18
IOAPIC #1 intpin 7 -> irq 19
IOAPIC #1 intpin 14 -> irq 10
IOAPIC #1 intpin 15 -> irq 5
FreeBSD/SMP: Multiprocessor motherboard
 cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
 cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
 io0 (APIC): apic id: 2 (really 0), version: 0x000f0011, at 0xfec00000
Reprogramming APIC ID!
 io1 (APIC): apic id: 3 (really 0), version: 0x000f0011, at 0xfec01000
Reprogramming APIC ID!
Preloaded elf kernel "kernel" at 0xc033b000.
ccd0-3: Concatenated disk drivers
Pentium Pro MTRR support enabled
SMP: AP CPU #1 Launched!
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
pci0: on pcib0
ahc0: port 0xec00-0xecff mem 0xfe003000-0xfe003fff irq 11 at device 4.0 on pci0
ahc0: aic7892 Wide Channel A, SCSI Id=7, 16/255 SCBs
ahc1: port 0xe800-0xe8ff mem 0xfe002000-0xfe002fff irq 18 at device 8.0 on pci0
ahc1: aic7892 Wide Channel A, SCSI Id=7, 16/255 SCBs
pci0: at 14.0
isab0: at device 15.0 on pci0
isa0: on isab0
atapci0: port 0x8b0-0x8bf at device 15.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ohci0: mem 0xfe000000-0xfe000fff irq 5 at device 15.2 on pci0
usb0: OHCI version 1.0, legacy support
usb0: on ohci0
usb0: USB revision 1.0
uhub0: (unknown) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pcib1: on motherboard
pci1: on pcib1
pcib2: at device 2.0 on pci1
pci2: on pcib2
ahc2: port 0xdc00-0xdcff mem 0xf8fff000-0xf8ffffff irq 5 at device 4.0 on pci2
ahc2: aic7899 Wide Channel A, SCSI Id=7, 16/255 SCBs
ahc3: port 0xd800-0xd8ff mem 0xf8ffe000-0xf8ffefff irq 10 at device 4.1 on pci2
ahc3: aic7899 Wide Channel B, SCSI Id=7, 16/255 SCBs
fxp0: port 0xccc0-0xccff mem 0xfa000000-0xfa0fffff,0xfa100000-0xfa100fff irq 2 at device 8.0 on pci1
fxp0: Ethernet address 00:b0:d0:20:cd:90
fdc0: at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: at port 0x60,0x64 on isa0
atkbd0: irq 1 on atkbdc0
psm0: irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: on isa0
sc0: VGA <16 virtual consoles, flags=0x200>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
lpt0: on ppbus0
lpt0: Interrupt-driven port
APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
acd0: CDROM at ata0-master using PIO4
pass2 at ahc2 bus 0 target 6 lun 0
pass2: Fixed Processor SCSI-2 device
pass2: 3.300MB/s transfers
pass7 at ahc0 bus 0 target 15 lun 0
pass7: Removable Processor SCSI-3 device
pass7: 3.300MB/s transfers
pass12 at ahc1 bus 0 target 15 lun 0
pass12: Removable Processor SCSI-3 device
pass12: 3.300MB/s transfers
pass14 at ahc3 bus 0 target 6 lun 0
pass14: Fixed Processor SCSI-2 device
pass14: 3.300MB/s transfers

>Description:

After a while, during heavy disk I/O, the following appears:

(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): Queuing a BDR SCB
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): Queuing a BDR SCB
(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): no longer in timeout, status = 34b
ahc0: Issued Channel A Bus Reset. 7 SCBs aborted
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): no longer in timeout, status = 34b
ahc1: Issued Channel A Bus Reset. 7 SCBs aborted

And so on. At this point, all contact with the PowerVault is lost.
Of course, the ccd freaks out with this:

ccd0: error 5 on component 0 block 80 (ccd block 64)

Notice that the error occurs on both buses at the same time!

It can take several hours before this happens, but we can reproduce
it with some patience and heavy load. The SCBs differ slightly from
occasion to occasion.

This is what we have tried to pinpoint the cause:

- Replace all SCSI cables.
- Terminate the buses in the BIOS.
- Replace the AHA29160s with other AHA29160s.
- Replace the AHA29160s with AHA2940U2Ws.
- Replace the internal PCI bus the cards plug into (PCI tray).
- Replace the ES Expander Modules in the PowerVault.
- Replace the PowerVault.
- Replace the PowerVault with a known good (and older revision)
  PowerVault (we have several of these running on Dell PowerEdge 4350
  with 3.3-STABLE on them). These older systems run fine.
- Test with 4.0-STABLE UP kernel.
- Test with 5.0-CURRENT UP kernel.
- Keep both external SCSI cards, but use only one of them.
- Remove one of the external SCSI cards, and use the internal 7899,
  channel B, as well, against the PowerVault (i.e. two buses against it).
- Run RedHat 6.2 with the 2.2.14-5 kernel on the same system.

None of the above actions cured it. After some hours, it fails.

Note that the old PowerVault we tested from earlier systems contained
other disks (Seagate and Quantum), which work fine under 3.3-STABLE.

From this testing, we draw these conclusions:

- There is nothing wrong with the PowerVault and the disk drives.
- There is nothing wrong with the SCSI cards.

We also have some success stories:

- Run the PowerVault from a single PCI card (i.e. remove the other).
- Run the PowerVault only from the internal 7899, channel B.
- linux-2.2.14-6.1.1 kernel (provided by Dell) with original HW setup.
- linux-2.2.15 kernel with original HW setup.

To me, it sounds like a PCI problem (or maybe a problem in the RCC LE
chip). It could also be a problem in the AIC7xxx driver, but it even
failed with the AHA2940U2W cards (which work fine in our 3.3 systems).
But I am only guessing here. However, Linux has obviously found a fix.

>How-To-Repeat:

Access every disk in the system, and produce a lot of I/O. I open all
disk devices in raw mode and do a lot of random seeks and reads.
However, we have also experienced this error on mostly idle machines.

>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
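
The load described under How-To-Repeat (open every disk device and issue many random seeks and reads) could be approximated with something like the sketch below. This is not the program used for the report; the function name random_read_stress, its parameters, and the default block size are all assumptions made for illustration. It also assumes that seeking to the end of a device reports its size, which may not hold for every device node.

```python
import os
import random


def random_read_stress(paths, total_reads=1000, block_size=8192, seed=None):
    """Spread random block-aligned reads over all given paths.

    Hypothetical helper: returns the total number of bytes read.
    """
    rng = random.Random(seed)
    fds = []
    bytes_read = 0
    try:
        for path in paths:
            fds.append(os.open(path, os.O_RDONLY))
        # Assumption: seeking to the end yields the size in bytes.
        targets = []
        for fd in fds:
            blocks = os.lseek(fd, 0, os.SEEK_END) // block_size
            if blocks > 0:
                targets.append((fd, blocks))
        if not targets:
            return 0
        for _ in range(total_reads):
            # Pick a random device, then a random whole block on it.
            fd, blocks = rng.choice(targets)
            os.lseek(fd, rng.randrange(blocks) * block_size, os.SEEK_SET)
            bytes_read += len(os.read(fd, block_size))
    finally:
        for fd in fds:
            os.close(fd)
    return bytes_read
```

A call such as random_read_stress(["/dev/da2", "/dev/da6"], total_reads=100000) would exercise two disks, with the device paths here being placeholders borrowed from the log messages above. The loop is single-threaded, so only one read is outstanding at a time; running several copies in parallel would come closer to heavy load on both buses.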