Date: Mon, 12 Jun 2000 21:37:17 GMT
From: gij@jk.priv.no
To: FreeBSD-gnats-submit@freebsd.org
Subject: i386/19226: SCSI timeouts during heavy load
Message-ID: <200006122137.VAA04550@devnull.ussc.alltheweb.com>
>Number: 19226
>Category: i386
>Synopsis: SCSI timeouts during heavy load
>Confidential: no
>Severity: serious
>Priority: high
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Mon Jun 12 14:40:01 PDT 2000
>Closed-Date:
>Last-Modified:
>Originator: Geir Inge Jensen
>Release: FreeBSD 4.0-STABLE i386
>Organization:
None, only personal opinions expressed.
>Environment:
Dell PowerEdge 2450 Dual 600MHz. Dell PowerVault 200S. Two AHA29160
SCSI cards, both connected to the PowerVault.
3 internal IBM DMVS 18GB disks. 8 external disks in the PowerVault
(same disks).
Relevant dmesg output:
CPU: Pentium III/Pentium III Xeon (598.11-MHz 686-class CPU)
Origin = "GenuineIntel" Id = 0x681 Stepping = 1
Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,XMM>
real memory = 1073741824 (1048576K bytes)
avail memory = 1039880192 (1015508K bytes)
Programming 16 pins in IOAPIC #0
Programming 16 pins in IOAPIC #1
IOAPIC #1 intpin 0 -> irq 2
IOAPIC #1 intpin 1 -> irq 11
IOAPIC #1 intpin 2 -> irq 13
IOAPIC #1 intpin 4 -> irq 16
IOAPIC #1 intpin 5 -> irq 17
IOAPIC #1 intpin 6 -> irq 18
IOAPIC #1 intpin 7 -> irq 19
IOAPIC #1 intpin 14 -> irq 10
IOAPIC #1 intpin 15 -> irq 5
FreeBSD/SMP: Multiprocessor motherboard
cpu0 (BSP): apic id: 1, version: 0x00040011, at 0xfee00000
cpu1 (AP): apic id: 0, version: 0x00040011, at 0xfee00000
io0 (APIC): apic id: 2 (really 0), version: 0x000f0011, at 0xfec00000
Reprogramming APIC ID!
io1 (APIC): apic id: 3 (really 0), version: 0x000f0011, at 0xfec01000
Reprogramming APIC ID!
Preloaded elf kernel "kernel" at 0xc033b000.
ccd0-3: Concatenated disk drivers
Pentium Pro MTRR support enabled
SMP: AP CPU #1 Launched!
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <RCC LE host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
ahc0: <Adaptec 29160 Ultra160 SCSI adapter> port 0xec00-0xecff mem 0xfe003000-0xfe003fff irq 11 at device 4.0 on pci0
ahc0: aic7892 Wide Channel A, SCSI Id=7, 16/255 SCBs
ahc1: <Adaptec 29160 Ultra160 SCSI adapter> port 0xe800-0xe8ff mem 0xfe002000-0xfe002fff irq 18 at device 8.0 on pci0
ahc1: aic7892 Wide Channel A, SCSI Id=7, 16/255 SCBs
pci0: <ATI model 4759 graphics accelerator> at 14.0
isab0: <PCI to ISA bridge (vendor=1166 device=0200)> at device 15.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Unknown PCI ATA controller (generic mode)> port 0x8b0-0x8bf at device 15.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ohci0: <OHCI (generic) USB controller> mem 0xfe000000-0xfe000fff irq 5 at device 15.2 on pci0
usb0: OHCI version 1.0, legacy support
usb0: <OHCI (generic) USB controller> on ohci0
usb0: USB revision 1.0
uhub0: (unknown) OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pcib1: <RCC LE host to PCI bridge> on motherboard
pci1: <PCI bus> on pcib1
pcib2: <PCI to PCI bridge (vendor=8086 device=0962)> at device 2.0 on pci1
pci2: <PCI bus> on pcib2
ahc2: <Adaptec aic7899 Ultra160 SCSI adapter> port 0xdc00-0xdcff mem 0xf8fff000-0xf8ffffff irq 5 at device 4.0 on pci2
ahc2: aic7899 Wide Channel A, SCSI Id=7, 16/255 SCBs
ahc3: <Adaptec aic7899 Ultra160 SCSI adapter> port 0xd800-0xd8ff mem 0xf8ffe000-0xf8ffefff irq 10 at device 4.1 on pci2
ahc3: aic7899 Wide Channel B, SCSI Id=7, 16/255 SCBs
fxp0: <Intel EtherExpress Pro 10/100B Ethernet> port 0xccc0-0xccff mem 0xfa000000-0xfa0fffff,0xfa100000-0xfa100fff irq 2 at device 8.0 on pci1
fxp0: Ethernet address 00:b0:d0:20:cd:90
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
psm0: <PS/2 Mouse> irq 12 on atkbdc0
psm0: model IntelliMouse, device ID 3
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> on isa0
sc0: VGA <16 virtual consoles, flags=0x200>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
ppc0: <Parallel port> at port 0x378-0x37f irq 7 on isa0
ppc0: Generic chipset (ECP/PS2/NIBBLE) in COMPATIBLE mode
ppc0: FIFO with 16/16/8 bytes threshold
lpt0: <Printer> on ppbus0
lpt0: Interrupt-driven port
APIC_IO: routing 8254 via 8259 and IOAPIC #0 intpin 0
acd0: CDROM <TOSHIBA CD-ROM XM-7002B> at ata0-master using PIO4
pass2 at ahc2 bus 0 target 6 lun 0
pass2: <DELL 2x2 U2W SCSI BP 1.15> Fixed Processor SCSI-2 device
pass2: 3.300MB/s transfers
pass7 at ahc0 bus 0 target 15 lun 0
pass7: <Dell 8 BAY U2W CU 0203> Removable Processor SCSI-3 device
pass7: 3.300MB/s transfers
pass12 at ahc1 bus 0 target 15 lun 0
pass12: <Dell 8 BAY U2W CU 0203> Removable Processor SCSI-3 device
pass12: 3.300MB/s transfers
pass14 at ahc3 bus 0 target 6 lun 0
pass14: <DELL 2x2 U2W SCSI BP 1.15> Fixed Processor SCSI-2 device
pass14: 3.300MB/s transfers
>Description:
After a while, during heavy disk I/O, the following appears:
(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): Queuing a BDR SCB
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): Queuing a BDR SCB
(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): no longer in timeout, status = 34b
ahc0: Issued Channel A Bus Reset. 7 SCBs aborted
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): no longer in timeout, status = 34b
ahc1: Issued Channel A Bus Reset. 7 SCBs aborted
And so on. At this point, all contact with the PowerVault is lost.
Of course, the ccd then freaks out with this:
ccd0: error 5 on component 0 block 80 (ccd block 64)
Notice that the error occurs on both buses at the same time! It can
take several hours before this happens, but we can reproduce it with
some patience and heavy load. The SCB numbers differ slightly from
occasion to occasion.
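
To check whether the timeouts really hit both buses together, the kernel
messages can be grouped per controller. The following is only a rough
sketch: the function name timeouts_by_controller and the regular
expression are illustrative assumptions, and the pattern is derived solely
from the log lines quoted above, so other message variants are not matched.

```python
import re
from collections import defaultdict

# Pattern assumed from the quoted messages, e.g.
# (da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
TIMEOUT_RE = re.compile(
    r"\((?P<dev>\w+):(?P<ctrl>ahc\d+):\d+:(?P<target>\d+):\d+\): "
    r"SCB (?P<scb>0x[0-9a-fA-F]+) - timed out"
)


def timeouts_by_controller(lines):
    """Group (device, target, SCB) tuples by controller (hypothetical helper)."""
    grouped = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            grouped[match.group("ctrl")].append(
                (match.group("dev"), int(match.group("target")), match.group("scb"))
            )
    return dict(grouped)


# The messages from this report, used as sample input.
SAMPLE = """\
(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): Queuing a BDR SCB
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): Queuing a BDR SCB
(da2:ahc0:0:0:0): SCB 0x33 - timed out while idle, SEQADDR == 0x157
(da2:ahc0:0:0:0): no longer in timeout, status = 34b
ahc0: Issued Channel A Bus Reset. 7 SCBs aborted
(da6:ahc1:0:8:0): SCB 0x7c - timed out while idle, SEQADDR == 0x157
(da6:ahc1:0:8:0): no longer in timeout, status = 34b
ahc1: Issued Channel A Bus Reset. 7 SCBs aborted
""".splitlines()

if __name__ == "__main__":
    for ctrl, events in sorted(timeouts_by_controller(SAMPLE).items()):
        print(ctrl, events)
```

If both ahc0 and ahc1 appear in the result for the same incident, the
failure is not confined to a single bus.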
This is what we have tried in order to pinpoint the cause:
- Replace all SCSI cables.
- Terminate the buses in the BIOS.
- Replace the AHA29160's with other AHA29160's.
- Replace the AHA29160's with AHA2940U2W's.
- Replace the internal PCI bus the cards plug into (PCI tray).
- Replace the ES Expander Modules in the PowerVault.
- Replace the PowerVault.
- Replace the PowerVault with a known good (and older revision) PowerVault
(we have several of these running on Dell PowerEdge 4350 with
3.3-STABLE on them). These older systems run fine.
- Test with 4.0-STABLE UP kernel.
- Test with 5.0-CURRENT UP kernel.
- Keep both external SCSI cards, but use only one of them.
- Remove one of the external SCSI cards, and use the internal 7899,
channel B, as well against the PowerVault (i.e. two buses against it).
- Run RedHat 6.2 with 2.2.14-5 kernel on the same system.
None of the above actions cured it. After some hours, it fails. Note that
the old PowerVault we tested from earlier systems contained other disks
(Seagate and Quantum), which work fine under 3.3-STABLE.
From this testing, we have these conclusions:
- There is nothing wrong with the PowerVault and the disk drives.
- There is nothing wrong with the SCSI cards.
We also have some success stories:
- Run the PowerVault from a single PCI card (i.e. remove the other).
- Run the PowerVault only from the internal 7899, channel B.
- linux-2.2.14-6.1.1 kernel (provided by Dell) with original HW setup.
- linux-2.2.15 kernel with original HW setup.
To me, it sounds like a PCI problem (or maybe in the RCC LE chip). It
could also be a problem in the AIC7xxx driver, but it even failed with
the AHA2940U2W cards (which work fine in our 3.3 systems). But I am
only guessing here. However, Linux has obviously found a fix.
>How-To-Repeat:
Access every disk in the system, and produce a lot of I/O. I open all
disk devices in raw mode and do a lot of random seeks and reads.
However, we have also experienced this error on mostly idle machines.
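
A load generator along these lines could look roughly like the sketch
below. It is not the program we actually used; the function name
random_read_stress, its parameters and the defaults are assumptions made
for illustration. It also assumes that seeking to the end of a device
reports its size, which may not hold on every platform, and it reads the
devices round-robin from a single process rather than in parallel.

```python
import os
import random


def random_read_stress(paths, block_size=512, reads_per_device=1000, seed=None):
    """Do block-aligned random reads on every path; return total bytes read.

    Hypothetical helper: names and defaults are illustrative only.
    """
    rng = random.Random(seed)
    fds = [os.open(path, os.O_RDONLY) for path in paths]
    total = 0
    try:
        # Assumption: lseek(SEEK_END) yields the size of the device or file.
        block_counts = [os.lseek(fd, 0, os.SEEK_END) // block_size for fd in fds]
        for _ in range(reads_per_device):
            # Touch every device in turn so all of them stay busy.
            for fd, nblocks in zip(fds, block_counts):
                if nblocks == 0:
                    continue  # size unknown or smaller than one block
                offset = rng.randrange(nblocks) * block_size
                total += len(os.pread(fd, block_size, offset))
    finally:
        for fd in fds:
            os.close(fd)
    return total
```

The function would be called with the list of raw disk device paths of
the machine under test and a large reads_per_device, and left running
for several hours.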
>Fix:
>Release-Note:
>Audit-Trail:
>Unformatted:
