Date: Sun, 17 Mar 2002 19:48:54 -0600
From: "Douglas K. Rand" <rand@meridian-enviro.com>
To: freebsd-stable@freebsd.org, freebsd-hardware@freebsd.org
Cc: bryanh@meridian-enviro.com
Subject: 3Ware, Western Digital disks, and stray interrupts
Message-ID: <87wuwave61.wl@delta.meridian-enviro.com>

We have two pretty much identical systems: Both are Tyan Tiger MP S2469 boards with a 3ware 7450 controller and Western Digital WD1000 100GB disks. One system has 4 disks in a RAID 10 configuration, and the other has 2 disks in a RAID 1 configuration. One system only has a single Athlon MP CPU, while the other has 2 Athlon MP CPUs.

We have gone through 5 of the WD1000 disks so far, with a 6th that just failed the other day. The first 3 failures we tested with Western Digital's drive fitness test, which reported all three drives to be OK. The first disk that failed we tried to put back in and have the 3ware controller rebuild, but the rebuild failed after 2 hours. We've stopped testing the disks and now just send them back to Western Digital.

All the failures have been drive timeouts:

Dec 29 16:55:31 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive error for unknown unit 2>
Jan 1 23:36:04 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive timeout for unknown unit 3>
Feb 22 18:19:21 doppler[kern.crit] /kernel: twe0: AEN: <twe0: drive timeout for unknown unit 1>
Mar 7 20:18:44 vault[kern.crit] /kernel: twe0: AEN: <twed0: drive timeout>
Mar 16 21:42:02 vault[kern.crit] /kernel: twe0: AEN: <twed0: drive timeout>

The last two messages were somewhat massaged by me; more on that later...

So, the first question: Has anybody else seen such a horrible failure rate with the WD1000 disks?

The other problem we are having, which /may/ be related, is that the second system (vault, the single CPU box) has had 2 failures that coincide with a spate of "stray irq 7" messages.
We are using swatch to watch for the twe messages, but the two failures on vault have had the kernel log mixed with the stray irq 7 messages:

Mar 7 20:18:44 vault[kern.crit] /kernel: t
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: we0
Mar 7 20:18:44 vault[kern.err] /kernel: stray irq 7
Mar 7 20:18:44 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar 7 20:18:44 vault[kern.crit] /kernel: : AEN: <twed0: drive timeout>

Mar 16 21:42:02 vault[kern.crit] /kernel: tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: e0:
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: A
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: EN: <tw
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: ed0
Mar 16 21:42:02 vault[kern.err] /kernel: stray irq 7
Mar 16 21:42:02 vault[kern.crit] /kernel: too many stray irq 7's; not logging any more
Mar 16 21:42:02 vault[kern.crit] /kernel: : drive timeout>

In both cases, there aren't any kernel logs for 2 hours on either side of this message. We have the parallel port disabled in the BIOS, and after the last failure we took irq 7 away from the PCI and PnP devices. (None of the previous dmesg output for the system reports any device using irq 7.) I've put the current dmesg at the end.

So, is the 3ware controller causing the stray irq 7 messages when the disk fails, or are the stray irq 7 messages causing the 3ware controller to time out the disk?

Any help would be appreciated. Pretty soon Western Digital is gonna stop taking our phone calls. Either that, or we'll lose 2 disks before we get the first one fixed. ;^)

Copyright (c) 1992-2002 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 4.5-RELEASE #1: Wed Feb 13 17:10:19 CST 2002
    rand@vault.meridian-enviro.com:/usr/obj/usr/src/sys/VAULT
Timecounter "i8254" frequency 1193182 Hz
Timecounter "TSC" frequency 1400054127 Hz
CPU: AMD Athlon(tm) MP Processor 1600+ (1400.05-MHz 686-class CPU)
  Origin = "AuthenticAMD" Id = 0x662 Stepping = 2
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc0480000<<b19>,AMIE,DSP,3DNow!>
real memory = 268435456 (262144K bytes)
avail memory = 258568192 (252508K bytes)
Preloaded elf kernel "kernel" at 0xc02a8000.
Pentium Pro MTRR support enabled
Using $PIR table, 268435454 entries at 0xc00fdf10
npx0: <math processor> on motherboard
npx0: INT 16 interface
pcib0: <Host to PCI bridge> on motherboard
pci0: <PCI bus> on pcib0
pcib1: <PCI to PCI bridge (vendor=1022 device=700d)> at device 1.0 on pci0
pci1: <PCI bus> on pcib1
pci1: <Number Nine model 5348 graphics accelerator> at 5.0 irq 10
isab0: <PCI to ISA bridge (vendor=1022 device=7410)> at device 7.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <AMD 766 ATA100 controller> port 0xf000-0xf00f at device 7.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
chip1: <PCI to Other bridge (vendor=1022 device=7413)> at device 7.3 on pci0
twe0: <3ware Storage Controller> port 0x1430-0x143f mem 0xf4000000-0xf47fffff,0xf4901000-0xf490100f irq 5 at device 8.0 on pci0
twe0: 4 ports, Firmware FE7X 1.03.09.027, BIOS BE7X 1.07.02.002
fxp0: <Intel Pro 10/100B/100+ Ethernet> port 0x1400-0x141f mem 0xf4800000-0xf48fffff,0xf4903000-0xf4903fff irq 5 at device 12.0 on pci0
fxp0: Ethernet address 00:90:27:18:d7:45
inphy0: <i82555 10/100 media interface> on miibus0
inphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
ahc0: <Adaptec 2940 Ultra SCSI adapter> port 0x1000-0x10ff mem 0xf4900000-0xf4900fff irq 11 at device 13.0 on pci0
aic7880: Ultra Wide Channel A, SCSI Id=7, 16/255 SCBs
orm0: <Option ROMs> at iomem 0xc0000-0xc7fff,0xc8000-0xc8fff,0xc9800-0xc9fff on isa0
fdc0: <NEC 72065B or clone> at port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on isa0
fdc0: FIFO enabled, 8 bytes threshold
fd0: <1440-KB 3.5" drive> on fdc0 drive 0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 16550A, console
sio1 at port 0x2f8-0x2ff irq 3 on isa0
sio1: type 16550A
Waiting 2 seconds for SCSI devices to settle
twed0: <TwinStor, Rebuilding> on twe0
twed0: 95395MB (195369520 sectors)
twe0: command interrupt
sa0 at ahc0 bus 0 target 6 lun 0
sa0: <HP C1557A U812> Removable Sequential Access SCSI-2 device
sa0: 10.000MB/s transfers (10.000MHz, offset 15)
Mounting root from ufs:/dev/twed0s1a
ch0 at ahc0 bus 0 target 6 lun 1
ch0: <HP C1557A U812> Removable Changer SCSI-2 device
ch0: 10.000MB/s transfers (10.000MHz, offset 15)
ch0: 6 slots, 1 drive, 0 pickers, 0 portals
twe0: AEN: <twed0: rebuild started>
twe0: AEN: <twed0: rebuild done>
ch: warning: could not map element source address 0d to a valid element type
pid 3332 (db_metar), uid 7002: exited on signal 11 (core dumped)
ch: warning: could not map element source address 0d to a valid element type
ch: warning: could not map element source address 0d to a valid element type
tw
stray irq 7
e0:
stray irq 7
A
stray irq 7
EN: <tw
stray irq 7
ed0
stray irq 7
too many stray irq 7's; not logging any more
: drive timeout>
ch: warning: could not map element source address 0d to a valid element type

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message