From owner-freebsd-stable@FreeBSD.ORG Tue Jan 11 13:40:06 2005
Date: Tue, 11 Jan 2005 13:40:14 +0000
From: Tony Byrne
Organization: ByrneHQ
Message-ID: <1433078378.20050111134014@byrnehq.com>
To: freebsd-stable@freebsd.org
Subject: MegaRAID 'Bad Slot' Kernel message and crash.
List-Id: Production branch of FreeBSD source code

Folks,

I kicked off a thread just before the holidays regarding some problems we are having with an Intel SRCU42X RAID controller in a dual-processor production server, originally under 5.3-STABLE and now under 4.10-STABLE. The thread ran out of steam with no resolution to the problem, but I'm hoping that with extra information I might get to the bottom of it.

Basically, after some amount of uptime the kernel will emit an "amr0: Bad slot x completed" message, and pretty soon after this the box goes into a partially unresponsive state, forcing us to reboot it.
So far the only thing triggering the problem is the nightly jobs, where the amount of IO is higher than during the day. Before deployment, we tested the box with 5.3-STABLE and managed to trigger the problem twice. This forced us to try 4.10-STABLE, which was fine in testing and for a number of weeks after deployment. However, just before New Year we saw our first Bad Slot and crash under 4.10. Since then it has happened 3 more times.

We have upgraded the firmware to the latest version available from Intel, and if anything this has made the problem worse. We're beginning to suspect a dud card, but could do with a few "works fine for us" style posts to build confidence in the support for the card under FreeBSD. The amr driver doesn't explicitly support the card, but it's a rebadged MegaRAID 320 as far as we can tell. Scott Long has posted to say that he is seeing similar problems, but I'm wondering: if it really is a problem with the driver, wouldn't more of you be having problems?

The machine has 3 disks configured as a single RAID5 array. A fourth disk is configured as a hot-standby. The card is equipped with 128MB of battery-backed cache. Write-back caching is enabled on the card. Read-ahead caching is enabled in non-adaptive mode.

Is anyone else using an SRCU42X RAID card and seeing similar problems to ours? What about other cards supported by the amr driver?

We could just change the controller, but the problem we are having is pretty random and the feedback gap between change and outcome is long. We'd like to have more information to work with before deciding the next step.

uname -a

FreeBSD xxxxx 4.10-STABLE FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004

dmesg

Copyright (c) 1992-2004 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
    The Regents of the University of California. All rights reserved.
FreeBSD 4.10-STABLE #7: Tue Nov 16 12:50:42 GMT 2004
    dermot@pooh.traveldev.com:/usr/obj/usr/src/sys/POOH
Timecounter "i8254" frequency 1193182 Hz
CPU: Intel(R) Xeon(TM) CPU 3.20GHz (3189.72-MHz 686-class CPU)
  Origin = "GenuineIntel" Id = 0xf25 Stepping = 5
  Features=0xbfebfbff
  Hyperthreading: 2 logical CPUs
real memory = 4026466304 (3932096K bytes)
Programming 24 pins in IOAPIC #0
IOAPIC #0 intpin 2 -> irq 0
Programming 24 pins in IOAPIC #1
Programming 24 pins in IOAPIC #2
FreeBSD/SMP: Multiprocessor motherboard: 4 CPUs
 cpu0 (BSP): apic id: 0, version: 0x00050014, at 0xfee00000
 cpu1 (AP): apic id: 1, version: 0x00050014, at 0xfee00000
 cpu2 (AP): apic id: 6, version: 0x00050014, at 0xfee00000
 cpu3 (AP): apic id: 7, version: 0x00050014, at 0xfee00000
 io0 (APIC): apic id: 8, version: 0x00178020, at 0xfec00000
 io1 (APIC): apic id: 9, version: 0x00178020, at 0xfec81000
 io2 (APIC): apic id: 10, version: 0x00178020, at 0xfec81400
Preloaded elf kernel "kernel" at 0xc03cc000.
Preloaded userconfig_script "/boot/kernel.conf" at 0xc03cc09c.
Warning: Pentium 4 CPU: PSE disabled
Pentium Pro MTRR support enabled
md0: Malloc disk
Using $PIR table, 19 entries at 0xc00f3630
npx0: on motherboard
npx0: INT 16 interface
pcib0: on motherboard
IOAPIC #0 intpin 16 -> irq 2
IOAPIC #0 intpin 19 -> irq 16
pci0: on pcib0
pci0: (vendor=0x8086, dev=0x2541) at 0.1
pcib1: at device 3.0 on pci0
pci2: on pcib1
pci2: (vendor=0x8086, dev=0x1461) at 28.0
pcib2: at device 29.0 on pci2
IOAPIC #2 intpin 2 -> irq 18
IOAPIC #2 intpin 1 -> irq 19
pci5: on pcib2
ahd0: port 0x4000-0x40ff,0x3800-0x38ff mem 0xfe9e0000-0xfe9e1fff irq 18 at device 7.0 on pci5
aic7902: Ultra320 Wide Channel A, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
ahd1: port 0x3400-0x34ff,0x3000-0x30ff mem 0xfe9f0000-0xfe9f1fff irq 19 at device 7.1 on pci5
aic7902: Ultra320 Wide Channel B, SCSI Id=7, PCI-X 67-100Mhz, 512 SCBs
pci2: (vendor=0x8086, dev=0x1461) at 30.0
pcib3: at device 31.0 on pci2
IOAPIC #1 intpin 6 -> irq 20
IOAPIC #1 intpin 7 -> irq 21
pci3: on pcib3
em0: port 0x2040-0x207f mem 0xfe6c0000-0xfe6dffff irq 20 at device 7.0 on pci3
em0: Speed:N/A Duplex:N/A
em1: port 0x2000-0x203f mem 0xfe6e0000-0xfe6fffff irq 21 at device 7.1 on pci3
em1: Speed:N/A Duplex:N/A
pcib4: at device 9.0 on pci3
IOAPIC #1 intpin 3 -> irq 22
pci4: on pcib4
amr0: mem 0xfe580000-0xfe5fffff,0xfbef0000-0xfbefffff irq 22 at device 0.0 on pci4
amr0: Firmware 413Y, BIOS H420, 128MB RAM
pci0: (vendor=0x8086, dev=0x2546) at 3.1
uhci0: port 0x5020-0x503f irq 2 at device 29.0 on pci0
usb0: on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: port 0x5000-0x501f irq 16 at device 29.1 on pci0
usb1: on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
pcib5: at device 30.0 on pci0
pci1: on pcib5
pci1: at 12.0 irq 17
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x3a0-0x3af,0-0x3,0-0x7,0-0x3,0-0x7 irq 0 at device 31.1 on pci0
ata0: at 0x1f0 irq 14 on atapci0
ata1: at 0x170 irq 15 on atapci0
pci0: (vendor=0x8086, dev=0x2483) at 31.3 irq 17
orm0: