Date: Tue, 16 Nov 2010 10:29:40 -0500
From: Michael Boers <michaelscotttech@gmail.com>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs mirror recognizing disk failures
Message-ID: <99CF1585-9D89-4F66-B85C-67EA30DD0BD9@gmail.com>
In-Reply-To: <20101116135838.GA91324@icarus.home.lan>
References: <25DC6C26-52FB-447A-AEB0-8549DA8F53E7@gmail.com> <AANLkTi=mqgjj%2BdWVvZKmUcZWPtZSF2wA=upYy%2B1dEhRe@mail.gmail.com> <441E3529-6178-404E-8A2D-2CF9BBC4170C@gmail.com> <20101116135838.GA91324@icarus.home.lan>
On Nov 16, 2010, at 8:58 AM, Jeremy Chadwick wrote:

> On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
>> To answer Jeremy's question of "what happened next?"
>>
>> The machine was not serving web requests
>> The machine was not responsive via ssh
>> The machine was pingable
>>
>> after waiting about 15 minutes, I used the ipmi protocol to power
>> down the machine.
>> When it came back up, I found the enclosed errors in the log.
>>
>> If I am following your comments correctly, the fault for this lies
>> in the mpt system not giving up which could either be a driver or a
>> firmware issue. Is that correct?
>>
>> How do I protect against that?
>
> The fault, in my opinion -- and I urge others (especially those familiar
> with the driver) to correct me, because I am often wrong -- lies
> either with the controller itself, or mpt(4), not truly "giving up"
> after repetitive errors. It could be a firmware bug/quirk, sure. It
> could be a lot of things, or a combination of things. I don't want to
> rule out anything.
>
> For example, at my workplace we use Solaris with Adaptec controllers,
> using a multitude of Fujitsu disks. Everything is SCSI-3. We regularly
> (at least once a week, usually more than that) see disk problems where
> either the disk falls off the bus unexpectedly, the drive itself
> "wedges" (resulting in the controller getting stuck in an infinite loop
> trying to talk to it) and won't unwedge without a full power-cycle (soft
> reset doesn't work), or in certain bad block circumstances the drive
> wedges long enough for the controller driver to break in a strange way
> (resulting in a system panic). Each situation appears to be different;
> there's definitely situations where the disk is responsible, others
> which look like the controller is responsible, and others which look
> like driver issues.
>
> I'm not familiar (read: have not used) mpt(4) controllers, but if my
> memory serves me right, people post about problems with them from time
> to time on FreeBSD. Each incident has to be addressed separately.
>
> If you're asking for a workaround or "what should I do", the solution is
> to either change controllers (read: avoid mpt(4)), or figure out how/why
> the disk became wedged (or if it even did in the first place).
>
> Your original post contains no useful information about the hardware
> itself (mpt handles many controllers yet we know not what model, we know
> nothing about disk da2, etc.). You're going to need to provide this.
> Relevant dmesg output, camcontrol devlist, camcontrol inquiry, and
> smartctl -a output for the disk would be useful (assuming the controller
> supports passthrough).

Thanks for the detailed response, it has given me some things to think
about.

You are right, I had not posted too much about the machine in question.
For those interested now or who may run across this in the archives, I
provide it now (edited and partially reconstructed from backups of the
log files):

The machine is a Dell PowerEdge 2970 with SAS 6/iR Integrated, x6 Backplane

Aug 24 05:40:41 caprica kernel: FreeBSD 8.0-RELEASE #0: Fri Jan 29 14:17:29 EST 2010
Aug 24 05:40:41 caprica kernel: CPU: Quad-Core AMD Opteron(tm) Processor 2387 (2793.03-MHz K8-class CPU)
Aug 24 05:40:41 caprica kernel: real memory = 17179869184 (16384 MB)
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
Aug 24 05:40:41 caprica kernel: FreeBSD/SMP: 1 package(s) x 4 core(s)
Aug 24 05:40:41 caprica kernel: mpt0: <LSILogic SAS/SATA Adapter> port 0xec00-0xecff mem 0xe9fec000-0xe9feffff,0xe9ff0000-0xe9ffffff irq 37 at device 0.0 on pci7
Aug 24 05:40:41 caprica kernel: mpt0: [ITHREAD]
Aug 24 05:40:41 caprica kernel: mpt0: MPI Version=1.5.18.0
Aug 24 05:40:41 caprica kernel: mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
Aug 24 05:40:41 caprica kernel: mpt0: 0 Active Volumes (2 Max)
Aug 24 05:40:41 caprica kernel: mpt0: 0 Hidden Drive Members (14 Max)
Aug 24 05:40:41 caprica kernel: ZFS filesystem version 13
Aug 24 05:40:41 caprica kernel: ZFS storage pool version 13
Aug 24 05:40:41 caprica kernel: Timecounters tick every 1.000 msec
Aug 24 05:40:41 caprica kernel: da0: <ATA WDC WD1602ABKS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da0: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da0: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)
Aug 24 05:40:41 caprica kernel: da1 at mpt0 bus 0 target 1 lun 0
Aug 24 05:40:41 caprica kernel: da1: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Aug 24 05:40:41 caprica kernel: da1: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: da1: Command Queueing enabled
Aug 24 05:40:41 caprica kernel: da1: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Aug 24 05:40:41 caprica kernel: ses0 at mpt0 bus 0 target 8 lun 0
Aug 24 05:40:41 caprica kernel: ses0: <DP BACKPLANE 1.05> Fixed Enclosure Services SCSI-5 device
Aug 24 05:40:41 caprica kernel: ses0: 300.000MB/s transfers
Aug 24 05:40:41 caprica kernel: ses0: SCSI-3 SES Device

added the mirror disks later

Oct 15 10:47:21 caprica kernel: da2 at mpt0 bus 0 target 3 lun 0
Oct 15 10:47:21 caprica kernel: da2: <ATA WDC WD5002ABYS-1 3B04> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da2: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da2: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da2: 476940MB (976773168 512 byte sectors: 255H 63S/T 60801C)
Oct 15 10:47:21 caprica kernel: da3 at mpt0 bus 0 target 2 lun 0
Oct 15 10:47:21 caprica kernel: da3: <ATA WDC WD1602ABKS-1 3B05> Fixed Direct Access SCSI-5 device
Oct 15 10:47:21 caprica kernel: da3: 300.000MB/s transfers
Oct 15 10:47:21 caprica kernel: da3: Command Queueing enabled
Oct 15 10:47:21 caprica kernel: da3: 152587MB (312500000 512 byte sectors: 255H 63S/T 19452C)

started getting the occasional error on da3 (did not realize until after
the crash. Now using swatch to check for mpt errors)

Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): UNIT ATTENTION asc: 29,0
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Power on, reset, or bus device reset occurred
Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): Retrying Command (per Sense Data)

Camcontrol output (partially reconstructed as the drives are currently
on my desk)

<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD5002ABYS-1 3B04> at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD1602ABKS-1 3B04> at scbus0 target 3 lun 0 (pass2,da3)
<DP BACKPLANE 1.05> at scbus0 target 8 lun 0 (ses0,pass4)

This is all I can provide at this time. I appreciate all of the help
provided thus far and in future. I am going to check into BIOS updates
for the SAS 6/iR and I am in the process of moving to 8.1 for better mpt
support.

Thanks, again

> Finally, be aware that trying to chase down a problem of this nature is
> often time-consuming. Sometimes it's not worth it at all, and instead
> better spent replacing all of the hardware involved. If it happens
> again after that, change vendors or hardware controllers (or disks)
> used. That's just how it goes. I tend to stick to Intel ICHxx or ESB
> SATA controllers for this reason; they're well-tested on FreeBSD. And I
> don't use hardware RAID at all for many reasons (separate topic).
>
> --
> | Jeremy Chadwick                                   jdc@parodius.com |
> | Parodius Networking                       http://www.parodius.com/ |
> | UNIX Systems Administrator                  Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |
>
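
For readers of the archive: the message mentions watching the kernel log for mpt errors with swatch. The following is a minimal Python sketch of that kind of check, not the actual swatch configuration used. It assumes the `(daN:mptN:bus:target:lun):` prefix seen in the log excerpt above and treats each `CAM Status:` line as one error event; the function name `count_cam_errors` and that counting rule are hypothetical choices for illustration.

```python
import re
from collections import Counter

# Matches the CAM peripheral prefix seen in the excerpt,
# e.g. "(da3:mpt0:0:2:0): CAM Status: SCSI Status Error".
CAM_LINE = re.compile(
    r"\((?P<dev>da\d+):(?P<ctrl>mpt\d+):\d+:\d+:\d+\): (?P<msg>.*)$"
)


def count_cam_errors(lines):
    """Count error events per disk behind an mpt controller.

    Assumption: every line whose message starts with "CAM Status:"
    marks exactly one error event; the other lines of the same report
    (CDB, sense data, retry notice) are ignored.
    """
    counts = Counter()
    for line in lines:
        match = CAM_LINE.search(line)
        if match and match.group("msg").startswith("CAM Status:"):
            counts[match.group("dev")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): WRITE(10). CDB: 2a 0 2 4 58 a2 0 0 80 0",
        "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): CAM Status: SCSI Status Error",
        "Oct 18 03:43:58 caprica kernel: (da3:mpt0:0:2:0): SCSI Status: Check Condition",
    ]
    for dev, n in count_cam_errors(sample).most_common():
        print(f"{dev}: {n}")
```

Run periodically against the system log, a per-disk count like this could flag a drive such as da3 before it wedges the controller; what count is worth an alert is left to the operator.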