Date: Tue, 16 Nov 2010 05:58:38 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Michael Boers <michaelscotttech@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs mirror recognizing disk failures
Message-ID: <20101116135838.GA91324@icarus.home.lan>
In-Reply-To: <441E3529-6178-404E-8A2D-2CF9BBC4170C@gmail.com>
References: <25DC6C26-52FB-447A-AEB0-8549DA8F53E7@gmail.com> <AANLkTi=mqgjj%2BdWVvZKmUcZWPtZSF2wA=upYy%2B1dEhRe@mail.gmail.com> <441E3529-6178-404E-8A2D-2CF9BBC4170C@gmail.com>

On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
> To answer Jermey's question of "what happened next?"
>
> The machine was not serving web requests
> The machine was not responsive via ssh
> The machine was pingable
>
> after waiting about 15 minutes, I used the ipmi protocol to power
> down the machine.
> When it came back up, I found the enclosed errors in the log.
>
> If I am following your comments correctly, the fault for this lies
> in the mpt system not giving up which could either be a driver or a
> firmware issue.  Is that correct?
>
> How do I protect against that?

The fault, in my opinion (and I urge others, especially those familiar
with the driver, to correct me, because I am often wrong), lies either
with the controller itself or with mpt(4) not truly "giving up" after
repeated errors.  It could be a firmware bug/quirk, sure.  It could be
a lot of things, or a combination of things.  I don't want to rule out
anything.

For example, at my workplace we use Solaris with Adaptec controllers
and a multitude of Fujitsu disks.  Everything is SCSI-3.  We regularly
(at least once a week, usually more than that) see disk problems of
three kinds:

- the disk falls off the bus unexpectedly;
- the drive itself "wedges" (resulting in the controller getting stuck
  in an infinite loop trying to talk to it) and won't unwedge without a
  full power-cycle (a soft reset doesn't work);
- in certain bad block circumstances, the drive wedges long enough for
  the controller driver to break in a strange way (resulting in a
  system panic).

Each situation appears to be different; there are definitely situations
where the disk is responsible, others which look like the controller is
responsible, and others which look like driver issues.

I'm not familiar with (read: have not used) mpt(4) controllers, but if
my memory serves me right, people post about problems with them from
time to time on the FreeBSD lists.  Each incident has to be addressed
separately.

If you're asking for a workaround or "what should I do", the solution
is to either change controllers (read: avoid mpt(4)), or figure out
how/why the disk became wedged (or if it even did in the first place).

Your original post contains no useful information about the hardware
itself (mpt handles many controllers, yet we don't know which model; we
know nothing about disk da2; etc.).  You're going to need to provide
this.  Relevant dmesg output, camcontrol devlist, camcontrol inquiry,
and smartctl -a output for the disk would be useful (assuming the
controller supports passthrough).

Finally, be aware that trying to chase down a problem of this nature is
often time-consuming.  Sometimes it's not worth it at all, and the time
is better spent replacing all of the hardware involved.  If it happens
again after that, change vendors or hardware controllers (or disks)
used.  That's just how it goes.

I tend to stick to Intel ICHxx or ESB SATA controllers for this reason;
they're well-tested on FreeBSD.  And I don't use hardware RAID at all
for many reasons (separate topic).

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
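
The diagnostics requested in the message (dmesg, camcontrol devlist, camcontrol inquiry, and smartctl -a) could be gathered in one pass with something like the sketch below. This is only an illustration, not part of the original message: the function names are hypothetical, the device name da2 is taken from the thread, smartctl is assumed to be installed from smartmontools, and most of these commands will likely need root. Tools that are not installed are reported rather than treated as fatal, so a partial report can still be posted.

```python
import shutil
import subprocess

# Device name taken from the thread (da2); adjust for your system.
DEVICE = "da2"


def diagnostic_commands(device):
    """Return the commands requested in the message, as argument lists."""
    return [
        ["dmesg"],
        ["camcontrol", "devlist"],
        ["camcontrol", "inquiry", device],
        ["smartctl", "-a", f"/dev/{device}"],
    ]


def run_all(commands):
    """Run each command and return (command line, output) pairs.

    A tool that is not on PATH is reported as such instead of raising.
    """
    results = []
    for cmd in commands:
        line = " ".join(cmd)
        if shutil.which(cmd[0]) is None:
            results.append((line, "(command not found)"))
            continue
        proc = subprocess.run(cmd, capture_output=True, text=True)
        # Keep stderr too: controller/passthrough errors usually land there.
        results.append((line, proc.stdout + proc.stderr))
    return results


if __name__ == "__main__":
    for line, output in run_all(diagnostic_commands(DEVICE)):
        print(f"### {line}\n{output}")
```

Redirecting the script's output to a file gives a single report that can be attached to a follow-up post. If smartctl fails here, that in itself suggests the controller does not support passthrough.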