From: Jeremy Chadwick
To: Michael Boers
Cc: freebsd-fs@freebsd.org
Date: Tue, 16 Nov 2010 05:58:38 -0800
Subject: Re: zfs mirror recognizing disk failures
Message-ID: <20101116135838.GA91324@icarus.home.lan>

On Tue, Nov 16, 2010 at 08:32:35AM -0500, Michael Boers wrote:
> To answer Jeremy's question of "what happened next?"
>
> The machine was not serving web requests
> The machine was not responsive via ssh
> The machine was pingable
>
> after waiting about 15 minutes, I used the ipmi protocol to power
> down the machine.
> When it came back up, I found the enclosed errors in the log.
>
> If I am following your comments correctly, the fault for this lies
> in the mpt system not giving up which could either be a driver or a
> firmware issue.  Is that correct?
>
> How do I protect against that?

The fault, in my opinion (and I urge others, especially those familiar
with the driver, to correct me, because I am often wrong), lies either
with the controller itself or with mpt(4) not truly "giving up" after
repeated errors.  It could be a firmware bug/quirk, sure.  It could be
a lot of things, or a combination of things.  I don't want to rule out
anything.

For example, at my workplace we use Solaris with Adaptec controllers,
using a multitude of Fujitsu disks.  Everything is SCSI-3.  We regularly
(at least once a week, usually more than that) see disk problems where
either the disk falls off the bus unexpectedly, the drive itself
"wedges" (resulting in the controller getting stuck in an infinite loop
trying to talk to it) and won't unwedge without a full power-cycle (soft
reset doesn't work), or in certain bad block circumstances the drive
wedges long enough for the controller driver to break in a strange way
(resulting in a system panic).

Each situation appears to be different; there are definitely situations
where the disk is responsible, others which look like the controller is
responsible, and others which look like driver issues.

I'm not familiar with (read: have not used) mpt(4) controllers, but if
my memory serves me right, people post about problems with them from
time to time on FreeBSD.  Each incident has to be addressed separately.

If you're asking for a workaround or "what should I do", the solution is
to either change controllers (read: avoid mpt(4)), or figure out how/why
the disk became wedged (or if it even did in the first place).

Your original post contains no useful information about the hardware
itself (mpt(4) handles many controllers, yet we don't know which model,
we know nothing about disk da2, etc.).  You're going to need to provide
this.  Relevant dmesg output, camcontrol devlist, camcontrol inquiry,
and smartctl -a output for the disk would be useful (assuming the
controller supports passthrough).

Finally, be aware that trying to chase down a problem of this nature is
often time-consuming.  Sometimes it's not worth it at all, and the time
is better spent replacing all of the hardware involved.  If it happens
again after that, change vendors, or the controllers (or disks) used.
That's just how it goes.

I tend to stick to Intel ICHxx or ESB SATA controllers for this reason;
they're well-tested on FreeBSD.  And I don't use hardware RAID at all
for many reasons (separate topic).

--
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
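
P.S. A rough sketch of how the information asked for above (dmesg,
camcontrol devlist, camcontrol inquiry, smartctl -a) could be gathered
into one report to paste into a reply.  Treat it as an illustration
only: the function names (build_commands, run_command, collect) are
made up for this example, da2 is simply the disk named in this thread,
smartctl is assumed to be installed from smartmontools, and the
30-second timeout is an arbitrary guess.  The camcontrol and smartctl
commands will most likely need root.

```python
#!/usr/bin/env python3
"""Gather the diagnostics requested in this thread into one text report."""
import shutil
import subprocess


def build_commands(disk="da2"):
    """Return the diagnostic commands for one disk (da2 is the disk from the thread)."""
    return [
        ["dmesg"],
        ["camcontrol", "devlist"],
        ["camcontrol", "inquiry", disk],
        ["smartctl", "-a", f"/dev/{disk}"],  # assumes smartmontools is installed
    ]


def run_command(cmd, timeout=30):
    """Run one command and return its output, or a note saying why it did not run."""
    if shutil.which(cmd[0]) is None:
        return f"(skipped: {cmd[0]} not found in PATH)\n"
    try:
        # Best-effort timeout: a command talking to a wedged disk may not return.
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return f"(timed out after {timeout}s)\n"
    except OSError as exc:
        return f"(failed to start: {exc})\n"
    # Keep stderr as well; error messages are part of the evidence.
    return result.stdout + result.stderr


def collect(disk="da2"):
    """Run every diagnostic command and join the outputs under labelled headers."""
    sections = []
    for cmd in build_commands(disk):
        sections.append("==> " + " ".join(cmd) + "\n" + run_command(cmd))
    return "\n".join(sections)


if __name__ == "__main__":
    print(collect())
```

Redirect the output to a file and attach it.  If smartctl reports that
it cannot talk to the disk, that in itself is worth mentioning, since
it may mean the controller does not support passthrough.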