From: Jeremy Chadwick
To: Michael Boers
Cc: freebsd-fs@freebsd.org
Date: Tue, 16 Nov 2010 00:47:32 -0800
Subject: Re: zfs mirror recognizing disk failures
Message-ID: <20101116084732.GA85887@icarus.home.lan>
In-Reply-To: <25DC6C26-52FB-447A-AEB0-8549DA8F53E7@gmail.com>

On Mon, Nov 15, 2010 at 05:03:30PM -0500, Michael Boers wrote:
> Is there anything I can do to make a zfs mirror quicker to give up
> on a flaky disk?
>
> I recently had a 100% zfs system crash when started to have some
> disk errors. I had hoped that by having a mirror, the system would
> survive this type of error. Instead it just hung.
>
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE
> CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): CAM Status: SCSI
> Status Error
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SCSI Status: Check
> Condition
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): ABORTED COMMAND
> asc:0,0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): No additional
> sense information
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000
> (req->ccb 0xffffff0103acc000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c5110:2839 timed out for ccb 0xffffff035cab0800
> (req->ccb 0xffffff035cab0800)
> Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req
> 0xffffff80003c87a0:2838 function 0
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bef30:2840 timed out for ccb 0xffffff0007986800
> (req->ccb 0xffffff0007986800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c8560:2841 timed out for ccb 0xffffff032d985000
> (req->ccb 0xffffff032d985000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bf320:2842 timed out for ccb 0xffffff0103af2000
> (req->ccb 0xffffff0103af2000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cbda0:2843 timed out for ccb 0xffffff0103b0b000
> (req->ccb 0xffffff0103b0b000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bfd40:2844 timed out for ccb 0xffffff00102bf800
> (req->ccb 0xffffff00102bf800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cad50:2845 timed out for ccb 0xffffff01e6f33000
> (req->ccb 0xffffff01e6f33000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003caf00:2846 timed out for ccb 0xffffff01e6f24800
> (req->ccb 0xffffff01e6f24800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003ccd60:2847 timed out for ccb 0xffffff01308a4000
> (req->ccb 0xffffff01308a4000)
>
> Is this a type of error zfs can survive or do I need a hardware
> mirror to handle this type of problem?

This looks to me like a problem/quirk with mpt(4) and not ZFS. What
happened after this point? Didn't the mpt driver drop the disk off the
bus (in CAM)? ZFS would notice that when it happens.

So, I think this looks like a problem with either the mpt cards or the
driver. What I'm stating: ZFS shouldn't be responsible for "figuring
out if communication with the disk is messed up" -- that's the job of
the storage controller and the storage controller driver.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
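
For anyone trying to answer the "what happened after this point?"
question from their own logs, here is a minimal sketch of how the
mpt(4) timeout entries quoted above could be tallied per controller.
It assumes the log has been saved unwrapped (one entry per line, no
"> " quoting), and the names count_timeouts, TIMEOUT_RE and SAMPLE are
hypothetical, made up for this illustration. It only counts timeouts;
it cannot tell whether the driver ever dropped the disk in CAM.

```python
import re
from collections import Counter

# Matches mpt(4) timeout entries of the form quoted in this thread, e.g.
# "mpt0: request 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000"
TIMEOUT_RE = re.compile(
    r"\b(mpt\d+): request (0x[0-9a-f]+):(\d+) timed out for ccb (0x[0-9a-f]+)"
)


def count_timeouts(lines):
    """Return a Counter mapping controller name to timed-out request count."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few unwrapped entries copied from the log excerpt in this thread.
SAMPLE = [
    "Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 "
    "timed out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)",
    "Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req "
    "0xffffff80003c87a0:2838 function 0",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839 "
    "timed out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)",
]

if __name__ == "__main__":
    for controller, total in count_timeouts(SAMPLE).items():
        print(controller, total)
```

A burst of timeouts on one controller with no later entry showing the
device being removed would fit the picture described above: the
driver kept retrying rather than failing the disk, so ZFS had nothing
to react to.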