From owner-freebsd-fs@FreeBSD.ORG  Tue Nov 16 08:47:34 2010
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 077BF106566C
	for <freebsd-fs@freebsd.org>; Tue, 16 Nov 2010 08:47:34 +0000 (UTC)
	(envelope-from jdc@koitsu.dyndns.org)
Received: from qmta05.emeryville.ca.mail.comcast.net
	(qmta05.emeryville.ca.mail.comcast.net [76.96.30.48])
	by mx1.freebsd.org (Postfix) with ESMTP id E1AAB8FC08
	for <freebsd-fs@freebsd.org>; Tue, 16 Nov 2010 08:47:33 +0000 (UTC)
Received: from omta10.emeryville.ca.mail.comcast.net ([76.96.30.28])
	by qmta05.emeryville.ca.mail.comcast.net with comcast
	id Xkgk1f0020cQ2SLA5knZBa; Tue, 16 Nov 2010 08:47:33 +0000
Received: from koitsu.dyndns.org ([98.248.41.155])
	by omta10.emeryville.ca.mail.comcast.net with comcast
	id XknY1f0043LrwQ28WknYBN; Tue, 16 Nov 2010 08:47:33 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
	id 704DB9B427; Tue, 16 Nov 2010 00:47:32 -0800 (PST)
Date: Tue, 16 Nov 2010 00:47:32 -0800
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Michael Boers <michaelscotttech@gmail.com>
Message-ID: <20101116084732.GA85887@icarus.home.lan>
References: <25DC6C26-52FB-447A-AEB0-8549DA8F53E7@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <25DC6C26-52FB-447A-AEB0-8549DA8F53E7@gmail.com>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs mirror recognizing disk failures
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 16 Nov 2010 08:47:34 -0000

On Mon, Nov 15, 2010 at 05:03:30PM -0500, Michael Boers wrote:
> Is there anything I can do to make a zfs mirror quicker to give up
> on a flaky disk?
> 
> I recently had a 100% zfs system crash when it started to have some
> disk errors.  I had hoped that by having a mirror, the system would
> survive this type of error.  Instead it just hung.
> 
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE
> CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): CAM Status: SCSI
> Status Error
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SCSI Status: Check
> Condition
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): ABORTED COMMAND
> asc:0,0
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): No additional
> sense information
> Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000
> (req->ccb 0xffffff0103acc000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c5110:2839 timed out for ccb 0xffffff035cab0800
> (req->ccb 0xffffff035cab0800)
> Nov 11 10:05:53 caprica kernel: mpt0: attempting to abort req
> 0xffffff80003c87a0:2838 function 0
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bef30:2840 timed out for ccb 0xffffff0007986800
> (req->ccb 0xffffff0007986800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003c8560:2841 timed out for ccb 0xffffff032d985000
> (req->ccb 0xffffff032d985000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bf320:2842 timed out for ccb 0xffffff0103af2000
> (req->ccb 0xffffff0103af2000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cbda0:2843 timed out for ccb 0xffffff0103b0b000
> (req->ccb 0xffffff0103b0b000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003bfd40:2844 timed out for ccb 0xffffff00102bf800
> (req->ccb 0xffffff00102bf800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003cad50:2845 timed out for ccb 0xffffff01e6f33000
> (req->ccb 0xffffff01e6f33000)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003caf00:2846 timed out for ccb 0xffffff01e6f24800
> (req->ccb 0xffffff01e6f24800)
> Nov 11 10:05:53 caprica kernel: mpt0: request
> 0xffffff80003ccd60:2847 timed out for ccb 0xffffff01308a4000
> (req->ccb 0xffffff01308a4000)
> 
> Is this a type of error zfs can survive or do I need a hardware
> mirror to handle this type of problem?

This looks to me like a problem/quirk with mpt(4), not ZFS.  What
happened after this point?  Did the mpt driver ever drop the disk off
the bus (in CAM)?  ZFS would notice when that happens.  So I suspect
either the mpt card or the driver.

My point: ZFS shouldn't be responsible for "figuring out if
communication with the disk is messed up" -- that's the job of the
storage controller and the storage controller driver.
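
To see how much of the trouble sits at the controller level rather than
in ZFS, it can help to tally the mpt request timeouts and the CAM
messages per disk from the log.  Below is a rough sketch of that, not a
vetted tool.  Assumptions: the two regular expressions are modeled only
on the excerpt you pasted (other driver versions may word these
messages differently), the function name summarize() is made up for
illustration, and the sample lines are your log lines with the mail
line-wrapping undone.

```python
import re
from collections import Counter

# Patterns modeled on the quoted log excerpt only (assumption: real
# messages may be worded differently on other driver versions).
TIMEOUT_RE = re.compile(
    r"kernel: (mpt\d+): request \S+ timed out for ccb (0x[0-9a-f]+)"
)
CAM_RE = re.compile(r"kernel: \((\w+):(mpt\d+):\d+:\d+:\d+\): (.+)$")


def summarize(lines):
    """Count controller request timeouts and collect CAM messages per disk.

    `lines` must be unwrapped syslog lines (one message per string).
    Returns (timeouts, cam_messages): a Counter keyed by controller name
    and a dict mapping disk name to its list of CAM message texts.
    """
    timeouts = Counter()
    cam_messages = {}
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            timeouts[m.group(1)] += 1
            continue
        m = CAM_RE.search(line)
        if m:
            cam_messages.setdefault(m.group(1), []).append(m.group(3))
    return timeouts, cam_messages


# Sample input: lines from the quoted log, rejoined after mail wrapping.
SAMPLE = [
    "Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0",
    "Nov 11 10:05:01 caprica kernel: (da2:mpt0:0:3:0): Retries Exhausted",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c87a0:2838 timed out for ccb 0xffffff0103acc000 (req->ccb 0xffffff0103acc000)",
    "Nov 11 10:05:53 caprica kernel: mpt0: request 0xffffff80003c5110:2839 timed out for ccb 0xffffff035cab0800 (req->ccb 0xffffff035cab0800)",
]

if __name__ == "__main__":
    timeouts, cam_messages = summarize(SAMPLE)
    for controller, count in timeouts.items():
        print(controller, "timed-out requests:", count)
    for disk, messages in cam_messages.items():
        print(disk, "CAM messages:", messages)
```

If the timeouts pile up on the controller while the disk never
disappears from CAM, that would fit the theory that the driver is not
giving up on the device, but the tally alone doesn't prove it.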

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |