Date:        Wed, 5 May 2010 08:17:07 -0700
From:        Jeremy Chadwick
To:          Harald Schmalzbauer
Cc:          FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:     Re: ZFS (zpool) doesn't detect failed drive
Message-ID:  <20100505151707.GA68166@icarus.home.lan>
In-Reply-To: <4BE18729.3050209@omnilan.de>
References:  <4BE16784.8050400@omnilan.de> <4BE18729.3050209@omnilan.de>

On Wed, May 05, 2010 at 04:56:41PM +0200, Harald Schmalzbauer wrote:
> Harald Schmalzbauer wrote on 05.05.2010 14:41 (localtime):
> >Hello,
> >
> >one drive of my mirror failed today, but 'zpool status' shows it
> >"online". Every process using a ZFS mount hangs. Also 'zpool offline
> >/dev/ad1' hangs indefinitely.
> ...
> Sorry, I made an error with zpool create.  Somehow the little word
> "mirror" must have been lost, so the pool wasn't a mirror but a
> stripe.  Then of course I can't take one vdev offline.  Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive
> and created a _real_ mirror.  That works without failures, but using
> the mirror again leads to:
>
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
>
> Now zpool reports the vdev ad1 as still ONLINE although the device has
> been detached and 'atacontrol list' no longer shows it:
>
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with
>         'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         URUBAmirrorP1    ONLINE       0     0     0
>           mirror         ONLINE       0     0     0
>             ad1          ONLINE       3  302K     0
>             ad2          ONLINE       0     0     0
>
> errors: No known data errors
>
> atacontrol list
> ATA channel 2:
>     Master:  ad0 SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 SATA revision 1.x
>     Slave:       no device present
>
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?

You shouldn't have to; this should happen automatically when the
underlying device goes away.  GEOM should see the device gone, and ZFS
should therefore be marking the pool as DEGRADED and the ad1 disk as
FAULTED (or possibly OFFLINE).

Is AHCI in use + enabled (in the BIOS) on this system?  If not, I could
see this being a potential problem but have no idea where it should be
fixed.  If AHCI is available/in use, can you try using ahci_load="yes"
in /boot/loader.conf[1] to see if CAM handles this situation better?

Quick atacontrol <--> camcontrol conversion chart:

  atacontrol list   = camcontrol devlist
  atacontrol cap    = camcontrol identify
  atacontrol detach = not needed AFAIK (just yank the disk)
  atacontrol attach = may not be needed, but if the disk doesn't
                      reappear, try "camcontrol reset" or
                      "camcontrol rescan"

[1]: WARNING: this will change your device names from ad0->ada0,
ad1->ada1, etc., so you may have to boot single-user and fix /etc/fstab.
No need to mess with ZFS after the device naming changes; ZFS will taste
metadata on all disks attached and automatically load the pools (one
thing about ZFS I greatly appreciate. :-) )

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                      http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, USA |
| Making life hard for others since 1977.           PGP: 4BD6C0CB |
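
A rough sketch of the loader.conf change from footnote [1] and the kind
of /etc/fstab edit it forces afterwards.  None of this comes from the
original exchange; the fstab lines are made-up examples, so substitute
whatever your real adN entries are:

  # /boot/loader.conf -- load ahci(4) so disks attach via CAM as adaN
  ahci_load="yes"

  # /etc/fstab -- every adN device becomes adaN after the switch,
  # e.g. /dev/ad0s1a turns into /dev/ada0s1a:
  /dev/ada0s1a    /       ufs     rw      1       1
  /dev/ada0s1b    none    swap    sw      0       0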
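
And a hypothetical camcontrol session matching the conversion chart
above, assuming the ahci(4) switch has already renamed ad1 to ada1:

  camcontrol devlist          # list attached devices (was: atacontrol list)
  camcontrol identify ada1    # drive/feature details (was: atacontrol cap ad1)
  camcontrol rescan all       # re-probe the buses if a swapped disk doesn't show up
  camcontrol reset all        # heavier hammer; follow with another rescan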
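
Finally, if it ever does come down to marking the dead vdev offline by
hand, the usual sequence is something like the following -- untested
against this exact failure (the offline step may well hang if the pool
is already wedged), with the pool and device names simply taken from
the output quoted earlier:

  zpool offline URUBAmirrorP1 ad1    # take the failed vdev out of service
  zpool status URUBAmirrorP1         # pool should now read DEGRADED
  # ...reattach or physically replace the disk, then resilver:
  zpool replace URUBAmirrorP1 ad1
  zpool clear URUBAmirrorP1          # reset the READ/WRITE/CKSUM counters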