Date:        Wed, 5 May 2010 08:17:07 -0700
From:        Jeremy Chadwick
To:          Harald Schmalzbauer
Cc:          FreeBSD Stable <freebsd-stable@freebsd.org>
Subject:     Re: ZFS (zpool) doesn't detect failed drive
Message-ID:  <20100505151707.GA68166@icarus.home.lan>
In-Reply-To: <4BE18729.3050209@omnilan.de>
References:  <4BE16784.8050400@omnilan.de> <4BE18729.3050209@omnilan.de>

On Wed, May 05, 2010 at 04:56:41PM +0200, Harald Schmalzbauer wrote:
> Harald Schmalzbauer wrote on 05.05.2010 14:41 (localtime):
> >Hello,
> >
> >one drive of my mirror failed today, but 'zpool status' shows it
> >"online". Every process using a ZFS mount hangs. Also 'zpool offline
> >/dev/ad1' hangs indefinitely.
> ...
> Sorry, I made an error with zpool create.  Somehow the little word
> "mirror" must have been lost, so the pool wasn't a mirror but a
> stripe.  Then of course I can't take one vdev offline.  Sorry for the
> noise.
> But I took the opportunity to do some tests with that failing drive
> and created a _real_ mirror.  That works without failures, but using
> the mirror again leads to:
>
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ad1: TIMEOUT - FLUSHCACHE48 retrying (1 retry left)
> ata3: port is not ready (timeout 10000ms) tfd = 00000080
> ata3: hardware reset timeout
> ad1: FAILURE - device detached
>
> Now zpool reports the vdev ad1 as still ONLINE although the device has
> been detached and 'atacontrol list' no longer shows it:
>
> zpool status
>   pool: URUBAmirrorP1
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are
>         unaffected.
> action: Determine if the device needs to be replaced, and clear the
>         errors using 'zpool clear' or replace the device with
>         'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: none requested
> config:
>
>         NAME             STATE     READ WRITE CKSUM
>         URUBAmirrorP1    ONLINE       0     0     0
>           mirror         ONLINE       0     0     0
>             ad1          ONLINE       3  302K     0
>             ad2          ONLINE       0     0     0
>
> errors: No known data errors
>
> atacontrol list
> ATA channel 2:
>     Master:  ad0 SATA revision 1.x
>     Slave:       no device present
> ATA channel 3:
>     Master:      no device present
>     Slave:       no device present
> ATA channel 4:
>     Master:  ad2 SATA revision 2.x
>     Slave:       no device present
> ATA channel 5:
>     Master:  ad3 SATA revision 1.x
>     Slave:       no device present
>
> How should such a failure be handled?
> Do I have to manually mark the drive offline for zpool?

You shouldn't have to; this should happen automatically when the
underlying device goes away.  GEOM should see the device gone, and ZFS
should therefore be marking the pool as DEGRADED and the ad1 disk as
FAULTED (or possibly OFFLINE).

Is AHCI in use + enabled (in the BIOS) on this system?  If not, I could
see this being a potential problem but have no idea where it should be
fixed.  If AHCI is available/in use, can you try using ahci_load="yes"
in /boot/loader.conf[1] to see if CAM handles this situation better?

Quick atacontrol <--> camcontrol conversion chart:

  atacontrol list   = camcontrol devlist
  atacontrol cap    = camcontrol identify
  atacontrol detach = not needed AFAIK (just yank the disk)
  atacontrol attach = may not be needed, but if the disk doesn't
                      reappear, try "camcontrol reset" or
                      "camcontrol rescan"

[1]: WARNING: this will change your device names from ad0->ada0,
ad1->ada1, etc., so you may have to boot single-user and fix /etc/fstab.
No need to mess with ZFS after the device naming changes; ZFS will taste
metadata on all disks attached and automatically load the pools (one
thing about ZFS I greatly appreciate. :-) )

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                      http://www.parodius.com/ |
| UNIX Systems Administrator                 Mountain View, CA, USA |
| Making life hard for others since 1977.           PGP: 4BD6C0CB |
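
A rough sketch of the loader.conf change from footnote [1] and the kind
of /etc/fstab edit it forces afterwards.  None of this comes from the
original exchange; the fstab lines are made-up examples, so substitute
whatever your real adN entries are:

  # /boot/loader.conf -- load ahci(4) so disks attach via CAM as adaN
  ahci_load="yes"

  # /etc/fstab -- every adN device becomes adaN after the switch,
  # e.g. /dev/ad0s1a turns into /dev/ada0s1a:
  /dev/ada0s1a    /       ufs     rw      1       1
  /dev/ada0s1b    none    swap    sw      0       0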
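
And a hypothetical camcontrol session matching the conversion chart
above, assuming the ahci(4) switch has already renamed ad1 to ada1:

  camcontrol devlist          # list attached devices (was: atacontrol list)
  camcontrol identify ada1    # drive/feature details (was: atacontrol cap ad1)
  camcontrol rescan all       # re-probe the buses if a swapped disk doesn't show up
  camcontrol reset all        # heavier hammer; follow with another rescan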
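
Finally, if it ever does come down to marking the dead vdev offline by
hand, the usual sequence is something like the following -- untested
against this exact failure (the offline step may well hang if the pool
is already wedged), with the pool and device names simply taken from
the output quoted earlier:

  zpool offline URUBAmirrorP1 ad1    # take the failed vdev out of service
  zpool status URUBAmirrorP1         # pool should now read DEGRADED
  # ...reattach or physically replace the disk, then resilver:
  zpool replace URUBAmirrorP1 ad1
  zpool clear URUBAmirrorP1          # reset the READ/WRITE/CKSUM counters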