From owner-freebsd-stable@FreeBSD.ORG Sat Nov 27 15:30:25 2010
Delivered-To: stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id CA3811065670;
    Sat, 27 Nov 2010 15:30:25 +0000 (UTC)
    (envelope-from jdc@koitsu.dyndns.org)
Received: from qmta06.westchester.pa.mail.comcast.net
    (qmta06.westchester.pa.mail.comcast.net [76.96.62.56])
    by mx1.freebsd.org (Postfix) with ESMTP id 77E828FC1A;
    Sat, 27 Nov 2010 15:30:24 +0000 (UTC)
Received: from omta10.westchester.pa.mail.comcast.net ([76.96.62.28])
    by qmta06.westchester.pa.mail.comcast.net with comcast
    id cFUh1f0050cZkys56FWR4r; Sat, 27 Nov 2010 15:30:25 +0000
Received: from koitsu.dyndns.org ([98.248.41.155])
    by omta10.westchester.pa.mail.comcast.net with comcast
    id cFWP1f0073LrwQ23WFWPpj; Sat, 27 Nov 2010 15:30:25 +0000
Received: by icarus.home.lan (Postfix, from userid 1000)
    id EF13E9B422; Sat, 27 Nov 2010 07:30:21 -0800 (PST)
Date: Sat, 27 Nov 2010 07:30:21 -0800
From: Jeremy Chadwick
To: Gareth de Vaux
Message-ID: <20101127153021.GA2788@icarus.home.lan>
References: <20101127132249.GA80611@lordcow.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20101127132249.GA80611@lordcow.org>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: stable@freebsd.org
Subject: Re: ZFS raidz recovery
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Sat, 27 Nov 2010 15:30:25 -0000

On Sat, Nov 27, 2010 at 03:22:49PM +0200, Gareth de Vaux wrote:
> Hi all, I'm trying to simulate a disk fail and replacement in
> a raidz array and failing myself. What'm I doing wrong?
> Here's a transcript with interspersed commentary:
>
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:20:06 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>
> errors: No known data errors
> root@file:~# zpool offline raid ad12
>
> reboot
> dd if=/dev/zero of=/dev/ad12 ..
>
> root@file:~# zpool replace raid ad12
> cannot replace ad12 with ad12: ad12 is busy
> root@file:~# zpool replace -f raid ad12
> cannot replace ad12 with ad12: ad12 is busy
>
> The handbook suggests 'replace' but I guess this is only
> if the disk is physically replaced and gets a new identifier?
> Trying with 'online':
>
> root@file:~# zpool online raid ad12
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: resilver completed after 0h0m with 0 errors on Sat Nov 27 13:29:14 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0  15.5K resilvered
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>
> errors: No known data errors
>
> Output remains as such, is this normal?
>
> root@file:~# zpool scrub raid
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:37 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0 2.11K  87.7M repaired
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>
> errors: No known data errors
> root@file:~# zpool scrub raid
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
>         attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
>         using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:55 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0 2.11K
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>
> errors: No known data errors
>
> These are checksum errors? So the disk hasn't been integrated
> properly?
>
> root@file:~# zpool clear raid ad12
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:39:09 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        ONLINE       0     0     0
>           raidz1    ONLINE       0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     0
>
> errors: No known data errors
> root@file:~# zpool status -x
> all pools are healthy
>
> To make sure this's the case I fail a different disk:
>
> root@file:~# zpool offline raid ad6
> root@file:~# zpool status
>   pool: raid
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Online the device using 'zpool online' or replace the device with
>         'zpool replace'.
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:40:52 2010
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        DEGRADED     0     0     0
>           raidz1    DEGRADED     0     0     0
>             ad12    ONLINE       0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     OFFLINE      0     0     0
>
> errors: No known data errors
>
> on reboot the status changes:
>
> root@file:~# zpool status
>   pool: raid
>  state: FAULTED
> status: The pool metadata is corrupted and the pool cannot be opened.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-72
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         raid        FAULTED      0     0     1  corrupted data
>           raidz1    DEGRADED     0     0     6
>             ad12    OFFLINE      0     0     0
>             ad13    ONLINE       0     0     0
>             ad4     ONLINE       0     0     0
>             ad6     ONLINE       0     0     1
>
>
> The same happens if I recreate the array and try again.

uname -a please -- it matters greatly.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |