Date:      Sat, 27 Nov 2010 07:30:21 -0800
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Gareth de Vaux <bsd@lordcow.org>
Cc:        stable@freebsd.org
Subject:   Re: ZFS raidz recovery
Message-ID:  <20101127153021.GA2788@icarus.home.lan>
In-Reply-To: <20101127132249.GA80611@lordcow.org>
References:  <20101127132249.GA80611@lordcow.org>

On Sat, Nov 27, 2010 at 03:22:49PM +0200, Gareth de Vaux wrote:
> Hi all, I'm trying to simulate a disk failure and replacement in
> a raidz array and failing myself. What am I doing wrong? Here's
> a transcript with interspersed commentary:
> 
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:20:06 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0     0
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 
> errors: No known data errors
> root@file:~# zpool offline raid ad12
> 
> reboot
> dd if=/dev/zero of=/dev/ad12 ..
> 
> root@file:~# zpool replace raid ad12
> cannot replace ad12 with ad12: ad12 is busy
> root@file:~# zpool replace -f raid ad12
> cannot replace ad12 with ad12: ad12 is busy
> 
> 	The handbook suggests 'replace' but I guess this is only
> 	if the disk is physically replaced and gets a new identifier?
> 	Trying with 'online':
> 
> root@file:~# zpool online raid ad12
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: resilver completed after 0h0m with 0 errors on Sat Nov 27 13:29:14 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0     0  15.5K resilvered
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> 	Output remains as such, is this normal?
> 
> root@file:~# zpool scrub raid
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
> 	attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> 	using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:37 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0 2.11K  87.7M repaired
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 
> errors: No known data errors
> root@file:~# zpool scrub raid
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
> status: One or more devices has experienced an unrecoverable error.  An
> 	attempt was made to correct the error.  Applications are unaffected.
> action: Determine if the device needs to be replaced, and clear the errors
> 	using 'zpool clear' or replace the device with 'zpool replace'.
>    see: http://www.sun.com/msg/ZFS-8000-9P
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:55 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0 2.11K
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 
> errors: No known data errors
> 
> 	These are checksum errors? So the disk hasn't been integrated
> 	properly?
> 
> root@file:~# zpool clear raid ad12
> root@file:~# zpool status
>   pool: raid
>  state: ONLINE
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:39:09 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        ONLINE       0     0     0
> 	  raidz1    ONLINE       0     0     0
> 	    ad12    ONLINE       0     0     0
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     0
> 
> errors: No known data errors
> root@file:~# zpool status -x
> all pools are healthy
> 
> 	To make sure this's the case I fail a different disk:
> 
> root@file:~# zpool offline raid ad6
> root@file:~# zpool status   
>   pool: raid
>  state: DEGRADED
> status: One or more devices has been taken offline by the administrator.
> 	Sufficient replicas exist for the pool to continue functioning in a
> 	degraded state.
> action: Online the device using 'zpool online' or replace the device with
> 	'zpool replace'.
>  scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:40:52 2010
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        DEGRADED     0     0     0
> 	  raidz1    DEGRADED     0     0     0
> 	    ad12    ONLINE       0     0     0
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     OFFLINE      0     0     0
> 
> errors: No known data errors
> 
> 	on reboot the status changes:
> 
> root@file:~# zpool status
>   pool: raid
>  state: FAULTED
> status: The pool metadata is corrupted and the pool cannot be opened.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-72
>  scrub: none requested
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	raid        FAULTED      0     0     1  corrupted data
> 	  raidz1    DEGRADED     0     0     6
> 	    ad12    OFFLINE      0     0     0
> 	    ad13    ONLINE       0     0     0
> 	    ad4     ONLINE       0     0     0
> 	    ad6     ONLINE       0     0     1
> 
> 
> The same happens if I recreate the array and try again.

uname -a please -- it matters greatly.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



