Date: Sat, 27 Nov 2010 15:22:49 +0200
From: Gareth de Vaux <bsd@lordcow.org>
To: stable@freebsd.org
Message-ID: <20101127132249.GA80611@lordcow.org>
Subject: ZFS raidz recovery

Hi all, I'm trying to simulate a disk failure and replacement in
a raidz array and failing myself. What am I doing wrong? Here's
a transcript with interspersed commentary:

root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:20:06 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool offline raid ad12

reboot
dd if=/dev/zero of=/dev/ad12 ..

root@file:~# zpool replace raid ad12
cannot replace ad12 with ad12: ad12 is busy
root@file:~# zpool replace -f raid ad12
cannot replace ad12 with ad12: ad12 is busy

	The handbook suggests 'replace' but I guess this is only
	if the disk is physically replaced and gets a new identifier?
	Trying with 'online':

root@file:~# zpool online raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 27 13:29:14 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0  15.5K resilvered
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors

	The output stays like this. Is that normal?

root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:37 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0 2.11K  87.7M repaired
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:55 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0 2.11K
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors

	These are checksum errors? So the disk hasn't been integrated
	properly?

root@file:~# zpool clear raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:39:09 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        ONLINE       0     0     0
	  raidz1    ONLINE       0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     0

errors: No known data errors
root@file:~# zpool status -x
all pools are healthy

	To make sure this is the case, I fail a different disk:

root@file:~# zpool offline raid ad6
root@file:~# zpool status
  pool: raid
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:40:52 2010
config:

	NAME        STATE     READ WRITE CKSUM
	raid        DEGRADED     0     0     0
	  raidz1    DEGRADED     0     0     0
	    ad12    ONLINE       0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     OFFLINE      0     0     0

errors: No known data errors

	On reboot the status changes:

root@file:~# zpool status
  pool: raid
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	raid        FAULTED      0     0     1  corrupted data
	  raidz1    DEGRADED     0     0     6
	    ad12    OFFLINE      0     0     0
	    ad13    ONLINE       0     0     0
	    ad4     ONLINE       0     0     0
	    ad6     ONLINE       0     0     1


The same happens if I recreate the array and try again.
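
For anyone repeating the test, here is a minimal sketch of a filter that
picks out the devices with non-zero READ, WRITE or CKSUM counters from
'zpool status' output, so the per-step results are easier to compare.
The function name nonzero_errors is made up for illustration, and the
filter assumes the five-column config layout shown in the transcripts
above. Counts such as "2.11K" are compared as text and not converted.

```shell
#!/bin/sh
# Hypothetical helper: list vdevs whose READ, WRITE or CKSUM column is
# not "0". Reads 'zpool status' text on stdin.
# Assumption: the config table starts at the "NAME STATE ..." header
# line and ends at the next blank line, as in the output above.
nonzero_errors() {
    awk '
        # Header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line marks the end of the config table.
        in_config && NF == 0          { in_config = 0 }
        # Inside the table: report rows with any non-zero counter.
        in_config && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, $3, $4, $5
        }
    '
}
```

Usage would be 'zpool status raid | nonzero_errors', which prints one
line per affected device in the form "name read write cksum" and
nothing at all when every counter is zero.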