From: Gareth de Vaux
To: stable@freebsd.org
Date: Sat, 27 Nov 2010 15:22:49 +0200
Subject: ZFS raidz recovery
Message-ID: <20101127132249.GA80611@lordcow.org>
List-Id: Production branch of FreeBSD source code

Hi all,

I'm trying to simulate a disk failure and replacement in a raidz array and
failing myself. What am I doing wrong?

Here's a transcript with interspersed commentary:

root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:20:06 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0

errors: No known data errors

root@file:~# zpool offline raid ad12

reboot
dd if=/dev/zero of=/dev/ad12 ..

root@file:~# zpool replace raid ad12
cannot replace ad12 with ad12: ad12 is busy
root@file:~# zpool replace -f raid ad12
cannot replace ad12 with ad12: ad12 is busy

The handbook suggests 'replace' but I guess this is only if the disk is
physically replaced and gets a new identifier? Trying with 'online':

root@file:~# zpool online raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Nov 27 13:29:14 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0  15.5K resilvered
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0

errors: No known data errors

Output remains as such, is this normal?

root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:37 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0 2.11K  87.7M repaired
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0

errors: No known data errors

root@file:~# zpool scrub raid
root@file:~# zpool status
  pool: raid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.
        Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:30:55 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0 2.11K
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0

errors: No known data errors

These are checksum errors? So the disk hasn't been integrated properly?

root@file:~# zpool clear raid ad12
root@file:~# zpool status
  pool: raid
 state: ONLINE
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:39:09 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad12    ONLINE       0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     0

errors: No known data errors

root@file:~# zpool status -x
all pools are healthy

To make sure this's the case I fail a different disk:

root@file:~# zpool offline raid ad6
root@file:~# zpool status
  pool: raid
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: scrub completed after 0h0m with 0 errors on Sat Nov 27 13:40:52 2010
config:

        NAME        STATE     READ WRITE CKSUM
        raid        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            ad12    ONLINE       0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     OFFLINE      0     0     0

errors: No known data errors

on reboot the status changes:

root@file:~# zpool status
  pool: raid
 state: FAULTED
status: The pool metadata is corrupted and the pool cannot be opened.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-72
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        raid        FAULTED      0     0     1  corrupted data
          raidz1    DEGRADED     0     0     6
            ad12    OFFLINE      0     0     0
            ad13    ONLINE       0     0     0
            ad4     ONLINE       0     0     0
            ad6     ONLINE       0     0     1

The same happens if I recreate the array and try again.
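
On reading the status tables in the transcript: the last numeric column, CKSUM, is the per-device checksum error count, so the 2.11K against ad12 after the scrub is a checksum error count. Below is a minimal sketch for pulling the non-zero CKSUM entries out of saved `zpool status` text. It assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown in the transcript. The function name `cksum_errors` is a hypothetical helper, not part of ZFS. It only reads text on stdin and does not touch the pool.

```shell
# cksum_errors: hypothetical helper, not a ZFS command.
# Reads `zpool status` text on stdin and prints "name count" for every
# row of the config table whose CKSUM (fifth) column is not 0.
# Assumes the NAME STATE READ WRITE CKSUM layout from the transcript.
cksum_errors() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $5 == "CKSUM" { in_config = 1; next }
        # The "errors:" summary line marks the end of the table.
        /^errors:/ { in_config = 0 }
        # Inside the table, report rows with a non-zero CKSUM field.
        in_config && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}
```

Typical use would be `zpool status raid | cksum_errors`. Pool and raidz rows are reported as well as disks, since they carry their own counters in the same column.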