From: Dmitry Marakasov <amdmi3@amdmi3.ru>
To: freebsd-fs@freebsd.org
Date: Tue, 25 Sep 2007 04:02:54 +0400
Subject: Shooting yourself in the foot with ZFS is quite easy
Message-ID: <20070925000254.GA35310@hades.panopticon>

Hi!

I'm just playing with ZFS in qemu, and I think I've found a bug in the
logic which can lead to a shoot-yourself-in-the-foot condition that
could be avoided.

First of all, I constructed a raidz array:

---
# mdconfig -a -tswap -s64m
md0
# mdconfig -a -tswap -s64m
md1
# mdconfig -a -tswap -s64m
md2
# zpool create pool raidz md{0,1,2}
---

Next, I brought one of the devices offline and overwrote part of it.
Let's imagine that I needed some space in an emergency.

---
# zpool offline pool md0
Bringing device md0 offline
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        DEGRADED     0     0     0
          raidz1    DEGRADED     0     0     0
            md0     OFFLINE      0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
# dd if=/dev/zero of=/dev/md0 bs=1m count=1
1+0 records in
1+0 records out
1048576 bytes transferred in 0.084011 secs (12481402 bytes/sec)
---

Now, how do I put md0 back into the pool? `zpool online pool md0' seems
reasonable, and the pool would recover itself on scrub, but I'm
paranoid and I want the data on md0 to be recreated completely. But:

---
# zpool replace pool md0
cannot replace md0 with md0: md0 is busy
# zpool replace -f pool md0
cannot replace md0 with md0: md0 is busy
---

It seems that zpool looks at the on-disk data (the remains of ZFS
labels) and thinks the disk is still used in the pool, because if I
erase the whole device with dd, it treats md0 as a new disk and
replaces it without problems:

---
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 10.154127 secs (6609023 bytes/sec)
# zpool replace pool md0
# zpool status
...
        NAME           STATE     READ WRITE CKSUM
        pool           DEGRADED     0     0     0
          raidz1       DEGRADED     0     0     0
            replacing  DEGRADED     0     0     0
              md0/old  OFFLINE      0     0     0
              md0      ONLINE       0     0     0
            md1        ONLINE       0     0     0
            md2        ONLINE       0     0     0
...
# zpool status
...
        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     ONLINE       0     0     0
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0
...
---
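As a side note, as far as I understand the on-disk format, ZFS keeps
four 256 KB vdev labels, two at the very beginning and two at the very
end of the device, so instead of dd'ing the whole disk it should be
enough to zero just those areas. A rough, untested sketch (the device
name is just an example):

---
#!/bin/sh
# Untested sketch: wipe only the ZFS label areas (two 256 KB labels at
# the start of the vdev and two at the end) instead of the whole device.
disk=/dev/md0                                 # device to clear
bytes=$(diskinfo ${disk} | awk '{print $3}')  # media size in bytes
blocks=$((bytes / 262144))                    # device size in 256 KB blocks
dd if=/dev/zero of=${disk} bs=256k count=2                        # front labels
dd if=/dev/zero of=${disk} bs=256k count=2 oseek=$((blocks - 2))  # end labels
---

That only makes the "clear the disk" step faster, of course; it doesn't
help with the `md0 is busy' refusal itself.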
This behaviour is, I think, undesirable, and one should be able to
replace an offline device with itself at any time. What is worse:

---
# zpool offline pool md0
Bringing device md0 offline
# dd if=/dev/zero of=/dev/md0 bs=1m
dd: /dev/md0: end of device
65+0 records in
64+0 records out
67108864 bytes transferred in 8.076568 secs (8309082 bytes/sec)
# zpool online pool md0
Bringing device md0 online
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:21:49 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool replace pool md0
invalid vdev specification
use '-f' to override the following errors:
md0 is in use (r1w1e1)
# zpool replace -f pool md0
invalid vdev specification
the following errors must be manually repaired:
md0 is in use (r1w1e1)
# zpool scrub pool
# zpool status
  pool: pool
 state: ONLINE
status: One or more devices could not be used because the label is
        missing or invalid.  Sufficient replicas exist for the pool to
        continue functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed with 0 errors on Mon Sep 24 23:22:22 2007
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            md0     UNAVAIL      0     0     0  corrupted data
            md1     ONLINE       0     0     0
            md2     ONLINE       0     0     0

errors: No known data errors
# zpool offline md0
missing device name
usage:
        offline [-t] <pool> <device> ...
# zpool offline pool md0
cannot offline md0: no valid replicas
# mdconfig -du0
mdconfig: ioctl(/dev/mdctl): Device busy
---

This is very confusing: md0 is UNAVAIL, but the config table says the
pool is ONLINE (not DEGRADED!), even though the status text says it is
degraded. Still, I can neither bring the device offline nor replace it
with itself (though replacing it with an identically-sized md3 worked).

My opinion is that such a situation should be avoided. First of all,
zpool's behaviour with one of the disks in the UNAVAIL state looks like
a clear bug (the array is shown as ONLINE, the unavailable device
cannot be brought offline, etc.). Also, ZFS should not trust any
on-disk contents after bringing a disk online; the best solution would
be to completely recreate the ZFS data structures on the disk in that
case. This should solve all of these cases:

1) `zpool replace <pool> <device>' won't say that the offline disk is
   busy.
2) One won't need to clear the disk with dd to recreate it.
3) `zpool online <pool> <device>' won't lead to the UNAVAIL state.
4) I think there could be more potential problems with the current
   behaviour: for example, what happens if I replace a disk in a raidz
   with another disk that was used in a different raidz before?

As I understand it, `zpool offline'/`zpool online' on a disk currently
leads to its resilvering anyway?

-- 
Best regards,
Dmitry Marakasov   mailto:amdmi3@amdmi3.ru