From owner-freebsd-fs@FreeBSD.ORG Sun May 16 19:22:03 2010
Date: Sun, 16 May 2010 14:21:56 -0500 (CDT)
From: Wes Morgan <morganw@chemikals.org>
To: Todd Wasson
Cc: freebsd-fs@freebsd.org
Subject: Re: zfs drive replacement issues
In-Reply-To: <0B97967D-1057-4414-BBD4-4F1AA2659A5D@duke.edu>
References: <0B97967D-1057-4414-BBD4-4F1AA2659A5D@duke.edu>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)

On Sun, 16 May 2010, Todd Wasson wrote:

> Hi everyone, I've run into some problems replacing a problematic drive
> in my pool, and am hopeful someone out there can shed some light on
> things for me, since reading previous threads and posts around the net
> hasn't helped me so far. The story goes like this: for a couple of
> years now (since 7.0-RC something) I've had a pool of four devices:
> two 400GB drives and two 400GB slices from 500GB drives. I've recently
> seen errors with one of the 400GB drives like this:
>
> May 11 21:33:08 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=29369344
> May 11 21:33:15 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=58819968
> May 11 21:33:23 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=80378624
> May 11 21:34:01 newmonkey root: ZFS: vdev I/O failure, zpool=tank path=/dev/ad6 offset=262144 size=8192 error=6
> May 11 21:34:01 newmonkey kernel: ad6: FAILURE - device detached
>
> ...which also led to a bunch of IO errors showing up for that device
> in "zpool status" and prompted me to replace that drive. Since finding
> a 400GB drive was a pain, I decided to replace it with a 400GB slice
> from a new 500GB drive. This is when I made what I think was the first
> critical mistake: I forgot to "zpool offline" it before doing the
> replacement, so I just exported the pool, physically replaced the
> drive, made a 400GB slice on it with fdisk, and, noticing that the
> pool now referred to the old device by an ID number instead of its
> "ad6" identifier, did a "zpool replace tank 10022540361666252397
> /dev/ad6s1".
>
> This actually prompted a scrub for some reason, and not a resilver.
> I'm not sure why.
> However, I noticed that during a scrub I was seeing a lot of IO errors
> in "zpool status" on the new device (making me suspect that maybe the
> old drive wasn't bad after all, but I think I'll sort that out
> afterwards). Additionally, the device won't resilver, and now it's
> stuck in a constant state of "replacing". When I try to "zpool detach"
> or "zpool offline" either device (old or new) it says there isn't a
> replica and refuses. I've finally resorted to putting the original
> drive back in to try and make some progress, and now this is what my
> zpool status looks like:
>
>   pool: tank
>  state: DEGRADED
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         tank                       DEGRADED     0     0     0
>           raidz1                   DEGRADED     0     0     0
>             ad8                    ONLINE       0     0     0
>             ad10s1                 ONLINE       0     0     0
>             ad12s1                 ONLINE       0     0     0
>             replacing              DEGRADED     0     0     8
>               ad6                  ONLINE       0     0     0
>               1667724779240260276  UNAVAIL      0   204     0  was /dev/ad6s1
>
> When I do "zpool detach tank 1667724779240260276" it says "cannot
> detach 1667724779240260276: no valid replicas". It says the same thing
> for a "zpool offline tank 1667724779240260276". Note the IO errors on
> the new drive (which is now disconnected), which was ad6s1. It could
> be a bad controller, a bad cable, or any number of things, but I can't
> actually test it because I can't get rid of the device from the zfs
> pool.
>
> So, does anyone have any suggestions? Can I cancel the "replacing"
> operation somehow? Do I have to buy a new device, back up the whole
> pool, delete it, and rebuild it? Any help is greatly appreciated!

Strange, you should be able to cancel the replacement with detach. Some
kind of DTL (dirty time log) issue is preventing it. I don't know
precisely what, but there are known cases where legitimate detach
operations are refused. Try reconnecting the disk you partitioned,
leaving the original ad6 still attached, and you should be able to
detach the new one.

Are you planning on using the leftover space on the 500GB drive for
something else? If not, then don't bother partitioning it. Just give
the whole device to the pool and it will only use 400GB of it.
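
To put that concretely: with both disks attached again, something along
these lines should clear the stuck replace and then redo it with the
whole disk (ad14 is only a placeholder here; substitute whatever device
name the new 500GB disk actually attaches as):

  # cancel the half-finished replacement by detaching the UNAVAIL vdev
  zpool detach tank 1667724779240260276

  # then start the replacement over, giving zfs the whole new disk
  zpool replace tank ad6 ad14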