From owner-freebsd-fs@FreeBSD.ORG Sun May 16 18:41:02 2010
From: Todd Wasson <tsw5@duke.edu>
To: freebsd-fs@freebsd.org
Date: Sun, 16 May 2010 14:26:19 -0400
Subject: zfs drive replacement issues

Hi everyone, I've run into some problems replacing a problematic drive in my pool, and am hopeful someone out there can shed some light on things for me, since reading previous threads and posts around the net hasn't helped me so far. The story goes like this: for a couple of years now (since 7.0-RC something) I've had a pool of four devices: two 400GB drives and two 400GB slices from 500GB drives. I've recently seen errors with one of the 400GB drives like this:

May 11 21:33:08 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=29369344
May 11 21:33:15 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=58819968
May 11 21:33:23 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=80378624
May 11 21:34:01 newmonkey root: ZFS: vdev I/O failure, zpool=tank path=/dev/ad6 offset=262144 size=8192 error=6
May 11 21:34:01 newmonkey kernel: ad6: FAILURE - device detached

...which also led to a bunch of IO errors showing up for that device in "zpool status" and prompted me to replace that drive. Since finding a 400GB drive was a pain, I decided to replace it with a 400GB slice from a new 500GB drive. This is when I made what I think was the first critical mistake: I forgot to "zpool offline" it before doing the replacement, so I just exported the pool, physically replaced the drive, made a 400GB slice on it with fdisk, and, noticing that the pool now referred to the old device by an ID number instead of its "ad6" identifier, did a "zpool replace tank 10022540361666252397 /dev/ad6s1".

This actually prompted a scrub for some reason, and not a resilver; I'm not sure why. However, I noticed that during the scrub I was seeing a lot of IO errors in "zpool status" on the new device (making me suspect that maybe the old drive wasn't bad after all, but I think I'll sort that out afterwards).
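
For what it's worth, my understanding of the sequence I should have used is roughly the sketch below (my device names, and only my best guess at the procedure; please correct me if I have the order wrong):

  # take the failing disk out of service before touching the hardware
  zpool offline tank ad6
  # ...power down, swap the physical drive, recreate the 400GB slice with fdisk...
  # then hand the new slice to ZFS and let it resilver
  zpool replace tank ad6 ad6s1
  zpool status -v tank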
Additionally, the device won't resilver, and now it's stuck in a constant state of "replacing". When I try to "zpool detach" or "zpool offline" either device (old or new) it says there isn't a replica and refuses. I've finally resorted to putting the original drive back in to try and make some progress, and now this is what my zpool status looks like:

  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1                  DEGRADED     0     0     0
            ad8                   ONLINE       0     0     0
            ad10s1                ONLINE       0     0     0
            ad12s1                ONLINE       0     0     0
            replacing             DEGRADED     0     0     8
              ad6                 ONLINE       0     0     0
              1667724779240260276 UNAVAIL      0   204     0  was /dev/ad6s1

When I do "zpool detach tank 1667724779240260276" it says "cannot detach 1667724779240260276: no valid replicas". It says the same thing for a "zpool offline tank 1667724779240260276". Note the IO errors in the new drive (which is now disconnected), which was ad6s1. It could be a bad controller, a bad cable, or any number of things, but I can't actually test it because I can't get rid of the device from the zfs pool.

So, does anyone have any suggestions? Can I cancel the "replacing" operation somehow? Do I have to buy a new device, back up the whole pool, delete it, and rebuild it? Any help is greatly appreciated!

Thanks!

Todd
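
P.S. If the answer really is "back it all up and rebuild", I assume the rough sequence would look something like the sketch below ("backup" is just a placeholder name for an empty scratch pool with enough space, and I'd have to check first that my ZFS version is new enough for "zfs send -R"; otherwise I'd send each filesystem individually):

  # snapshot the whole pool recursively
  zfs snapshot -r tank@migrate
  # copy everything onto the scratch pool (assumes "backup" is empty)
  zfs send -R tank@migrate | zfs receive -dF backup
  # destroy and recreate the raidz, then copy the data back
  zpool destroy tank
  zpool create tank raidz ad8 ad10s1 ad12s1 ad6s1
  zfs send -R backup@migrate | zfs receive -dF tank

Is that roughly right, or is there a less drastic way to back out of the stuck "replacing" state?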