From owner-freebsd-fs@FreeBSD.ORG Sun May 16 18:41:02 2010
From: Todd Wasson <tsw5@duke.edu>
To: freebsd-fs@freebsd.org
Date: Sun, 16 May 2010 14:26:19 -0400
Subject: zfs drive replacement issues

Hi everyone, I've run into some problems replacing a problematic drive in my pool, and am hopeful someone out there can shed some light on things for me, since reading previous threads and posts around the net hasn't helped me so far. The story goes like this: for a couple of years now (since 7.0-RC something) I've had a pool of four devices: two 400GB drives and two 400GB slices from 500GB drives. I've recently seen errors with one of the 400GB drives like this:

May 11 21:33:08 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=29369344
May 11 21:33:15 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=58819968
May 11 21:33:23 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=80378624
May 11 21:34:01 newmonkey root: ZFS: vdev I/O failure, zpool=tank path=/dev/ad6 offset=262144 size=8192 error=6
May 11 21:34:01 newmonkey kernel: ad6: FAILURE - device detached

...which also led to a bunch of IO errors showing up for that device in "zpool status" and prompted me to replace that drive. Since finding a 400GB drive was a pain, I decided to replace it with a 400GB slice from a new 500GB drive. This is when I made what I think was the first critical mistake: I forgot to "zpool offline" it before doing the replacement, so I just exported the pool, physically replaced the drive, made a 400GB slice on it with fdisk, and, noticing that the pool now referred to the old device by an ID number instead of its "ad6" identifier, did a "zpool replace tank 10022540361666252397 /dev/ad6s1".

This actually prompted a scrub for some reason, and not a resilver; I'm not sure why. However, I noticed that during the scrub I was seeing a lot of IO errors in "zpool status" on the new device (making me suspect that maybe the old drive wasn't bad after all, but I think I'll sort that out afterwards).
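
For what it's worth, my understanding of the sequence I should have used is roughly the sketch below (my device names, and only my best guess at the procedure; please correct me if I have the order wrong):

  # take the failing disk out of service before touching the hardware
  zpool offline tank ad6
  # ...power down, swap the physical drive, recreate the 400GB slice with fdisk...
  # then hand the new slice to ZFS and let it resilver
  zpool replace tank ad6 ad6s1
  zpool status -v tank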
Additionally, the device won't resilver, and now it's stuck in a constant state of "replacing". When I try to "zpool detach" or "zpool offline" either device (old or new) it says there isn't a replica and refuses. I've finally resorted to putting the original drive back in to try and make some progress, and now this is what my zpool status looks like:

  pool: tank
 state: DEGRADED
 scrub: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        tank                      DEGRADED     0     0     0
          raidz1                  DEGRADED     0     0     0
            ad8                   ONLINE       0     0     0
            ad10s1                ONLINE       0     0     0
            ad12s1                ONLINE       0     0     0
            replacing             DEGRADED     0     0     8
              ad6                 ONLINE       0     0     0
              1667724779240260276 UNAVAIL      0   204     0  was /dev/ad6s1

When I do "zpool detach tank 1667724779240260276" it says "cannot detach 1667724779240260276: no valid replicas". It says the same thing for a "zpool offline tank 1667724779240260276". Note the IO errors in the new drive (which is now disconnected), which was ad6s1. It could be a bad controller, a bad cable, or any number of things, but I can't actually test it because I can't get rid of the device from the zfs pool.

So, does anyone have any suggestions? Can I cancel the "replacing" operation somehow? Do I have to buy a new device, back up the whole pool, delete it, and rebuild it? Any help is greatly appreciated!

Thanks!

Todd
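
P.S. If the answer really is "back it all up and rebuild", I assume the rough sequence would look something like the sketch below ("backup" is just a placeholder name for an empty scratch pool with enough space, and I'd have to check first that my ZFS version is new enough for "zfs send -R"; otherwise I'd send each filesystem individually):

  # snapshot the whole pool recursively
  zfs snapshot -r tank@migrate
  # copy everything onto the scratch pool (assumes "backup" is empty)
  zfs send -R tank@migrate | zfs receive -dF backup
  # destroy and recreate the raidz, then copy the data back
  zpool destroy tank
  zpool create tank raidz ad8 ad10s1 ad12s1 ad6s1
  zfs send -R backup@migrate | zfs receive -dF tank

Is that roughly right, or is there a less drastic way to back out of the stuck "replacing" state?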