Date:      Sun, 16 May 2010 14:21:56 -0500 (CDT)
From:      Wes Morgan <morganw@chemikals.org>
To:        Todd Wasson <tsw5@duke.edu>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: zfs drive replacement issues
Message-ID:  <alpine.BSF.2.00.1005161352240.17439@ibyngvyr>
In-Reply-To: <0B97967D-1057-4414-BBD4-4F1AA2659A5D@duke.edu>
References:  <0B97967D-1057-4414-BBD4-4F1AA2659A5D@duke.edu>

On Sun, 16 May 2010, Todd Wasson wrote:

> Hi everyone, I've run into some problems replacing a problematic drive in my pool, and am hopeful someone out there can shed some light on things for me, since reading previous threads and posts around the net hasn't helped me so far.  The story goes like this: for a couple of years now (since 7.0-RC something) I've had a pool of four devices: two 400GB drives and two 400GB slices from 500GB drives.  I've recently seen errors with one of the 400GB drives like this:
>
> May 11 21:33:08 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=29369344
> May 11 21:33:15 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=58819968
> May 11 21:33:23 newmonkey kernel: ad6: TIMEOUT - READ_DMA retrying (1 retry left) LBA=80378624
> May 11 21:34:01 newmonkey root: ZFS: vdev I/O failure, zpool=tank path=/dev/ad6 offset=262144 size=8192 error=6
> May 11 21:34:01 newmonkey kernel: ad6: FAILURE - device detached
>
> ...which also led to a bunch of IO errors showing up for that device in
> "zpool status" and prompted me to replace that drive.  Since finding a
> 400GB drive was a pain, I decided to replace it with a 400GB slice from
> a new 500GB drive.  This is when I made what I think was the first
> critical mistake: I forgot to "zpool offline" it before doing the
> replacement, so I just exported the pool, physically replaced the drive,
> made a 400GB slice on it with fdisk, and, noticing that it now referred
> to the old device by an ID number instead of its "ad6" identifier, did a
> "zpool replace tank 10022540361666252397 /dev/ad6s1".
>
> This actually prompted a scrub for some reason, and not a resilver.
> I'm not sure why.  However, I noticed that during a scrub I was seeing a
> lot of IO errors in "zpool status" on the new device (making me suspect
> that maybe the old drive wasn't bad after all, but I think I'll sort
> that out afterwards).  Additionally, the device won't resilver, and now
> it's stuck in a constant state of "replacing".  When I try to "zpool
> detach" or "zpool offline" either device (old or new) it says there
> isn't a replica and refuses.  I've finally resorted to putting the
> original drive back in to try and make some progress, and now this is
> what my zpool status looks like:
>
>   pool: tank
>  state: DEGRADED
>  scrub: none requested
> config:
>
>         NAME                       STATE     READ WRITE CKSUM
>         tank                       DEGRADED     0     0     0
>           raidz1                   DEGRADED     0     0     0
>             ad8                    ONLINE       0     0     0
>             ad10s1                 ONLINE       0     0     0
>             ad12s1                 ONLINE       0     0     0
>             replacing              DEGRADED     0     0     8
>               ad6                  ONLINE       0     0     0
>               1667724779240260276  UNAVAIL      0   204     0  was /dev/ad6s1
>
> When I do "zpool detach tank 1667724779240260276" it says "cannot detach
> 1667724779240260276: no valid replicas".  It says the same thing for a
> "zpool offline tank 1667724779240260276".  Note the IO errors in the new
> drive (which is now disconnected), which was ad6s1.  It could be a bad
> controller, a bad cable, or any number of things, but I can't actually
> test it because I can't get rid of the device from the zfs pool.
>
> So, does anyone have any suggestions?  Can I cancel the "replacing"
> operation somehow?  Do I have to buy a new device, back up the whole
> pool, delete it, and rebuild it?  Any help is greatly appreciated!

Strange; you should be able to cancel the replacement with a detach. Some
kind of DTL (dirty time log) issue is preventing it. I don't know precisely
what it is, but there are known cases where legitimate detach operations
get refused.

Try reconnecting the disk you partitioned, leaving the original ad6 still
attached, and you should be able to detach the new one.
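
Something along those lines, very roughly (just a sketch; the GUID below is
the one from your "zpool status" output, so double-check it against fresh
output before running anything):

    # with the original ad6 back in and the repartitioned disk also attached
    zpool status tank                        # the "replacing" vdev should show both devices
    zpool detach tank 1667724779240260276    # drop the half-finished replacement
    zpool scrub tank                         # then verify the pool is healthy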

Are you planning on using the leftover space on the 500GB drive for
something else? If not, then don't bother partitioning it. Just give the
whole device to the pool and it will only use 400GB of it.
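
If you do go the whole-disk route, the swap would look roughly like this
(untested sketch; I'm assuming the new disk shows up under the same ad6
device name):

    zpool offline tank ad6      # take the failing member offline first this time
    # ...power down, physically swap the drive, boot...
    zpool replace tank ad6      # one-argument form: replace the device with itself
    zpool status tank           # watch the resilver run to completion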


