Date:      Tue, 19 Oct 2010 16:30:41 +0100
From:      Karl Pielorz <kpielorz_lst@tdx.co.uk>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: ZFS 'read-only' device / pool scan / import?
Message-ID:  <7BEF90D9F4D4CB985F3573C3@HexaDeca64.dmpriest.net.uk>
In-Reply-To: <20101019151602.GA61733@icarus.home.lan>
References:  <AE519076FDEA1259C5DEA689@HexaDeca64.dmpriest.net.uk> <20101019151602.GA61733@icarus.home.lan>


--On 19 October 2010 08:16 -0700 Jeremy Chadwick <freebsd@jdc.parodius.com> 
wrote:

> Experts here might be able to help, but you're really going to need to
> provide every little detail, in chronological order.  What commands were
> done, what output was seen, what physical actions took place, etc..
>
> 1) Restoring from backups is probably your best bet (IMHO; this is what I
> would do as well).

I didn't provide much detail, as there isn't much detail left to provide 
(the pool has been destroyed / rebuilt). How it got messed up is almost 
certainly a case of human error combined with a controller 'oddity' around 
failed devices [which is now suitably noted for that machine!]...

It was more a 'for future reference' kind of question: does attempting to 
import a pool (or even running something as simple as a 'zpool status' when 
ZFS has not been 'loaded') actually write to the disks? i.e. could it cause 
a pool that is currently 'messed up' to become permanently 'messed up', 
because ZFS will change metadata on the pool if, 'at the time', it deems 
devices to be faulted / corrupt etc.? And, if it does, is there any way of 
doing a 'test mount/import' (i.e. with the underlying devices only being 
opened read-only), or does ZFS [as I suspect] *need* r/w access to those 
devices as part of the work to actually import/mount?
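[For future readers: a hedged sketch of how this can be done on newer ZFS 
releases (pool version 28 / OpenZFS), which support importing a pool 
read-only. This wasn't available on the FreeBSD ZFS versions current at the 
time of this thread; 'vol' is the pool name from this thread, and the 
mount point is arbitrary:]

```shell
# List importable pools without importing anything.
# This only reads the on-disk labels; it does not modify them.
zpool import

# Import the pool read-only, under an alternate root, so no pool
# metadata is rewritten on disk while you inspect it:
zpool import -o readonly=on -R /mnt vol

# When finished inspecting, export it again:
zpool export vol
```

A read-only import like this is the safest first step on a damaged pool, 
since a normal (read-write) import will update labels and uberblocks as 
part of bringing the pool online.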

> There's a lot of other things I could add to the item list here
> (probably reach 9 or 10 if I tried), but in general the above sounds
> like its what happened.  raidz2 would have been able to save you in this
> situation, but would require at least 4 disks.

It was RAIDZ2 - it got totally screwed:

"
    vol         UNAVAIL      0     0     0  insufficient replicas
          raidz2    UNAVAIL      0     0     0  insufficient replicas
            da3     FAULTED      0     0     0  corrupted data
            da4     FAULTED      0     0     0  corrupted data
            da5     FAULTED      0     0     0  corrupted data
            da6     FAULTED      0     0     0  corrupted data
            da7     FAULTED      0     0     0  corrupted data
            da8     FAULTED      0     0     0  corrupted data
          raidz2    UNAVAIL      0     0     0  insufficient replicas
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da9     FAULTED      0     0     0  corrupted data
            da10    FAULTED      0     0     0  corrupted data
            da11    FAULTED      0     0     0  corrupted data
            da11    ONLINE       0     0     0
"


As there is such a large aspect of human error (and controller behaviour), 
I don't think it's worth digging into any deeper. It's the first pool we've 
ever "lost" under ZFS, and, like I said, with the combination of the 
controller collapsing devices and humans replacing the wrong disks, 'twas 
doomed to fail from the start.

We've replaced failed drives on this system before - but never rebooted 
after a failure and before a replacement - and never replaced the wrong 
drive :)

Definitely a good advert for backups though :)

-Karl


