Date:      Sun, 10 Mar 2019 12:29:26 +1100
From:      Michelle Sullivan <michelle@sorbs.net>
To:        Ben RUBSON <ben.rubson@gmail.com>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>, Stefan Esser <se@freebsd.org>
Subject:   Re: ZFS pool faulted (corrupt metadata) but the disk data appears ok...
Message-ID:  <3be04f0b-bded-9b77-896b-631824a14c4a@sorbs.net>
In-Reply-To: <5eb35692-37ab-33bf-aea1-9f4aa61bb7f7@sorbs.net>
References:  <54D3E9F6.20702@sorbs.net> <54D41608.50306@delphij.net> <54D41AAA.6070303@sorbs.net> <54D41C52.1020003@delphij.net> <54D424F0.9080301@sorbs.net> <54D47F94.9020404@freebsd.org> <54D4A552.7050502@sorbs.net> <54D4BB5A.30409@freebsd.org> <54D8B3D8.6000804@sorbs.net> <54D8CECE.60909@freebsd.org> <54D8D4A1.9090106@sorbs.net> <54D8D5DE.4040906@sentex.net> <54D8D92C.6030705@sorbs.net> <54D8E189.40201@sorbs.net> <54D924DD.4000205@sorbs.net> <54DCAC29.8000301@sorbs.net> <9c995251-45f1-cf27-c4c8-30a4bd0f163c@sorbs.net> <8282375D-5DDC-4294-A69C-03E9450D9575@gmail.com> <73dd7026-534e-7212-a037-0cbf62a61acd@sorbs.net> <FAB7C3BA-057F-4AB4-96E1-5C3208BABBA7@gmail.com> <027070fb-f7b5-3862-3a52-c0f280ab46d1@sorbs.net> <42C31457-1A84-4CCA-BF14-357F1F3177DA@gmail.com> <5eb35692-37ab-33bf-aea1-9f4aa61bb7f7@sorbs.net>

Michelle Sullivan wrote:
> Ben RUBSON wrote:
>> On 02 Feb 2018 21:48, Michelle Sullivan wrote:
>>
>>> Ben RUBSON wrote:
>>>
>>>> So disks died because of the carrier, as I assume the second 
>>>> unscathed server was OK...
>>>
>>> Pretty much.
>>>
>>>> Heads must have scratched the platters, but they should have been 
>>>> parked, so... Really strange.
>>>
>>> You'd have thought... though 2 of the drives look like they had wear
>>> and tear issues (the 2 not showing red lights) that just weren't picked
>>> up by the periodic scrub....  Could be that the recovery showed that
>>> one up... you know - how you can have an array working fine, but one
>>> disk dies and then others fail during the rebuild because of the extra
>>> workload.
>>
>> Yes... To try to mitigate this, when I add a new vdev to a pool, I 
>> spread the new disks I have among the existing vdevs, and construct 
>> the new vdev with the remaining new disk(s) + other disks retrieved 
>> from the other vdevs. Thus, when possible, I avoid vdevs whose disks 
>> all have the same runtime.
>> However, I only use mirrors; applying this with raid-Z could be a 
>> little bit more tricky...
>>
> Believe it or not...
>
> # zpool status -v
>   pool: VirtualDisks
>  state: ONLINE
> status: One or more devices are configured to use a non-native block size.
>     Expect reduced performance.
> action: Replace affected devices with devices that support the
>     configured block size, or migrate data to a properly configured
>     pool.
>   scan: none requested
> config:
>
>     NAME                       STATE     READ WRITE CKSUM
>     VirtualDisks               ONLINE       0     0     0
>       zvol/sorbs/VirtualDisks  ONLINE       0     0     0  block size: 512B configured, 8192B native
>
> errors: No known data errors
>
>   pool: sorbs
>  state: ONLINE
>   scan: resilvered 2.38T in 307445734561816429h29m with 0 errors on Sat Aug 26 09:26:53 2017
> config:
>
>     NAME                  STATE     READ WRITE CKSUM
>     sorbs                 ONLINE       0     0     0
>       raidz2-0            ONLINE       0     0     0
>         mfid0             ONLINE       0     0     0
>         mfid1             ONLINE       0     0     0
>         mfid7             ONLINE       0     0     0
>         mfid8             ONLINE       0     0     0
>         mfid12            ONLINE       0     0     0
>         mfid10            ONLINE       0     0     0
>         mfid14            ONLINE       0     0     0
>         mfid11            ONLINE       0     0     0
>         mfid6             ONLINE       0     0     0
>         mfid15            ONLINE       0     0     0
>         mfid2             ONLINE       0     0     0
>         mfid3             ONLINE       0     0     0
>         spare-12          ONLINE       0     0     3
>           mfid13          ONLINE       0     0     0
>           mfid9           ONLINE       0     0     0
>         mfid4             ONLINE       0     0     0
>         mfid5             ONLINE       0     0     0
>     spares
>       185579620420611382  INUSE     was /dev/mfid9
>
> errors: No known data errors
>
>
> It would appear that when I replaced the damaged drives, it picked one 
> of them up as still being resilvered from back in August (before the 
> server was packed up to go), and that was why it saw 'corrupted 
> metadata' and spent the last 3 weeks importing - it rebuilt the vdev 
> as it was importing. No data loss that I can determine. (It literally 
> just finished in the middle of the night here.)
>
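
As a side note, here is a rough sketch of the disk-spreading approach Ben
describes in the quoted exchange above.  The pool name "tank" and the da*
device names are made up purely for illustration: a new disk first replaces
an old one in an existing mirror, and the freed old disk is then paired with
the remaining new disk to form the added vdev, so no vdev is built entirely
from disks of the same age.

  # assumed starting point: mirror-0 = da0 + da1 (old disks);
  # da2 and da3 are the two newly purchased disks
  zpool replace tank da1 da2      # da2 resilvers into mirror-0, freeing da1
  # once that resilver completes, add the new vdev as new disk + old disk
  # (zpool labelclear da1, or -f, may be needed if an old label remains):
  zpool add tank mirror da3 da1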

And back to this little nutmeg...

We had a fire last night... and the same pool was resilvering again, 
and the metadata got corrupted.  "zpool import -fFX" worked and it 
started rebuilding, but in the early hours, when the pool was roughly 
50% rebuilt/resilvered (one vdev), there was at least one more issue on 
the power line... the UPSs gave out after multiple hits, and now I 
can't get the pool imported.  The server was in single-user mode, 
booted from a FreeBSD 12 USB stick, so it was only resilvering at the 
time.  "zdb -AAA -L -uhdi -FX -e storage" returns sanely...
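
For reference, a rough annotation of the two commands mentioned in this
paragraph, with the flag meanings as I read them from zdb(8) and zpool(8);
the pool name "storage" is taken from the message above.

  # read-only sanity walk of the exported pool:
  zdb -AAA -L -uhdi -FX -e storage
  #   -e            operate on an exported (not currently imported) pool
  #   -u -h -d -i   dump uberblocks, pool history, datasets and intent logs
  #   -L            skip leak detection and space map loading
  #   -AAA -FX      ignore assertions and allow extreme transaction rewind

  # the import attempt that is failing:
  zpool import -fFX storage
  #   -f  force the import
  #   -F  recovery mode: discard the last few transactions if needed
  #   -X  extreme rewind: try progressively older txgs (used with -F)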

Does anyone have any thoughts on how I might get the data back / the 
pool to import?  ("zpool import -fFX storage" spends a long time 
working and eventually comes back saying it is unable to import because 
one or more of the vdevs are unavailable - however, as far as I can 
tell, they are all there.)
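
A minimal sketch of one way to verify that claim (not in the original
message): read the on-disk labels of each member device with zdb, then try a
gentler, read-only import.  The /dev/mfid* glob and the pool name are
assumptions based on the output quoted above.

  # check that every member device still carries readable ZFS labels:
  for d in /dev/mfid*; do
      echo "== $d =="
      zdb -l "$d" | grep -E 'name:|guid:|state:'
  done

  # if the labels look sane, a read-only, no-mount import attempt is
  # less invasive than -fFX:
  zpool import -N -o readonly=on -f -d /dev storage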

Thanks,

-- 
Michelle Sullivan
http://www.mhix.org/



