Subject: Re: ZFS pool faulted (corrupt metadata) but the disk data appears ok...
To: Ben RUBSON, "freebsd-fs@freebsd.org"
From: Michelle Sullivan <michelle@sorbs.net>
Date: Sat, 03 Feb 2018 07:48:58 +1100

Ben RUBSON wrote:
> On 02 Feb 2018 11:51, Michelle Sullivan wrote:
>
>> Ben RUBSON wrote:
>>> On 02 Feb 2018 11:26, Michelle Sullivan wrote:
>>>
>>> Hi Michelle,
>>>
>>>> Michelle Sullivan wrote:
>>>>> Michelle Sullivan wrote:
>>>>>> So far (a few hours in) zpool import -fFX has not faulted with this
>>>>>> image... it's running out of memory, currently about 16G of 32G -
>>>>>> however the 9.2-P15 kernel died within minutes... out of memory
>>>>>> (all 32G and swap), so I am more optimistic at the moment...
>>>>>> Fingers crossed.
>>>>> And the answer:
>>>>>
>>>>> 11-STABLE on a USB stick.
>>>>>
>>>>> Remove the drive that was replacing the hotspare (i.e. the
>>>>> replacement drive for the one that initially died)
>>>>> zpool import -fFX storage
>>>>> zpool export storage
>>>>>
>>>>> reboot back to 9.x
>>>>> zpool import storage
>>>>> re-insert the replacement drive.
>>>>> reboot
>>>> Gotta thank people for this again, it saved me again, this time on a
>>>> non-FreeBSD system (with a lot of use of a modified recoverdisk for
>>>> OSX - thanks PSK@)... Lost 3 disks out of a raidz2 and 2 more had
>>>> read errors on some sectors.. don't know how much (if any) data I've
>>>> lost, but at least it's not a rebuild from backup of all 48TB..
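
For reference, a minimal sketch of what the byte-copy step looks like with
stock FreeBSD recoverdisk(1); the device names /dev/da1 (failing source) and
/dev/da5 (fresh target) and the work-list path are only placeholders, and the
modified OSX build mentioned above will differ at least in device naming:

    # Clone the failing disk block for block; unreadable spots are retried
    # in smaller chunks, and progress is saved to a work list so an
    # interrupted copy can be resumed instead of starting over.
    recoverdisk -w /var/tmp/da1.worklist /dev/da1 /dev/da5

    # Resume a previously interrupted copy from the saved work list.
    recoverdisk -r /var/tmp/da1.worklist -w /var/tmp/da1.worklist /dev/da1 /dev/da5

Because a block-level clone carries the original ZFS labels and vdev GUID, the
pool treats it as the disk it replaced, which is why the copies could simply
be forced back online as described below.
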
>>>
>>> What about the root-cause ?
>>
>> 3 disks died whilst the server was in transit from Malta to Australia
>> (and I'm surprised that was all, considering the state of some of the
>> stuff that came out of the container - I have a 3kVA UPS that is
>> completely destroyed despite good packing.)
>>
>>> Sounds like you had 5 disks dying at the same time ?
>>
>> Turns out that one of the 3 that had 'red lights on' had bad sectors;
>> the other 2 were just excluded by the BIOS... I did a byte copy onto
>> new drives, found no read errors, so put them back in and forced them
>> online. The other one had 78k of unreadable bytes, so a new disk went
>> in and I convinced the controller that it was the same disk as the one
>> it replaced. The export/import turned up unrecoverable read errors on
>> 2 more disks that nothing had flagged previously, so I byte-copied
>> them onto new drives too, and the import -fFX is currently working
>> (5 hours so far)...
>>
>>> Do you periodically run long smart tests ?
>>
>> Yup (fully automated.)
>>
>>> Zpool scrubs ?
>>
>> Both servers took a zpool scrub before they were packed into the
>> containers... the second one came out unscathed... but then most
>> stuff in the second container came out unscathed, unlike the first....
>
> What a story ! Thanks for the details.
>
> So disks died because of the carrier, as I assume the second unscathed
> server was OK...

Pretty much.

> Heads must have scratched the platters, but they should have been
> parked, so... Really strange.

You'd have thought... though 2 of the drives look like it was wear-and-tear
issues (the 2 not showing red lights), just not picked up on the periodic
scrub.... Could be that the recovery showed them up... you know - how you can
have an array working fine, but one disk dies and then others fail during the
rebuild because of the extra workload.

> Hope you'll recover your whole pool.

So do I. It was my build server; everything important is backed up with
multiple redundancies except for the build VMs.. it'd take me about 4 weeks to
rebuild it if I have to put it all back from backups and rebuild the build
VMs.. but hey, at least I can rebuild it, unlike many with big servers. :P

That said, the import -fFX is still running (and it is actually running), so
it's still scanning/rebuilding the metadata.

Michelle

-- 
Michelle Sullivan
http://www.mhix.org/