Subject: Re: ZFS pool faulted (corrupt metadata) but the disk data appears ok...
To: Ben RUBSON, "freebsd-fs@freebsd.org"
From: Michelle Sullivan <michelle@sorbs.net>
Date: Sat, 03 Feb 2018 07:48:58 +1100

Ben RUBSON wrote:
> On 02 Feb 2018 11:51, Michelle Sullivan wrote:
>
>> Ben RUBSON wrote:
>>> On 02 Feb 2018 11:26, Michelle Sullivan wrote:
>>>
>>> Hi Michelle,
>>>
>>>> Michelle Sullivan wrote:
>>>>> Michelle Sullivan wrote:
>>>>>> So far (a few hours in) zpool import -fFX has not faulted with this
>>>>>> image... it's running out of memory, currently about 16G of 32G -
>>>>>> however the 9.2-P15 kernel died within minutes... out of memory
>>>>>> (all 32G and swap), so I am more optimistic at the moment...
>>>>>> Fingers crossed.
>>>>> And the answer:
>>>>>
>>>>> 11-STABLE on a USB stick.
>>>>>
>>>>> Remove the drive that was replacing the hotspare (i.e. the
>>>>> replacement drive for the one that initially died)
>>>>> zpool import -fFX storage
>>>>> zpool export storage
>>>>>
>>>>> reboot back to 9.x
>>>>> zpool import storage
>>>>> re-insert the replacement drive.
>>>>> reboot
>>>> Gotta thank people for this again, it saved me again, this time on a
>>>> non-FreeBSD system (with a lot of use of a modified recoverdisk for
>>>> OSX - thanks PSK@)... Lost 3 disks out of a raidz2 and 2 more had
>>>> read errors on some sectors.. don't know how much (if any) data I've
>>>> lost, but at least it's not a rebuild from backup of all 48TB..
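
For reference, a minimal sketch of what the byte-copy step looks like with
stock FreeBSD recoverdisk(1); the device names /dev/da1 (failing source) and
/dev/da5 (fresh target) and the work-list path are only placeholders, and the
modified OSX build mentioned above will differ at least in device naming:

    # Clone the failing disk block for block; unreadable spots are retried
    # in smaller chunks, and progress is saved to a work list so an
    # interrupted copy can be resumed instead of starting over.
    recoverdisk -w /var/tmp/da1.worklist /dev/da1 /dev/da5

    # Resume a previously interrupted copy from the saved work list.
    recoverdisk -r /var/tmp/da1.worklist -w /var/tmp/da1.worklist /dev/da1 /dev/da5

Because a block-level clone carries the original ZFS labels and vdev GUID, the
pool treats it as the disk it replaced, which is why the copies could simply
be forced back online as described below.
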
>>>
>>> What about the root-cause ?
>>
>> 3 disks died whilst the server was in transit from Malta to Australia
>> (and I'm surprised that was all, considering the state of some of the
>> stuff that came out of the container - I have a 3kVA UPS that is
>> completely destroyed despite good packing.)
>>
>>> Sounds like you had 5 disks dying at the same time ?
>>
>> Turns out that one of the 3 that had 'red lights on' had bad sectors;
>> the other 2 were just excluded by the BIOS... I did a byte copy onto
>> new drives, found no read errors, so put them back in and forced them
>> online. The other one had 78k of unreadable bytes, so a new disk went
>> in and I convinced the controller that it was the same disk as the one
>> it replaced. The export/import turned up unrecoverable read errors on
>> 2 more disks that nothing had flagged previously, so I byte-copied
>> them onto new drives too, and the import -fFX is currently working
>> (5 hours so far)...
>>
>>> Do you periodically run long smart tests ?
>>
>> Yup (fully automated.)
>>
>>> Zpool scrubs ?
>>
>> Both servers took a zpool scrub before they were packed into the
>> containers... the second one came out unscathed... but then most
>> stuff in the second container came out unscathed, unlike the first....
>
> What a story ! Thanks for the details.
>
> So disks died because of the carrier, as I assume the second unscathed
> server was OK...

Pretty much.

> Heads must have scratched the platters, but they should have been
> parked, so... Really strange.

You'd have thought... though 2 of the drives look like it was wear-and-tear
issues (the 2 not showing red lights), just not picked up on the periodic
scrub.... Could be that the recovery showed them up... you know - how you can
have an array working fine, but one disk dies and then others fail during the
rebuild because of the extra workload.

> Hope you'll recover your whole pool.

So do I. It was my build server; everything important is backed up with
multiple redundancies except for the build VMs.. it'd take me about 4 weeks to
rebuild it if I have to put it all back from backups and rebuild the build
VMs.. but hey, at least I can rebuild it, unlike many with big servers. :P

That said, the import -fFX is still running (and it is actually running), so
it's still scanning/rebuilding the metadata.

Michelle

-- 
Michelle Sullivan
http://www.mhix.org/