Date: Sat, 03 Feb 2018 07:48:58 +1100
From: Michelle Sullivan <michelle@sorbs.net>
To: Ben RUBSON <ben.rubson@gmail.com>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: ZFS pool faulted (corrupt metadata) but the disk data appears ok...
Message-ID: <027070fb-f7b5-3862-3a52-c0f280ab46d1@sorbs.net>
In-Reply-To: <FAB7C3BA-057F-4AB4-96E1-5C3208BABBA7@gmail.com>
References: <54D3E9F6.20702@sorbs.net> <54D41608.50306@delphij.net> <54D41AAA.6070303@sorbs.net> <54D41C52.1020003@delphij.net> <54D424F0.9080301@sorbs.net> <54D47F94.9020404@freebsd.org> <54D4A552.7050502@sorbs.net> <54D4BB5A.30409@freebsd.org> <54D8B3D8.6000804@sorbs.net> <54D8CECE.60909@freebsd.org> <54D8D4A1.9090106@sorbs.net> <54D8D5DE.4040906@sentex.net> <54D8D92C.6030705@sorbs.net> <54D8E189.40201@sorbs.net> <54D924DD.4000205@sorbs.net> <54DCAC29.8000301@sorbs.net> <9c995251-45f1-cf27-c4c8-30a4bd0f163c@sorbs.net> <8282375D-5DDC-4294-A69C-03E9450D9575@gmail.com> <73dd7026-534e-7212-a037-0cbf62a61acd@sorbs.net> <FAB7C3BA-057F-4AB4-96E1-5C3208BABBA7@gmail.com>
Ben RUBSON wrote:
> On 02 Feb 2018 11:51, Michelle Sullivan wrote:
>
>> Ben RUBSON wrote:
>>> On 02 Feb 2018 11:26, Michelle Sullivan wrote:
>>>
>>> Hi Michelle,
>>>
>>>> Michelle Sullivan wrote:
>>>>> Michelle Sullivan wrote:
>>>>>> So far (a few hours in) zpool import -fFX has not faulted with
>>>>>> this image... it's running out of memory, currently about 16G of
>>>>>> 32G - however the 9.2-P15 kernel died within minutes, out of
>>>>>> memory (all 32G and swap), so I am more optimistic at the
>>>>>> moment... Fingers crossed.
>>>>> And the answer:
>>>>>
>>>>> 11-STABLE on a USB stick.
>>>>>
>>>>> Remove the drive that was replacing the hotspare (i.e. the
>>>>> replacement drive for the one that initially died).
>>>>> zpool import -fFX storage
>>>>> zpool export storage
>>>>>
>>>>> Reboot back to 9.x.
>>>>> zpool import storage
>>>>> Re-insert the replacement drive.
>>>>> Reboot.
>>>> Gotta thank people for this again - it saved me again, this time on
>>>> a non-FreeBSD system (with a lot of use of a modified recoverdisk
>>>> for OSX - thanks phk@)... Lost 3 disks out of a raidz2, and 2 more
>>>> had read errors on some sectors... I don't know how much (if any)
>>>> data I've lost, but at least it's not a rebuild from backup of all
>>>> 48TB...
>>>
>>> What about the root cause?
>>
>> 3 disks died whilst the server was in transit from Malta to
>> Australia (and I'm surprised that was all, considering the state of
>> some of the stuff that came out of the container - I have a 3kVA UPS
>> that is completely destroyed despite good packing.)
>>
>>> Sounds like you had 5 disks dying at the same time?
>>
>> Turns out that one of the 3 that had red lights on had bad sectors;
>> the other 2 were just excluded by the BIOS... I did a byte copy onto
>> new drives, found no read errors, so I put them back in and forced
>> them online. The other one had 78k of bytes unreadable, so a new
>> disk went in and I convinced the controller that it was the same
>> disk as the one it replaced. The export/import then turned up
>> unrecoverable read errors on 2 more disks that nothing had flagged
>> previously, so I byte-copied them onto new drives too, and the
>> import -fFX is currently working (5 hours so far)...
>>
>>> Do you periodically run long SMART tests?
>>
>> Yup (fully automated.)
>>
>>> Zpool scrubs?
>>
>> Both servers took a zpool scrub before they were packed into the
>> containers... the second one came out unscathed - but then most
>> stuff in the second container came out unscathed, unlike the
>> first...
>
> What a story! Thanks for the details.
>
> So disks died because of the carrier, as I assume the second
> unscathed server was OK...

Pretty much.

> Heads must have scratched the platters, but they should have been
> parked, so... Really strange.

You'd have thought... though 2 of the drives look like wear and tear
issues (the 2 not showing red lights), just not picked up by the
periodic scrub... Could be that the recovery showed that one up - you
know how you can have an array working fine, but one disk dies and
then others fail during the rebuild because of the extra workload.

> Hope you'll recover your whole pool.

So do I. It was my build server; everything important is backed up
with multiple redundancies except for the build VMs... it'd take me
about 4 weeks to rebuild it if I had to put it all back from backups
and rebuild the build VMs... but hey, at least I can rebuild it,
unlike many with big servers. :P
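For anyone digging this out of the archives later: on FreeBSD that
sort of automation can be as little as a smartd schedule plus
periodic(8)'s built-in scrub job. A minimal sketch, assuming
smartmontools from ports - the schedule and threshold below are
examples, not necessarily what any given box runs:

    # /usr/local/etc/smartd.conf - monitor everything (-a) and run a
    # long self-test on every disk, Saturdays at 03:00:
    DEVICESCAN -a -s L/../../6/03

    # /etc/periodic.conf - have periodic(8) scrub each pool once its
    # last scrub is older than the threshold (in days):
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="35"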
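And the recovery recipe above, pulled together in one place as a
sketch - da0/da1 are example device names; recoverdisk(1) is in the
FreeBSD base system, and ddrescue fills the same role elsewhere:

    # 1. Byte-copy each drive with unreadable sectors onto a fresh
    #    drive, salvaging what can be read:
    recoverdisk /dev/da0 /dev/da1

    # 2. From the 11-STABLE USB stick, with the half-finished
    #    replacement drive pulled, attempt the extreme rewind import:
    #    -f forces the import, -F rewinds to the last good txg
    #    (discarding the most recent transactions), -X lets the rewind
    #    search much further back (slow and memory-hungry on a big pool):
    zpool import -fFX storage

    # 3. Export cleanly, then boot the original 9.x system, import
    #    normally, and re-insert the replacement drive:
    zpool export storage
    zpool import storage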
That said, the import -fFX is still running (and it is actually
running), so it's still scanning/rebuilding the metadata.

Michelle

--
Michelle Sullivan
http://www.mhix.org/