Date: Thu, 02 May 2019 09:46:22 +1000
From: Michelle Sullivan <michelle@sorbs.net>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: Paul Mather <paul@gromit.dlib.vt.edu>, freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: ZFS...
Message-ID: <289FE04E-1692-4763-96B3-91E8C1BBBBD6@sorbs.net>
In-Reply-To: <47137ea9-1ab2-1271-c15f-c0c05a17b92f@multiplay.co.uk>
References: <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net>
 <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net>
 <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de>
 <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net>
 <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com>
 <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net>
 <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net>
 <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net>
 <CAOtMX2gOwwZuGft2vPpR-LmTpMVRy6hM_dYy9cNiw+g1kDYpXg@mail.gmail.com>
 <34539589-162B-4891-A68F-88F879B59650@sorbs.net>
 <CAOtMX2iB7xJszO8nT_KU+rFuSkTyiraMHddz1fVooe23bEZguA@mail.gmail.com>
 <576857a5-a5ab-eeb8-2391-992159d9c4f2@denninger.net>
 <A7928311-8F51-4C72-839C-C9C2BA62C66E@sorbs.net>
 <b0fa0f8e-dc45-9d66-cc48-c733cbb9645b@denninger.net>
 <FD9802E0-E2E4-464A-8ABD-83B0A21C08F2@sorbs.net>
 <bf63007@sorbs.net>
 <CB86C16D-87D9-4D3F-9291-1E2586246E04@sorbs.net>
 <7DBA7907-BE8F-4944-9A71-86E5AC1B85CA@gromit.dlib.vt.edu>
 <5c458075-351f-6eb6-44aa-1bd268398343@sorbs.net>
Michelle Sullivan
http://www.mhix.org/

Sent from my iPad

> On 02 May 2019, at 03:39, Steven Hartland <killing@multiplay.co.uk> wrote:
>
>> On 01/05/2019 15:53, Michelle Sullivan wrote:
>> Paul Mather wrote:
>>>> On Apr 30, 2019, at 11:17 PM, Michelle Sullivan <michelle@sorbs.net> wrote:
>>>>
>>>> Been there, done that, though with ext2 rather than UFS... still got all my data back, even though it was a nightmare.
>>>
>>> Is that an implication that had all your data been on UFS (or ext2) this time around you would have got it all back? (I've got that impression through this thread from things you've written.) That sort of makes it sound like UFS is bulletproof to me.
>>
>> It's definitely not bulletproof (far from it) - however, when the data on disk is not corrupt I have managed to recover it, even if it has been a nightmare: no structure, all files in lost+found, etc., or even resorting to R-Studio in the event of lost RAID information.
>
> Yes, but you seem to have done this with ZFS too, just not in this particularly bad case.

There is no R-Studio for ZFS, or I would have turned to it as soon as this issue hit.

> If you imagine that the in-memory update for the metadata was corrupted and then written out to disk, which is what you seem to have experienced with your ZFS pool, then you'd be in much the same position.
>
>> This case, from what my limited knowledge has managed to fathom, is a spacemap that has become corrupt due to a partial write during the hard power failure. This was the second hard outage during the resilver process following a drive platter failure (on a RAIDZ2, so a single platter failure should be completely recoverable in all cases, except HBA failure or other corruption, which does not appear to be the case here). The spacemap fails checksum (no surprise there, given it was part-written), however it cannot be repaired, for whatever reason. Now, I get that this is an interesting case: one cannot just assume anything about the corrupt spacemap. It could be complete and just the checksum is wrong, or it could be completely corrupt and ignorable. But as I understand ZFS (and please, watchers, chime in if I'm wrong) the spacemap is just the free-space map. If it is corrupt or missing one cannot just 'fix it', because there is a very good chance the fix would corrupt something that is actually allocated; the safest way "to fix it" would be to consider it 100% full and therefore 'dead space'. But ZFS doesn't do that - probably a good thing - the result being that a drive that is supposed to be good (and zdb reports some 36M+ objects there) becomes completely unreadable. My thought (desire/want) for a 'walk' tool would be a last-resort tool that could walk the datasets and send them elsewhere (like zfs send), so that I could create a new pool elsewhere, send the data it knows about to that pool, and then blow away the original. If there are corruptions or data missing, that's my problem; it's a last resort. But in the case where the critical structures become corrupt, it means a local recovery option exists: if the data is all there and the corruption is just a spacemap, one can transfer the entire drive/data to a new pool whilst the original host is rebuilt. This would *significantly* help most people with large pools who have to blow them away and re-create the pools because of errors/corruptions. And with the addition of rsync-style file checksumming, it would be trivial to just 'fix' the data corrupted or missing from a mirror host, rather than transferring the entire pool from (possibly) offsite.
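To make the 'walk' idea concrete: when the on-disk structures are healthy, send/receive already does the transfer half of what I want. A sketch only, assuming the pool will still import read-only and a pre-failure snapshot exists (the pool, dataset, snapshot and host names here are invented):

    # import without mounting anything, read-only, so nothing further is written
    zpool import -o readonly=on -N -f storage

    # walk one dataset tree and stream it to a fresh pool on another host
    # (-R recurses through child datasets, snapshots and properties)
    zfs send -R storage/data@pre-failure | ssh otherhost zfs receive -d newpool

What I'm wishing for is essentially that, but able to keep walking when a free-space structure (rather than the data itself) fails its checksum.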
> From what I've read that's not a partial-write issue, as in that case the pool would have just rolled back. It sounds more like the write was successful, but the data in that write was trashed due to your power incident, and that was replicated across ALL drives.

I think this might be where the problem started: it was already rolling back from the first power issue (it did exactly what was expected and programmed - it rolled back 5 seconds, which, as no one had had write access to it since the start of the resilver, I really didn't care about, as the only changes were the resilver itself). Now, your assertion/musing may be correct - all drives got trashed data - I think not, but unless we get into it and examine it, I think we won't know. What I do know is that in the second round -FfX wouldn't work; I used zdb to locate a "LOADED" MOS and used -t <txg> to import. The txg number was 7 or 8 back from current, so just outside the -X limit (going off memory here, so it could have been more, but I remember it was just past the switch's limit).

> To be clear, this may or may not be what you're seeing, as you don't seem to have covered any of the details of the issues you're seeing and what steps, in detail, you have tried to recover with?

There have been many steps over the last month, and with some of them I may have taken it from very difficult to recover to non-recoverable... though the only writes are whatever the kernel does, as I have not had it (the dataset) mounted at any time, even though it has imported.

> I'm not saying this is the case, but all may not be lost depending on the exact nature of the corruption.
>
> For more information on space maps see:
> https://www.delphix.com/blog/delphix-engineering/openzfs-code-walk-metaslabs-and-space-maps

This is something I read a month ago, along with multiple other articles on the same blog, including https://www.delphix.com/blog/openzfs-pool-import-recovery which, I might add, got me from non-importable to importable, but not mountable. I have *not* attempted to bypass the checksum line for the spacemap load to date, as I see that as a possible way to make the problem worse.

> https://sdimitro.github.io/post/zfs-lsm-flushing/

Not read this.

> A similar behaviour turned out to be a bug:
> https://www.reddit.com/r/zfs/comments/97czae/zfs_zdb_space_map_errors_on_unmountable_zpool/

Or this... will go there following "pressing send". :)

> Regards
> Steve
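P.S. For anyone who lands on this thread with the same problem, the rough shape of what got me from non-importable to importable. This is from memory, so treat the exact flags as approximate; the device path, pool name 'storage', and txg number are placeholders:

    # dump the labels and uberblock array from a pool member to find
    # candidate txgs older than the damaged one
    zdb -ul /dev/da2p3

    # ask zdb whether the MOS at a candidate txg loads cleanly
    # (-e for an exported/unimportable pool; -t caps the txg searched)
    zdb -e -t 7668123 storage

    # the documented rewind options, which failed for me
    zpool import -f -FX storage

    # force the import at the chosen txg instead, read-only to be safe;
    # -T is the txg rewind switch (undocumented in the man page on the
    # versions I looked at, so check your version's zpool source)
    zpool import -o readonly=on -f -T 7668123 storage

    # once imported, zdb can walk the metaslabs and space maps to show
    # which one trips the checksum (more m's = more detail)
    zdb -mm storage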