Date: Thu, 02 May 2019 09:46:22 +1000
From: Michelle Sullivan <michelle@sorbs.net>
To: Steven Hartland <killing@multiplay.co.uk>
Cc: Paul Mather <paul@gromit.dlib.vt.edu>, freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: ZFS...
Message-ID: <289FE04E-1692-4763-96B3-91E8C1BBBBD6@sorbs.net>
In-Reply-To: <47137ea9-1ab2-1271-c15f-c0c05a17b92f@multiplay.co.uk>
References: <30506b3d-64fb-b327-94ae-d9da522f3a48@sorbs.net>
 <17B373DA-4AFC-4D25-B776-0D0DED98B320@sorbs.net>
 <70fac2fe3f23f85dd442d93ffea368e1@ultra-secure.de>
 <70C87D93-D1F9-458E-9723-19F9777E6F12@sorbs.net>
 <CAGMYy3tYqvrKgk2c==WTwrH03uTN1xQifPRNxXccMsRE1spaRA@mail.gmail.com>
 <5ED8BADE-7B2C-4B73-93BC-70739911C5E3@sorbs.net>
 <d0118f7e-7cfc-8bf1-308c-823bce088039@denninger.net>
 <2e4941bf-999a-7f16-f4fe-1a520f2187c0@sorbs.net>
 <CAOtMX2gOwwZuGft2vPpR-LmTpMVRy6hM_dYy9cNiw+g1kDYpXg@mail.gmail.com>
 <34539589-162B-4891-A68F-88F879B59650@sorbs.net>
 <CAOtMX2iB7xJszO8nT_KU+rFuSkTyiraMHddz1fVooe23bEZguA@mail.gmail.com>
 <576857a5-a5ab-eeb8-2391-992159d9c4f2@denninger.net>
 <A7928311-8F51-4C72-839C-C9C2BA62C66E@sorbs.net>
 <b0fa0f8e-dc45-9d66-cc48-c733cbb9645b@denninger.net>
 <FD9802E0-E2E4-464A-8ABD-83B0A21C08F2@sorbs.net>
 <bf63007@sorbs.net>
 <CB86C16D-87D9-4D3F-9291-1E2586246E04@sorbs.net>
 <7DBA7907-BE8F-4944-9A71-86E5AC1B85CA@gromit.dlib.vt.edu>
 <5c458075-351f-6eb6-44aa-1bd268398343@sorbs.net>
Michelle Sullivan
http://www.mhix.org/

Sent from my iPad

> On 02 May 2019, at 03:39, Steven Hartland <killing@multiplay.co.uk> wrote:
>
>> On 01/05/2019 15:53, Michelle Sullivan wrote:
>> Paul Mather wrote:
>>>> On Apr 30, 2019, at 11:17 PM, Michelle Sullivan <michelle@sorbs.net> wrote:
>>>>
>>>> Been there, done that, though with ext2 rather than UFS... still got all my data back, even though it was a nightmare.
>>>
>>> Is that an implication that had all your data been on UFS (or ext2) this time around you would have got it all back? (I've got that impression through this thread from things you've written.) That sort of makes it sound like UFS is bulletproof to me.
>>
>> It's definitely not bulletproof (far from it) - however, when the data on disk is not corrupt I have managed to recover it, even if it has been a nightmare: no structure, all files in lost+found, etc., or even resorting to R-Studio in the event of lost RAID information.
>
> Yes, but you seem to have done this with ZFS too, just not in this particularly bad case.

There is no R-Studio for ZFS, or I would have turned to it as soon as this issue hit.

> If you imagine that the in-memory update for the metadata was corrupted and then written out to disk, which is what you seem to have experienced with your ZFS pool, then you'd be in much the same position.
>
>> This case, from what my limited knowledge has managed to fathom, is a spacemap that has become corrupt due to a partial write during the hard power failure. This was the second hard outage during the resilver process following a drive platter failure (on a RAIDZ2, so a single platter failure should be completely recoverable in all cases, except HBA failure or other corruption, which does not appear to be the case here). The spacemap fails checksum (no surprise there, given it was part-written), however it cannot be repaired, for whatever reason. Now, I get that this is an interesting case: one cannot just assume anything about the corrupt spacemap. It could be complete and just the checksum is wrong, or it could be completely corrupt and ignorable. But as I understand ZFS (and please, watchers, chime in if I'm wrong) the spacemap is just the free-space map. If it is corrupt or missing one cannot just 'fix it', because there is a very good chance the fix would corrupt something that is actually allocated; the safest way "to fix it" would be to consider it 100% full and therefore 'dead space'. But ZFS doesn't do that - probably a good thing - the result being that a drive that is supposed to be good (and zdb reports some 36M+ objects there) becomes completely unreadable. My thought (desire/want) for a 'walk' tool would be a last-resort tool that could walk the datasets and send them elsewhere (like zfs send), so that I could create a new pool elsewhere, send the data it knows about to that pool, and then blow away the original. If there are corruptions or data missing, that's my problem; it's a last resort. But in the case where the critical structures become corrupt, it means a local recovery option exists: if the data is all there and the corruption is just a spacemap, one can transfer the entire drive/data to a new pool whilst the original host is rebuilt. This would *significantly* help most people with large pools who have to blow them away and re-create the pools because of errors/corruptions. And with the addition of rsync-style file checksumming, it would be trivial to just 'fix' the data corrupted or missing from a mirror host, rather than transferring the entire pool from (possibly) offsite.
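To make the 'walk' idea concrete: when the on-disk structures are healthy, send/receive already does the transfer half of what I want. A sketch only, assuming the pool will still import read-only and a pre-failure snapshot exists (the pool, dataset, snapshot and host names here are invented):

    # import without mounting anything, read-only, so nothing further is written
    zpool import -o readonly=on -N -f storage

    # walk one dataset tree and stream it to a fresh pool on another host
    # (-R recurses through child datasets, snapshots and properties)
    zfs send -R storage/data@pre-failure | ssh otherhost zfs receive -d newpool

What I'm wishing for is essentially that, but able to keep walking when a free-space structure (rather than the data itself) fails its checksum.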
> From what I've read that's not a partial-write issue, as in that case the pool would have just rolled back. It sounds more like the write was successful, but the data in that write was trashed due to your power incident, and that was replicated across ALL drives.

I think this might be where the problem started: it was already rolling back from the first power issue (it did exactly what was expected and programmed - it rolled back 5 seconds, which, as no one had had write access to it since the start of the resilver, I really didn't care about, as the only changes were the resilver itself). Now, your assertion/musing may be correct - all drives got trashed data - I think not, but unless we get into it and examine it, I think we won't know. What I do know is that in the second round -FfX wouldn't work; I used zdb to locate a "LOADED" MOS and used -t <txg> to import. The txg number was 7 or 8 back from current, so just outside the -X limit (going off memory here, so it could have been more, but I remember it was just past the switch's limit).

> To be clear, this may or may not be what you're seeing, as you don't seem to have covered any of the details of the issues you're seeing and what steps, in detail, you have tried to recover with?

There have been many steps over the last month, and with some of them I may have taken it from very difficult to recover to non-recoverable... though the only writes are whatever the kernel does, as I have not had it (the dataset) mounted at any time, even though it has imported.

> I'm not saying this is the case, but all may not be lost depending on the exact nature of the corruption.
>
> For more information on space maps see:
> https://www.delphix.com/blog/delphix-engineering/openzfs-code-walk-metaslabs-and-space-maps

This is something I read a month ago, along with multiple other articles on the same blog, including https://www.delphix.com/blog/openzfs-pool-import-recovery which, I might add, got me from non-importable to importable, but not mountable. I have *not* attempted to bypass the checksum line for the spacemap load to date, as I see that as a possible way to make the problem worse.

> https://sdimitro.github.io/post/zfs-lsm-flushing/

Not read this.

> A similar behaviour turned out to be a bug:
> https://www.reddit.com/r/zfs/comments/97czae/zfs_zdb_space_map_errors_on_unmountable_zpool/

Or this... will go there following "pressing send". :)

> Regards
> Steve
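P.S. For anyone who lands on this thread with the same problem, the rough shape of what got me from non-importable to importable. This is from memory, so treat the exact flags as approximate; the device path, pool name 'storage', and txg number are placeholders:

    # dump the labels and uberblock array from a pool member to find
    # candidate txgs older than the damaged one
    zdb -ul /dev/da2p3

    # ask zdb whether the MOS at a candidate txg loads cleanly
    # (-e for an exported/unimportable pool; -t caps the txg searched)
    zdb -e -t 7668123 storage

    # the documented rewind options, which failed for me
    zpool import -f -FX storage

    # force the import at the chosen txg instead, read-only to be safe;
    # -T is the txg rewind switch (undocumented in the man page on the
    # versions I looked at, so check your version's zpool source)
    zpool import -o readonly=on -f -T 7668123 storage

    # once imported, zdb can walk the metaslabs and space maps to show
    # which one trips the checksum (more m's = more detail)
    zdb -mm storage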