Date:      Tue, 18 Apr 2023 12:37:30 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        José Pérez <fbl@aoek.com>, Current FreeBSD <freebsd-current@freebsd.org>
Subject:   Re: git: 2a58b312b62f - main - zfs: merge openzfs/zfs@431083f75
Message-ID:  <963B0620-5342-4EC3-AA54-52DDD70D9E3D@yahoo.com>
References:  <963B0620-5342-4EC3-AA54-52DDD70D9E3D.ref@yahoo.com>

José Pérez <fbl_at_aoek.com> wrote on
Date: Tue, 18 Apr 2023 16:59:03 UTC :

> On 2023-04-17 21:59, Pawel Jakub Dawidek wrote:
> > José,
> >
> > I can only speak of block cloning in detail, but I'll try to
> > address everything.
> >
> > The easiest way to avoid block_cloning-related corruption on a
> > kernel after the last OpenZFS merge, but before e0bb199925, is to set
> > the compress property to 'off' and the sync property to something
> > other than 'disabled'. This will avoid the block_cloning-related
> > corruption and the zil_replaying() panic.
> >
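[Those two property changes can be sketched as follows. This is a hedged sketch: "zroot" and the dataset name are placeholders for your own pool layout, and sync=standard is just one example of a value other than 'disabled'.]

```shell
# List datasets whose compression is not already "off"; the awk filter
# keeps only the first column (the dataset name) when the value differs.
zfs get -H -o name,value -r compression zroot | awk '$2 != "off" {print $1}'

# Then, for each dataset printed, turn compression off, e.g.:
zfs set compression=off zroot/usr/src

# And make sure sync is not "disabled" ("standard" is the default):
zfs set sync=standard zroot
```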
> > As for the other corruption, unfortunately I don't know the details,
> > but my understanding is that it is happening under higher load. Not
> > sure I'd trust a kernel built on a machine with this bug present.
> > What I would do is to compile the kernel as of 068913e4ba somewhere
> > else, boot the problematic machine in single-user mode and install
> > the newly built kernel.
> >
> > As far as I can tell, contrary to some initial reports, none of the
> > problems introduced by the recent OpenZFS merge corrupt the pool
> > metadata, only file data. You can locate the files modified with the
> > bogus kernel using find(1) with a proper modification time, but you
> > have to decide what to do with them (either throw them away, restore
> > them from backup, or inspect them).
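[Pawel's find(1) suggestion can be sketched like this. A hedged sketch: the paths and timestamps are placeholders for your own pool-upgrade and fixed-kernel-boot times; -newermt is the find primary that takes a date string directly.]

```shell
# List regular files whose modification time falls inside the window
# in which the buggy kernel was running; adjust paths and dates.
find /usr /home -type f \
    -newermt "2023-04-16 19:57" ! -newermt "2023-04-17 21:42"
```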
>
> Sharing my experience on how to get out of the worst-case scenario
> with a build machine that is affected by the bug.
>
> CAVEAT: this is my experience, take it at your own risk. It worked for
> me; there is no guarantee that it will work for you. You may create
> corrupted files and make your system harder to recover, or permanently
> brick it. Don't blame me, you have been warned. YMMV.
>
> Boot in single-user mode and check whether your pool has block cloning
> in use:
> # zpool get feature@block_cloning zroot
> NAME   PROPERTY               VALUE   SOURCE
> zroot  feature@block_cloning  active  local
>
> In this case it does, because the value is "active". If it's
> "enabled" you do not need to do anything.

Well, if block_cloning is disabled, it would not become active.

But, if it is enabled, it can automatically become active by
creating a first entry in the involved Block Reference Table
during any activity that meets the criteria for such. If the
FreeBSD vintage in place is one that corrupts zfs data for any
reason, one would still want to progress to a vintage that does
not corrupt zfs data, even if block_cloning is enabled but not
active just before starting such an update sequence.

So, in progressing past the vintage that corrupts zfs data,
one could end up with block_cloning becoming active in the
process. At least, that is my understanding of the issue.

It may be that only a subset of the "causes data corruption" range
of vintages would have to worry about block_cloning becoming active
during the effort to get past all the sources of corruption.
(If so, I've no clue what range that would be.)

I expect that the "you do not need to do anything" for
block_cloning being "enabled" instead of "active" may be too
strong a claim, depending on the specific starting vintage
inside the range with zfs data corruption problems.

(From what I've read, when the last Block Reference Table
entry is removed for any reason, the matching block_cloning
changes back from being indicated as active to being indicated
as enabled.)

> 1) When in single-user mode, set the compression property to "off" on
> any active zfs dataset that has compression other than "off", and the
> sync property to something other than "disabled".
> 2) Boot multiuser and update your current sources, e.g.
> git pull --rebase
> 3) Build and install a new kernel without too much pressure (e.g.
> with -j 1):
> make -j 1 kernel
> 4) Reboot with the new kernel
> 5) Now you have to reinstall the kernel with
> make installkernel
> This is because the new kernel files were written by the old kernel
> and need to be replaced.
> 6) Find out when the pool was upgraded (I used command history) and
> create a file with that date, in my case:
> touch -t 2304161957 /tmp/from
> 7) Find out when you booted the new kernel (I used fgrep Copyright
> /var/log/messages | tail -n 1) and create a file with that date, in
> my case:
> touch -t 2304172142 /tmp/to
> 8) Find the files/dirs created between the two dates:
> find / -newerBm /tmp/from -and -not -newerBm /tmp/to > /tmp/filelist.txt
> 9) Inspect /tmp/filelist.txt and save any important items. If the
> important files are not corrupted you can do:
> cp important_file new; mv new important_file
> NOTA BENE: "touch important_file" would not work, you do need to
> re-create the file.
> 10) Delete the remaining files/dirs in /tmp/filelist.txt. If you did
> 5) you will remove /boot/kernel.old files, but not /boot/kernel files.
> 11) Restore your compression and sync properties where appropriate.
>
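[Steps 6) through 8) can be exercised on a throwaway directory first, before pointing find at /. A hedged sketch: -newermm compares modification times and works on both BSD and GNU find, while the -newerBm in step 8 compares the file's birth time, the better choice on FreeBSD; the timestamps are the placeholders from the example above.]

```shell
#!/bin/sh
# Dry-run of the reference-file technique on a scratch directory.
d=$(mktemp -d)
touch -t 2304161957 "$d/from"     # step 6: when the pool was upgraded
touch -t 2304170000 "$d/suspect"  # stands in for a file written under the bad kernel
touch -t 2304172142 "$d/to"       # step 7: first boot of the fixed kernel

# Step 8: everything modified after "from" but not after "to"
# (the two reference files themselves are excluded by name).
find "$d" -type f ! -name from ! -name to \
    -newermm "$d/from" ! -newermm "$d/to"
```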

===
Mark Millard
marklmi at yahoo.com



