From nobody Mon Dec 20 14:15:00 2021
Date: Mon, 20 Dec 2021 15:15:00 +0100
From: Fabian Keil <freebsd-listen@fabiankeil.de>
To: freebsd-hackers@freebsd.org
Subject: Re: Patches for GPT and geli recovery
Message-ID: <20211220151500.5e57c1a6@fabiankeil.de>
In-Reply-To: <67419422-5633-4e4b-870d-aec8762ec6a1@gmail.com>
References: <20211219175011.3023a232@fabiankeil.de>
 <CAFPNf59bXZTEdYzSmM7qH5mwYSykRdXrpHUOqn-qiE9ND2d=xQ@mail.gmail.com>
 <67419422-5633-4e4b-870d-aec8762ec6a1@gmail.com>
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="Sig_/XQGc+8skOnRuNnHdVPt8VzF";
 protocol="application/pgp-signature"; micalg=pgp-sha1

--Sig_/XQGc+8skOnRuNnHdVPt8VzF
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Jason Bacon <bacon4000@gmail.com> wrote on 2021-12-19 at 16:21:39:

> On 12/19/21 13:40, Lee Brown wrote:
> >
> >
> > On Sun, Dec 19, 2021 at 8:52 AM Fabian Keil
> > <freebsd-listen@fabiankeil.de> wrote:
> >
> >     [cut]
> >     BTW, I would also be interested to know if others have
> >     experienced similar data corruption and could figure
> >     out how it happened.
> >
> > Sounds like bitrot.  Bits flip on disks all the time, it doesn't matter
> > if they are spinning rust or SSD, it happens.  Sometimes they are
> > detected and corrected, in which case you won't know.  Sometimes they
> > are detected and uncorrectable, you'll see that error propagated into
> > the driver.  And sometimes they are not detected at all and cause no
> > errors that the OS can surmise.  The higher the density of bits, the
> > higher the probability of corruption.  SMART is not reliably
> > predictive.  How does it happen?  Cosmic rays and entropy.  I've had
> > lightly written SSDs fail after a few months.
> >
> > I don't use ZFS, but have GELI-Authentication under a GMIRROR, so
> > whenever a bad checksum is read, it breaks the mirror, which gets
> > attention (last I looked, there wasn't a simple userland hook for bad
> > GELI reads, but there was for GMIRROR add/remove events).

> How old was the corrupted filesystem?

I just checked:

fk@t520 /var/log/fk/2021-12-20 $grep "zpool create" *zpool-history*
ssh-steffen-sudo-zpool-history--l-bpool-20211220T102957:2017-08-10.21:52:07 zpool create -f -o version=28 -O compression=lzjb bpool /dev/ada0p2 [user 0 (root) on kendra]
ssh-steffen-sudo-zpool-history--l-dpool-20211220T103420:2015-03-17.18:46:42 zpool create dpool /dev/gpt/dpool-ada0.eli [user 0 (root) on kendra]
ssh-steffen-sudo-zpool-history--l-rpool-ada1-20211220T103234:2017-04-11.12:33:47 zpool create -o version=28 -o failmode=continue -O compression=lzjb -O checksum=sha256 rpool mirror /dev/ada0p3.eli /dev/da1p3.eli [user 0 (root) on ElectroBSD-11.0-STABLE-amd64]
sudo-zpool-history--l-cloudia2-20211220T103856:2017-04-12.14:45:07 zpool create -O recordsize=1m -O checksum=sha512 cloudia2 /dev/label/cloudia2.eli [user 0 (root) on t520.local]

So it looks like the partially corrupted pool "dpool" on
partition five was created on 2015-03-17, while the
(former) root pool "rpool-ada1", which didn't show any signs
of corruption, was created on 2017-04-11. This indicates
that I installed a new operating system with cloudiatr
and kept the data pool unmodified.

The boot pool "bpool" was created on 2017-08-10 but
it gets recreated with each ElectroBSD kernel update
anyway.
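
As a rough illustration (not something used in the original investigation),
the creation dates could be pulled out of saved "zpool history -l" files
with a small script. The line layout is assumed from the grep output quoted
above; the names LINE_RE, OPTIONS_WITH_VALUE, pool_name and creation_dates
are hypothetical helpers, and the option list only covers a few
"zpool create" flags.

```python
import re
from datetime import datetime

# Assumed layout: "<file name>:<YYYY-MM-DD.HH:MM:SS> zpool create <args> [user ...]"
LINE_RE = re.compile(
    r"^(?P<file>.+?):(?P<ts>\d{4}-\d{2}-\d{2}\.\d{2}:\d{2}:\d{2})"
    r" zpool create (?P<args>.*?)(?: \[.*\])?$"
)

# "zpool create" options that consume the following token (not exhaustive).
OPTIONS_WITH_VALUE = {"-o", "-O", "-m", "-R", "-t"}


def pool_name(args):
    """Return the first non-option token, which is the pool name."""
    tokens = args.split()
    i = 0
    while i < len(tokens):
        if tokens[i] in OPTIONS_WITH_VALUE:
            i += 2  # skip the option and its value
        elif tokens[i].startswith("-"):
            i += 1  # skip a flag without a value, such as -f
        else:
            return tokens[i]
    return None


def creation_dates(lines):
    """Map pool name to creation time for every matching 'zpool create' line."""
    result = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        name = pool_name(match.group("args"))
        if name is not None:
            result[name] = datetime.strptime(match.group("ts"), "%Y-%m-%d.%H:%M:%S")
    return result
```

Feeding it the grep output line by line (for example via sys.stdin) gives
a dictionary that can be sorted by date to see which pool is the oldest.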

>                                        I habitually wipe my disks and do
> a fresh install at least once every 2 years to avoid issues like this.

Do you read back the complete data after fresh installs to confirm
that the rewritten data arrived on disk as expected?

I prefer ZFS scrubs to confirm that the data is still reachable.

It's not obvious to me that recreating the data is safer than
keeping the old data but verifying checksums.
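
To make the "verify checksums instead of rewriting" idea concrete, here is
a minimal file-level sketch. It is only an analogy: a ZFS scrub verifies
block checksums inside the pool, while this script records SHA-256 digests
of files and checks them again later. The function names file_digest,
build_manifest and verify are made up for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path relative to root to its current digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def verify(root, manifest):
    """Return the sorted relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        path = os.path.join(root, rel)
        if not os.path.isfile(path) or file_digest(path) != expected:
            bad.append(rel)
    return sorted(bad)
```

Running build_manifest() right after writing the data and verify() at
later points answers both questions raised here: whether freshly written
data arrived on disk as expected, and whether old data is still intact.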

> I have experienced unexplained, unrecoverable errors on old filesystems,
> but fortunately nothing critical.

I too have experienced various unrecoverable errors on disks,
but I never lost GPT partition data and geli metadata at the
same time while most of the data on disk remained valid and
the disk didn't report any problems.

While the pools "dpool" and "cloudia2" contained a couple of
corrupt blocks, this could be completely unrelated to the
corruption of the partition table and the geli metadata.

> This to me serves as another reminder to maintain regular backups of
> important files and consider everything else expendable.

Agreed.

The problem disk mostly contained DVD rips and while some of them
weren't available on other disks as well, they could be recreated
by simply ripping the DVDs again.

Of course it's conceivable that some of the source DVDs now contain
corruption as well (I own many older DVDs that contain corrupt blocks),
but I could probably buy them new or rent them if needed.

I use zogftw for backups and my important data is backed
up to multiple external pools and some of them are stored
off-site.

Fabian

--Sig_/XQGc+8skOnRuNnHdVPt8VzF
Content-Type: application/pgp-signature
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iF0EARECAB0WIQTKUNd6H/m3+ByGULIFiohV/3dUnQUCYcCP5QAKCRAFiohV/3dU
nSZWAKCVKS3A32rTmr/6Ymq7QoQBiobNywCfUWgNungBQYNYqLdTGI77WTHKEw0=
=JFP0
-----END PGP SIGNATURE-----

--Sig_/XQGc+8skOnRuNnHdVPt8VzF--