From nobody Mon Dec 20 14:15:00 2021
Date: Mon, 20 Dec 2021 15:15:00 +0100
From: Fabian Keil
To: freebsd-hackers@freebsd.org
Subject: Re: Patches for GPT and geli recovery
Message-ID: <20211220151500.5e57c1a6@fabiankeil.de>
In-Reply-To: <67419422-5633-4e4b-870d-aec8762ec6a1@gmail.com>
References: <20211219175011.3023a232@fabiankeil.de> <67419422-5633-4e4b-870d-aec8762ec6a1@gmail.com>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
Content-Type: multipart/signed; boundary="Sig_/XQGc+8skOnRuNnHdVPt8VzF"; protocol="application/pgp-signature"; micalg=pgp-sha1

--Sig_/XQGc+8skOnRuNnHdVPt8VzF
Content-Type: text/plain; charset=UTF-8

Jason Bacon wrote on 2021-12-19 at 16:21:39:

> On 12/19/21 13:40, Lee Brown wrote:
> >
> > On Sun, Dec 19, 2021 at 8:52 AM Fabian Keil wrote:
> >
> >     [cut]
> >     BTW, I would also be interested to know if others have
> >     experienced similar data corruption and could figure
> >     out how it happened.
> >
> > Sounds like bitrot. Bits flip on disks all the time, it doesn't matter
> > if they are spinning rust or SSD, it happens. Sometimes they are
> > detected and corrected, in which case you won't know. Sometimes they
> > are detected and uncorrectable, you'll see that error propagated into
> > the driver. And sometimes they are not detected at all and cause no
> > errors that the OS can surmise. The higher the density of bits, the
> > higher the probability of corruption. SMART is not reliably
> > predictive. How does it happen? Cosmic rays and entropy. I've had
> > lightly written SSDs fail after a few months.
> >
> > I don't use ZFS, but have GELI-Authentication under a GMIRROR, so
> > whenever a bad checksum is read, it breaks the mirror, which gets
> > attention (last I looked, there wasn't a simple userland hook for bad
> > GELI reads, but there was for GMIRROR add/remove events).
>
> How old was the corrupted filesystem?

I just checked:

fk@t520 /var/log/fk/2021-12-20 $grep "zpool create" *zpool-history*
ssh-steffen-sudo-zpool-history--l-bpool-20211220T102957:2017-08-10.21:52:07 zpool create -f -o version=28 -O compression=lzjb bpool /dev/ada0p2 [user 0 (root) on kendra]
ssh-steffen-sudo-zpool-history--l-dpool-20211220T103420:2015-03-17.18:46:42 zpool create dpool /dev/gpt/dpool-ada0.eli [user 0 (root) on kendra]
ssh-steffen-sudo-zpool-history--l-rpool-ada1-20211220T103234:2017-04-11.12:33:47 zpool create -o version=28 -o failmode=continue -O compression=lzjb -O checksum=sha256 rpool mirror /dev/ada0p3.eli /dev/da1p3.eli [user 0 (root) on ElectroBSD-11.0-STABLE-amd64]
sudo-zpool-history--l-cloudia2-20211220T103856:2017-04-12.14:45:07 zpool create -O recordsize=1m -O checksum=sha512 cloudia2 /dev/label/cloudia2.eli [user 0 (root) on t520.local]

So it looks like the partially corrupted pool "dpool" on partition
five was created on 2015-03-17, while the (former) root pool
"rpool-ada1", which didn't show any signs of corruption, was created
on 2017-04-11. This indicates that I installed a new operating system
with cloudiatr and kept the data pool unmodified.

The boot pool "bpool" was created on 2017-08-10, but it gets recreated
with each ElectroBSD kernel update anyway.

> I habitually wipe my disks and do
> a fresh install at least once every 2 years to avoid issues like this.

Do you read back the complete data after fresh installs to confirm
that the rewritten data arrived on disk as expected?

I prefer ZFS scrubs to confirm that the data is still reachable.
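
To illustrate what I mean by reading the data back, here is a minimal
sketch in Python. It assumes that both the original data and the
rewritten copy are mounted as ordinary directory trees; the function
names sha256_of and compare_trees are made up for this example. It is
not what a ZFS scrub does (a scrub verifies the block checksums stored
in the pool itself), it merely compares file contents after a rewrite.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, copy_root):
    """Return (relative path, problem) pairs for files under source_root
    that are missing from copy_root or whose contents differ there."""
    source_root, copy_root = Path(source_root), Path(copy_root)
    problems = []
    for source_file in sorted(source_root.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_root)
        copy_file = copy_root / relative
        if not copy_file.is_file():
            problems.append((relative.as_posix(), "missing"))
        elif sha256_of(source_file) != sha256_of(copy_file):
            problems.append((relative.as_posix(), "checksum mismatch"))
    return problems
```

An empty list from compare_trees("/old/data", "/new/data") (both paths
hypothetical) would mean that every file in the old tree reads back
identically from the new one. Freshly written files may be served from
a cache instead of the disk, so such a check is probably more
meaningful after the new file system has been remounted.
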
It's not obvious to me that recreating the data is safer than
keeping the old data and verifying its checksums.

> I have experienced unexplained, unrecoverable errors on old filesystems,
> but fortunately nothing critical.

I too have experienced various unrecoverable errors on disks,
but I never lost GPT partition data and geli metadata at the
same time while most of the data on disk remained valid and
the disk reported no problems.

While the pools "dpool" and "cloudia2" contained a couple of
corrupt blocks, this could be completely unrelated to the
corruption of the partition table and the geli metadata.

> This to me serves as another reminder to maintain regular backups of
> important files and consider everything else expendable.

Agreed. The problem disk mostly contained DVD rips, and while
some of them weren't available on other disks, they could be
recreated by simply ripping the DVDs again.

Of course it's conceivable that some of the source DVDs now
contain corruption as well (I own many older DVDs that contain
corrupt blocks), but I could probably buy them new or rent them
if needed.

I use zogftw for backups. My important data is backed up
to multiple external pools, some of which are stored off-site.

Fabian

--Sig_/XQGc+8skOnRuNnHdVPt8VzF
Content-Type: application/pgp-signature
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iF0EARECAB0WIQTKUNd6H/m3+ByGULIFiohV/3dUnQUCYcCP5QAKCRAFiohV/3dU
nSZWAKCVKS3A32rTmr/6Ymq7QoQBiobNywCfUWgNungBQYNYqLdTGI77WTHKEw0=
=JFP0
-----END PGP SIGNATURE-----

--Sig_/XQGc+8skOnRuNnHdVPt8VzF--