Date:      Sun, 7 Jan 2024 14:15:14 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        lev@freebsd.org
Cc:        freebsd-fs <freebsd-fs@freebsd.org>, freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: FreeBSD 13.2-STABLE can not boot from damaged mirror AND pool stuck in "resilver" state even without new devices.
Message-ID:  <CANCZdfrV7ROiQD-UGAbgydKbYE7jiM_9t9c6n8F79hqq16X0Kg@mail.gmail.com>
In-Reply-To: <2f91eeb7-430b-49e2-817b-5acd0f445fe9@FreeBSD.org>
References:  <f97d80ee-0b01-4d68-beb5-53e905f0404c@FreeBSD.org> <e74464be-09b6-43e2-9365-7b0271b2d6eb@FreeBSD.org> <cc136316-f285-41bd-8d59-c5adce06e277@quip.cz> <065f4f5c-f38b-45f4-b7e7-5248f871f7e6@FreeBSD.org> <CANCZdfrYCk7%2B6wCALvszmNZOcZeDxxNp%2Bk5PyH%2BTGJZ%2BovsU=Q@mail.gmail.com> <d11ffb2e-0ee8-4c20-b5d9-5ea63463adba@FreeBSD.org> <2f91eeb7-430b-49e2-817b-5acd0f445fe9@FreeBSD.org>


On Sun, Jan 7, 2024 at 1:57 PM Lev Serebryakov <lev@freebsd.org> wrote:

> On 07.01.2024 21:49, Lev Serebryakov wrote:
>
> > On 07.01.2024 19:34, Warner Losh wrote:
> >
> >> I must have missed it. What were the diagnostics?
>
>   Oh, and two "nvlist inconsistency" before that vvvv
>
> > zio_read error: 5
> > zio_read error: 5
> > zio_read error: 5
>

5 is EIO, which the loader uses internally for any error that the disk
reports. I've not read through all the code involved here, but I think
that means there might be real read errors.

Though the nvlist inconsistency might be an issue.

So, if this is a mirror, with ada0 blank and ada1 holding good data, in
theory you should be fine. However, perhaps ZFS is finding a real error
from ada1. Does all of ada1 read with a simple dd?
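
A minimal sketch of that "does all of ada1 read" check, written in Python rather than dd so it can list which offsets fail with EIO (errno 5 on FreeBSD and Linux, the same code the loader printed). The function name `scan_readable`, the 1 MiB block size, and the `max_errors` cap are illustrative assumptions, not part of any FreeBSD tool. Reading a raw device such as /dev/ada1 usually needs root, and the reported offsets are only as precise as the block size.

```python
import errno
import os


def scan_readable(path, block_size=1024 * 1024, max_errors=64):
    """Read `path` sequentially, roughly like `dd if=path of=/dev/null`.

    Returns the byte offsets of blocks whose read failed with EIO.
    An empty list means every block read cleanly.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while True:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError as e:
                if e.errno != errno.EIO:
                    raise  # not a media error: do not mask it
                bad_offsets.append(offset)
                if len(bad_offsets) >= max_errors:
                    break  # give up on a badly failing disk
                offset += block_size  # skip the unreadable block
                continue
            if not chunk:
                break  # end of device or file
            offset += len(chunk)
    finally:
        os.close(fd)
    return bad_offsets
```

For example, `scan_readable("/dev/ada1")` returning an empty list would suggest the media reads cleanly end to end, pointing back at the on-disk pool state (the nvlist and MOS errors) rather than the disk itself.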

Not sure about the losing devices you described later on.

> > ZFS: i/o error - all block copies unavailable
> > ZFS: can't read MOS of pool zroot
> >
> >
> >   To be honest, I think there is something else, because the sequence of
> > events was (sorry, it's long, but I think every detail matters here):
>

Yea. There's something that's failing, which zio_read is woefully
underreporting for our diagnostic efforts. And/or something is getting
confused by the blank disk and/or the partially resilvered disk.


> > (1) Update to 13.2 from 12.4, with installation of new gptzfsboot with
> > gpart on both disks. It could place new /boot far away, but see (2).
> > (2) Reboot, which completed, but showed that ada0 has problems.
> > (3) Replacement of ada0 by DC technicians; new disk is 512/4096, old
> > disk is 512/512, pool has ashift=9.
> > (4) Server refuses to boot from ada1 (ada0 is empty) with diagnostics
> > (see above).
> > (5) Linux rescue system, passing 2 devices to qemu with FreeBSD (because
> > Linux shows that ZFS is on the whole disk, not on a partition!).
> > (6) Re-creation of GPT on ada0, start of resilver (with sub-optimal
> > ashift!).
> > (7) Interruption of resilver with reboot, because it is painfully slow
> > under qemu.
> > (8) Wipe of ada0 (at this point the resilver status of the pool becomes
> > crazy) to put a live FreeBSD image on it to boot somehow.
> > (9) Many tries to cancel the resilver and boot from the single-disk
> > "historical" pool on ada1, no success. I've attributed it to the strange
> > state of the pool: one component, no mirror, but "resilvering".
> > (10) Boot from small UFS partition (which replaces swap partition).
> > (11) Pool on ada1 (old, live, 512/512 disk) is still "Resilvering"
> > without any additional components (with zero speed, of course).
> > (12) Prepare partitions on ada0 again, creating new pool with ashift=12,
> > send|receive.
> > (13) Removing partition table on ada1 (with old pool, ashift=9, still
> > resilvering after many, many reboots with only one device in it).
>
>   And please note: this pool on ada1 (old, live disk) was NOT upgraded
> after 12-STABLE. It was an old, 12-STABLE "level" pool with all new
> features disabled.
>
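
For reference, a pool's ashift is the base-2 logarithm of its minimum allocation size: ashift=9 means 512-byte blocks, ashift=12 means 4096-byte blocks. Reading "512/4096" as 512-byte logical and 4096-byte physical sectors (an assumption about the notation), an ashift=9 vdev on the new disk makes the drive do internal read-modify-write for sub-4K writes, which is why step (12) used ashift=12. The helper names below are hypothetical, a minimal sketch of that arithmetic only.

```python
def ashift_for(sector_size):
    """Return log2(sector_size), the ashift matching a given sector size."""
    if sector_size <= 0 or sector_size & (sector_size - 1):
        raise ValueError("sector size must be a positive power of two")
    return sector_size.bit_length() - 1


def ashift_is_suboptimal(pool_ashift, physical_sector_size):
    """True when 2**ashift is smaller than the disk's physical sector,
    so a single ZFS block write covers only part of a physical sector."""
    return (1 << pool_ashift) < physical_sector_size
```

With the numbers from the thread, `ashift_is_suboptimal(9, 4096)` is true for the new disk, while `ashift_is_suboptimal(9, 512)` is false for the old 512/512 disk.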

Yea, this isn't *THAT*OtHER* problem :).

Warner


> --
> // Lev Serebryakov
>
>



