Date:      Fri, 25 Feb 2022 13:30:32 +0000
From:      Steven Hartland <killing@multiplay.co.uk>
To:        "Eugene M. Zheganin" <eugene@zhegan.in>
Cc:        stable@freebsd.org
Subject:   Re: zfs mirrored pool dead after a disk death and reset
Message-ID:  <CAHEMsqYUt9EFFkLqw1fecfcBC0ts6WkkK2i4EqVDSN1ELJiERw@mail.gmail.com>
In-Reply-To: <d959873f-3a0d-8f81-193d-f1f70c48eaa7@zhegan.in>
References:  <d959873f-3a0d-8f81-193d-f1f70c48eaa7@zhegan.in>


Have you tried removing the dead disk physically? I've seen in the past a
bad disk causing bad data to be sent to the controller, causing knock-on
issues.

Also, the output doesn't show multiple devices, only nvd0. I'm hoping you
didn't use nv raid to create the mirror, as that would mean there's no ZFS
protection.
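
Before doing anything else it might be worth checking what ZFS actually
sees on the surviving device. Roughly something like the below (a sketch
from memory, untested here - adjust device/partition names to your layout):

  # gpart show nvd0       (was the pool built on a partition or the raw disk?)
  # zdb -l /dev/nvd0      (dump the ZFS labels; they should name a mirror vdev)
  # zpool import -d /dev  (rescan all devices for importable pool configs)
  # nvmecontrol devlist   (confirm which NVMe namespaces FreeBSD still sees)

If the labels on nvd0 only describe a single top-level vdev rather than a
mirror, that would explain why the pool won't import with the other disk
gone.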

On Fri, 25 Feb 2022 at 11:07, Eugene M. Zheganin <eugene@zhegan.in> wrote:

> Hello.
>
> Recently a disk died in one of my servers running 12.2
> (12.2-RELEASE-p2). So... it died, I got a bunch of dmesg errors about
> stuck I/O commands, and the OS became partially livelocked (I could
> still log in, but could barely do anything). So, since this is a
> mirrored pool, and "I have done it many times before, nothing could
> be safer!", I sent a reset to the server via IPMI.
>
> And it was quite discouraging to find this after a successful boot-up
> from the intact zroot (yeah, I've already tried zpool import -F after an
> export, so initially it was imported already, showing the same
> devastating state):
>
>
> [root@db0:~]# zpool import
>    pool: data
>      id: 15967028801499953224
>   state: FAULTED
>  status: One or more devices contains corrupted data.
>  action: The pool cannot be imported due to damaged devices or data.
>          The pool may be active on another system, but can be imported using
>          the '-f' flag.
>     see: http://illumos.org/msg/ZFS-8000-5E
>  config:
>
>         data                   FAULTED  corrupted data
>           9566965891719887395  FAULTED  corrupted data
>           nvd0                 ONLINE
>
>
> # zpool import -F data
> cannot import 'data': one or more devices is currently unavailable
>
>
> Well, yeah, I do have a replica, so I didn't lose a single bit of data,
> but it's still a tragedy to lose a pool after one silly reset (and I have
> done it literally a hundred times before on various servers and FreeBSD
> versions).
>
> So, a couple of questions:
>
> - is it worth trying FreeBSD 13 to recover? (just to get the experience,
> if it can still be recovered)
>
> - is it more dangerous with NVMes, or could this also happen on
> SSDs/rotational drives?
>
> - would zpool checkpoint have saved me in this case?
>
>
> Thanks.
>
> Eugene.
>
>
>
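
On the recovery side, and purely as a sketch (verify against the man pages,
and only after your replica is confirmed good): beyond the plain -F you
already tried, there are a couple of progressively more aggressive import
modes you could try:

  # zpool import -o readonly=on -f data   (read-only import, changes nothing on disk)
  # zpool import -F -n data               (dry run: report whether a rewind could work)
  # zpool import -F -X data               (extreme rewind, searches further back for a valid txg)

The -X variant can run for a long time and can throw away recent
transactions, so it's very much a last resort. FreeBSD 13 ships OpenZFS
rather than the legacy code, so trying the read-only import from a 13.x
live image is cheap to do, though I wouldn't promise it behaves any
differently here.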

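As for zpool checkpoint: it only helps if a checkpoint was taken before
the failure, and rewinding rolls the whole pool back to that point, so
it's not a substitute for the mirror itself. From memory (double-check
zpool(8)) the workflow is roughly:

  # zpool checkpoint data                      (take a checkpoint while things are healthy)
  # zpool export data
  # zpool import --rewind-to-checkpoint data   (roll the pool back to the checkpoint state)
  # zpool checkpoint -d data                   (discard it once you're happy)

It guards against mistakes made after the checkpoint, but it wouldn't have
helped if the labels on the surviving disk are what got corrupted.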


