From: Steven Hartland <steven@multiplay.co.uk>
Date: Fri, 25 Feb 2022 13:30:32 +0000
Subject: Re: zfs mirrored pool dead after a disk death and reset
To: "Eugene M. Zheganin"
Cc: stable@freebsd.org
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

Have you tried removing the dead disk physically? I've seen a bad disk send
bad data to the controller in the past, causing knock-on issues.

Also, the output doesn't show multiple devices, only nvd0. I'm hoping you
didn't use nv raid to create the mirror, as that would mean there's no ZFS
protection?

On Fri, 25 Feb 2022 at 11:07, Eugene M. Zheganin <eugene@zhegan.in> wrote:
> Hello.
>
> Recently a disk died in one of my servers running 12.2 (12.2-RELEASE-p2).
> So... it died, I got a bunch of dmesg errors saying there were a bunch of
> stuck I/O commands, and the OS became partially livelocked (I could still
> log in, but could barely do anything). So, considering this is a mirrored
> pool, and "I have done it many times before, nothing could be safer!",
> I sent a reset to the server via IPMI.
>
> And it was quite discouraging to find this after a successful boot-up from
> the intact zroot (yes, I've already tried zpool import -F after an export,
> so initially it was imported already, showing the same devastating state):
>
> [root@db0:~]# zpool import
>    pool: data
>      id: 15967028801499953224
>   state: FAULTED
>  status: One or more devices contains corrupted data.
>  action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported using
>         the '-f' flag.
>    see: http://illumos.org/msg/ZFS-8000-5E
>  config:
>
>         data                   FAULTED  corrupted data
>           9566965891719887395  FAULTED  corrupted data
>           nvd0                 ONLINE
>
> # zpool import -F data
> cannot import 'data': one or more devices is currently unavailable
>
> Well, yeah, I do have a replica and I didn't lose one bit of data, but it's
> still a tragedy to lose a pool after one silly reset (and I have done it
> literally a hundred times before on various servers and FreeBSD versions).
>
> So, a couple of questions:
>
> - is it worth trying FreeBSD 13 to recover? (just to get the experience of
>   whether it can still be recovered)
>
> - is it more dangerous with NVMes, or would this also happen on
>   SSD/rotational drives?
>
> - would zpool checkpoint have saved me in this case?
>
> Thanks.
>
> Eugene.
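[Archive note on the zpool checkpoint question: a checkpoint only helps if it was taken before the damage occurred, so it would have had to be created before the reset. A rough sketch of the workflow, using this thread's pool name "data" and the standard zpool(8) subcommands; this is illustrative, not something attempted in the thread:]

```shell
# Take a checkpoint before risky maintenance (e.g. before resetting a
# machine whose mirror is already degraded). This preserves the pool's
# current on-disk state.
zpool checkpoint data

# If the pool is damaged afterwards, rewind to the checkpointed state
# at import time instead of importing the damaged current state.
zpool export data
zpool import --rewind-to-checkpoint data

# Once the pool is confirmed healthy, discard the checkpoint: while it
# exists it pins old blocks and consumes space as the pool diverges.
zpool checkpoint --discard data
```

Note that rewinding discards everything written after the checkpoint, so it is a last-resort recovery step, not a substitute for replication.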