From: Steven Hartland <steven@multiplay.co.uk>
Date: Fri, 25 Feb 2022 13:30:32 +0000
Subject: Re: zfs mirrored pool dead after a disk death and reset
To: "Eugene M. Zheganin"
Cc: stable@freebsd.org
List-Id: Production branch of FreeBSD source code
List-Archive: https://lists.freebsd.org/archives/freebsd-stable

Have you tried removing the dead disk physically? I've seen a bad disk send
bad data to the controller in the past, causing knock-on issues.

Also, the output doesn't show multiple devices, only nvd0. I'm hoping you
didn't use nv raid to create the mirror, as that would mean there's no ZFS
protection?

On Fri, 25 Feb 2022 at 11:07, Eugene M. Zheganin <eugene@zhegan.in> wrote:
> Hello.
>
> Recently a disk died in one of my servers running 12.2 (12.2-RELEASE-p2).
> So... it died, I got a bunch of dmesg errors saying there were a bunch of
> stuck I/O commands, and the OS became partially livelocked (I could still
> log in, but could barely do anything). So, considering this is a mirrored
> pool, and "I have done it many times before, nothing could be safer!",
> I sent a reset to the server via IPMI.
>
> And it was quite discouraging to find this after a successful boot-up from
> the intact zroot (yes, I've already tried zpool import -F after an export,
> so initially it was imported already, showing the same devastating state):
>
> [root@db0:~]# zpool import
>    pool: data
>      id: 15967028801499953224
>   state: FAULTED
>  status: One or more devices contains corrupted data.
>  action: The pool cannot be imported due to damaged devices or data.
>         The pool may be active on another system, but can be imported using
>         the '-f' flag.
>    see: http://illumos.org/msg/ZFS-8000-5E
>  config:
>
>         data                   FAULTED  corrupted data
>           9566965891719887395  FAULTED  corrupted data
>           nvd0                 ONLINE
>
> # zpool import -F data
> cannot import 'data': one or more devices is currently unavailable
>
> Well, yeah, I do have a replica and I didn't lose one bit of data, but it's
> still a tragedy to lose a pool after one silly reset (and I have done it
> literally a hundred times before on various servers and FreeBSD versions).
>
> So, a couple of questions:
>
> - is it worth trying FreeBSD 13 to recover? (just to get the experience of
>   whether it can still be recovered)
>
> - is it more dangerous with NVMes, or would this also happen on
>   SSD/rotational drives?
>
> - would zpool checkpoint have saved me in this case?
>
> Thanks.
>
> Eugene.
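[Archive note on the zpool checkpoint question: a checkpoint only helps if it was taken before the damage occurred, so it would have had to be created before the reset. A rough sketch of the workflow, using this thread's pool name "data" and the standard zpool(8) subcommands; this is illustrative, not something attempted in the thread:]

```shell
# Take a checkpoint before risky maintenance (e.g. before resetting a
# machine whose mirror is already degraded). This preserves the pool's
# current on-disk state.
zpool checkpoint data

# If the pool is damaged afterwards, rewind to the checkpointed state
# at import time instead of importing the damaged current state.
zpool export data
zpool import --rewind-to-checkpoint data

# Once the pool is confirmed healthy, discard the checkpoint: while it
# exists it pins old blocks and consumes space as the pool diverges.
zpool checkpoint --discard data
```

Note that rewinding discards everything written after the checkpoint, so it is a last-resort recovery step, not a substitute for replication.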