From: Pamela Ballantyne <boyvalue@gmail.com>
Date: Mon, 19 Aug 2024 15:19:50 -0600
To: freebsd-fs@freebsd.org
Subject: ZFS: Suspended Pool due to allegedly uncorrectable I/O error
List-Id: Filesystems
List-Archive: https://lists.freebsd.org/archives/freebsd-fs
Hi,

So, this is long, so here's the TL;DR: ZFS suspended a pool for presumably good reasons, but on reboot there didn't seem to be any good reason for it.

As background, I'm an early adopter of ZFS. I have a remote server that has been running ZFS continuously, 24x7, since late 2010. I also use ZFS on my home machines. While I do not claim to be a ZFS expert, I've managed to handle the various issues that have come up over the years and haven't had to ask the experts for help. But now I am completely baffled and would appreciate any help, advice, pointers, links, whatever.

On Sunday morning, 08/11, I upgraded the server from 12.4-RELEASE-p9 to 13.3-RELEASE-p5. The upgrade went smoothly; there were no problems, and the server worked flawlessly post-upgrade.

On Thursday evening, 08/15, the server became unreachable. It would still respond to pings to its IP address, but that was it. I used to be able to access the server via IPMI, but that ability disappeared several company mergers ago. The current NOC staff sent me a screenshot of the console output, which showed repeated messages saying:

"Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended."

There had been no warnings in the log files, nothing. There was no sign of trouble from the S.M.A.R.T. monitoring, nothing.

It's a simple mirrored setup with just two drives, so I expected a catastrophic hardware failure.
Maybe the HBA had failed (this is on a SuperMicro Blade server), or both drives had managed to die at the same time.

Without any way to log in remotely, I requested a reboot. The server rebooted without errors. I could ssh into my account and poke around. Everything was normal. There were no log entries related to the crash. I realize that post-crash there would have been no filesystem to write to, but there was also nothing leading up to it - no hardware- or disk-related messages of any kind. The only sign of any problem I could find was two checksum errors listed against just one of the drives in the mirror when I ran zpool status.

I ran a scrub, which completed without any problems or errors. About 30 minutes after the scrub, the two checksum errors disappeared without my manually clearing them. I've run some drive tests, and both drives pass with flying colors. It's now been a few days, and the system has been performing flawlessly.

So, I am completely flummoxed. I am trying to understand why the pool was suspended when it looks like something ZFS should have easily handled. I've had complete drive failures before, and ZFS just kept on going. Is there a bug or incompatibility in 13.3-p5? Is this something that will recur with each full moon?

Thanks in advance for any advice, shared experiences, or whatever you can offer.

Best,
Pammy
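P.S. In case it helps anyone hitting the same thing, here is roughly the triage I ran, as a sketch from memory - the zpool/smartctl invocations are the standard ones, but the device names (ada0/ada1) are stand-ins for my actual drives, and the sum_cksum helper is just something I cooked up, not any official tool:

```shell
#!/bin/sh
# The commands themselves (run as root; pool name 'zroot' as on my box):
#   zpool status -v zroot       # per-vdev READ/WRITE/CKSUM counters
#   zpool scrub zroot           # re-read and verify every allocated block
#   zpool clear zroot           # reset the counters once the drives are trusted again
#   smartctl -t long /dev/ada0  # long drive self-test (sysutils/smartmontools)

# Small helper: sum the CKSUM column from saved `zpool status` output,
# so a periodic job can alert on nonzero counts instead of me finding
# them by accident. Matches vdev/device lines only, not the header.
sum_cksum() {
    awk '$1 ~ /^(mirror|raidz|ada|da|nda|gpt)/ && NF >= 5 { total += $5 }
         END { print total + 0 }'
}

# Example with the status I saw after the reboot (two checksum errors
# on the second mirror member):
sum_cksum <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     2
EOF
# prints: 2
```

The awk helper is only a convenience for diffing counters between runs; `zpool status -x` by itself is usually enough to tell you whether anything is unhealthy.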