Date: Mon, 19 Aug 2024 15:19:50 -0600
From: Pamela Ballantyne <boyvalue@gmail.com>
To: freebsd-fs@freebsd.org
Subject: ZFS: Suspended Pool due to allegedly uncorrectable I/O error
Message-ID: <CAESeg0wm9iZN=8tpo_rtC1hgRMzyr4dhJrQyxgSuiS183_9y8A@mail.gmail.com>
Hi,

This is long, so here's the TL;DR: ZFS suspended a pool for presumably good reasons, but on reboot, there didn't seem to be any good reason for it.

As background, I'm an early adopter of ZFS. I have a remote server that has been running ZFS continuously since late 2010, 24x7, and I also use ZFS on my home machines. While I don't claim to be a ZFS expert, I've managed to handle the various issues that have come up over the years and haven't had to ask the experts for help.

But now I am completely baffled and would appreciate any help, advice, pointers, links, whatever.

On Sunday morning, 8/11, I upgraded the server from 12.4-RELEASE-p9 to 13.3-RELEASE-p5. The upgrade went smoothly; there were no problems, and the server worked flawlessly post-upgrade.

On Thursday evening, 8/15, the server became unreachable. It would still respond to pings on its IP address, but that was it. I used to be able to access the server via IPMI, but that ability disappeared several company mergers ago. The current NOC staff sent me a screenshot of the server console, which showed repeated messages saying:

"Solaris: WARNING: Pool 'zroot' has encountered an uncorrectable I/O failure and has been suspended."

There had been no warnings in the log files, nothing. There was no sign of trouble from the S.M.A.R.T. monitoring, either.

It's a simple mirrored setup with just two drives, so I expected a catastrophic hardware failure. Maybe the HBA had failed (this is on a SuperMicro Blade server), or both drives had managed to die at the same time.

Without any way to log in remotely, I requested a reboot. The server rebooted without errors, and I could ssh into my account and poke around. Everything was normal. There were no log entries related to the crash. I realize that post-crash there would have been no filesystem to write to, but there was also nothing leading up to it - no hardware- or disk-related messages of any kind. The only sign of any problem I could find was two checksum errors listed against one of the drives in the mirror when I ran zpool status.

I ran a scrub, which completed without any problem or error. About 30 minutes after the scrub, the two checksum errors disappeared without my clearing them manually. I've run some drive tests, and both drives pass with flying colors. It's now been a few days, and the system has been performing flawlessly.

So I am completely flummoxed. I am trying to understand why the pool was suspended when it looks like something ZFS should have handled easily. I've had complete drive failures before, and ZFS just kept on going. Is there a bug or incompatibility in 13.3-p5? Is this something that will recur on each full moon?

Thanks in advance for any advice, shared experiences, or whatever you can offer.

Best,
Pammy
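P.S. In case the exact commands matter, this is approximately what I did after the reboot, reconstructed from memory. The device names are placeholders, and I'm assuming smartctl counts as a fair description of my "drive tests":

    # pool state after the reboot; this is where I saw the two
    # checksum errors against one drive of the mirror
    zpool status -v zroot

    # scrub the pool, then check on it afterward
    zpool scrub zroot
    zpool status zroot

    # note: I never ran "zpool clear zroot" - the checksum
    # counters reset on their own about 30 minutes after the scrub

    # drive tests (adaN are placeholder device names)
    smartctl -a /dev/ada0
    smartctl -a /dev/ada1
    smartctl -t long /dev/ada0
    smartctl -t long /dev/ada1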