Date:      Sat, 12 Aug 2023 22:38:54 -0400
From:      Garrett Wollman <wollman@bimajority.org>
To:        freebsd-stable@freebsd.org
Subject:   Interesting (Open)ZFS issue
Message-ID:  <25816.16958.659259.797522@hergotha.csail.mit.edu>

On Friday, a server that I had upgraded to 13.2 the day before
suddenly developed faults on both SSDs in its root pool.  The
machine was a Dell R430, so the two SSDs were behind an LSI/whoever HBA
(new enough that it's an mpr(4) and not mps or mpt).  The first disk
started reporting the exceedingly obscure:

(da0:mpr0:0:0:0): SCSI sense: ILLEGAL REQUEST asc:74,79 (Security conflict in translated device)

(This error is so obscure that the only place I could find it was in
the SCSI-ATA translator specification and all of the lists of SCSI
sense codes that copy the message directly from it.)

The other drive started throwing more "normal", or at least
interpretable, uncorrectable read errors at the same time.
I immediately powered the machine off and, when I got into the data
center, moved the mostly-working drive to another server so I could
copy off whatever bits were still readable using `dd conv=sync,noerror`.
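
A minimal sketch of that kind of salvage copy follows. The function
name, block size, device name and destination path are assumptions for
illustration, not the exact command used here.

```shell
# Hypothetical helper: image a failing disk, continuing past read errors.
#   $1 = source device or file (e.g. /dev/da1, an assumed name)
#   $2 = destination image     (e.g. /scratch/da1.img, an assumed path)
rescue_image() {
    src=$1
    dst=$2
    # noerror: do not stop when a read fails.
    # sync:    pad every short or failed input block with NULs up to the
    #          full block size, so later data keeps its original offset.
    # bs:      65536 is an arbitrary choice; a read error zero-fills the
    #          whole block, so a smaller value loses less data per error
    #          at the cost of a slower copy.
    dd if="$src" of="$dst" bs=65536 conv=sync,noerror
}
```

Usage would look like `rescue_image /dev/da1 /scratch/da1.img`. Because
`sync` also pads the final partial block, the image can come out
slightly larger than the source.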

Mounting the copy, with zero-filled blocks in place of the errored
blocks, I could `zpool scrub` it and it would unsurprisingly find a
bunch of errors, but otherwise complete "successfully".  I could copy
most of the data off, but attempting to read certain parts would
result in a panic about one second later -- fast enough that I could
not catch the panic message, but enough of a delay that the `cp`
command completed and exited to the shell prompt before the panic.
Attempting to destroy one of the snapshots with a lot of the errored
blocks in it would insta-panic.

This seems to me like a bug: `zpool scrub` correctly identified the
damaged parts of the disk, so ZFS knows that those regions of the pool
are bad in some way -- they should cause an error rather than a panic!

I did manage to (mostly successfully) migrate the data and essential
functions of the old server to new drives and new-to-us hardware, so
I'm not looking for debugging help here, but I wanted to at least get
this issue into the archives as something that can happen.  Because
the data (network traces) is sensitive, I unfortunately can't provide
an image of the filesystem for debugging purposes.

-GAWollman