Date: Sat, 12 Aug 2023 22:38:54 -0400
From: Garrett Wollman <wollman@bimajority.org>
To: freebsd-stable@freebsd.org
Subject: Interesting (Open)ZFS issue
Message-ID: <25816.16958.659259.797522@hergotha.csail.mit.edu>
On Friday, a server that I had upgraded to 13.2 the day before suddenly developed faults on both SSDs in its root pool. The machine was a Dell R430, so the two SSDs were behind an LSI/whoever HBA (new enough that it's an mpr(4) and not mps or mpt).

The first disk started reporting the exceedingly obscure:

    (da0:mpr0:0:0:0): SCSI sense: ILLEGAL REQUEST asc:74,79 (Security conflict in translated device)

(This error is so obscure that the only place I could find it was in the SCSI-ATA translator specification and all of the lists of SCSI sense codes that copy the message directly from it.) The other drive started throwing more "normal", or at least interpretable, uncorrectable read errors at the same time.

I immediately powered the machine off and, when I got into the data center, moved the mostly-working drive to another server so I could copy off whatever bits were still readable using `dd conv=sync,noerror`.

Mounting the copy, with zero-filled blocks in place of the errored blocks, I could `zpool scrub` it and it would unsurprisingly find a bunch of errors, but otherwise complete "successfully". I could copy most of the data off, but attempting to read certain parts would result in a panic about one second later -- fast enough that I could not catch the panic message, but enough of a delay that the `cp` command completed and exited to the shell prompt before the panic. Attempting to destroy one of the snapshots with a lot of the errored blocks in it would insta-panic.

This seems to me like a bug: `zpool scrub` correctly identified the damaged parts of the disk, so ZFS knows that those regions of the pool are bad in some way -- they should cause an error rather than a panic!

I did manage to (mostly successfully) migrate the data and essential functions of the old server to new drives and new-to-us hardware, so I'm not looking for debugging help here, but I wanted to at least get this issue into the archives as something that can happen. Because the data (network traces) is sensitive, I unfortunately can't provide an image of the filesystem for debugging purposes.

-GAWollman