From: Garrett Wollman <wollman@bimajority.org>
To: freebsd-stable@freebsd.org
Subject: Interesting (Open)ZFS issue
Date: Sat, 12 Aug 2023 22:38:54 -0400
Message-ID: <25816.16958.659259.797522@hergotha.csail.mit.edu>

On Friday, a server that I had upgraded to 13.2 the day before
suddenly developed faults on both SSDs in its root pool.  The machine
was a Dell R430, so the two SSDs were behind an LSI/whoever HBA (new
enough that it's an mpr(4) and not mps or mpt).

The first disk started reporting the exceedingly obscure:

    (da0:mpr0:0:0:0): SCSI sense: ILLEGAL REQUEST asc:74,79 (Security conflict in translated device)

(This error is so obscure that the only places I could find it were
the SCSI-ATA translator specification and the lists of SCSI sense
codes that copy the message directly from it.)  The other drive
started throwing more "normal", or at least interpretable,
uncorrectable read errors at the same time.

I immediately powered the machine off and, when I got into the data
center, moved the mostly-working drive to another server so I could
copy off whatever bits were still readable using
`dd conv=sync,noerror`.
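The invocation was of roughly this shape (the device node, output
path, and block size below are illustrative, not the actual ones):

    # Salvage whatever is readable: noerror keeps dd going past
    # read errors, and sync zero-pads the short reads so that
    # offsets in the copy stay aligned with the original disk.
    dd if=/dev/da1 of=/rescue/da1.img bs=64k conv=sync,noerror

(A smaller bs loses less data around each bad sector, at the cost of
a slower copy.)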
Mounting the copy, with zero-filled blocks in place of the errored
blocks, I could `zpool scrub` it; unsurprisingly the scrub would find
a bunch of errors, but otherwise complete "successfully".  I could
copy most of the data off, but attempting to read certain parts would
result in a panic about one second later -- fast enough that I could
not catch the panic message, but enough of a delay that the `cp`
command completed and returned to the shell prompt before the panic.
Attempting to destroy one of the snapshots that contained a lot of
the errored blocks would insta-panic.

This seems to me like a bug: `zpool scrub` correctly identified the
damaged parts of the disk, so ZFS knows that those regions of the
pool are bad in some way -- they should cause an error rather than a
panic!

I did manage to (mostly successfully) migrate the data and essential
functions of the old server to new drives and new-to-us hardware, so
I'm not looking for debugging help here, but I wanted to at least get
this issue into the archives as something that can happen.  Because
the data (network traces) is sensitive, I unfortunately can't provide
an image of the filesystem for debugging purposes.

-GAWollman
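P.S.  For anyone wanting to poke at a salvaged image the same way,
the setup is along these lines (a sketch only; the pool name "zroot"
and the paths are illustrative):

    # Attach the image file as a memory disk (prints e.g. md0).
    mdconfig -a -t vnode -f /rescue/da1.img
    # Import the pool under an altroot so its mountpoints don't
    # shadow the host's; -f because it was last used elsewhere.
    zpool import -f -R /mnt zroot
    # Scrub, then list the files with unrecoverable errors.
    zpool scrub zroot
    zpool status -v zroot

Reading back some of the damaged data is what produced the panics
described above.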