From: Garrett Wollman <wollman@bimajority.org>
To: freebsd-stable@freebsd.org
Subject: Interesting (Open)ZFS issue
Date: Sat, 12 Aug 2023 22:38:54 -0400
Message-ID: <25816.16958.659259.797522@hergotha.csail.mit.edu>

On Friday, a server that I had upgraded to 13.2 the day before
suddenly developed faults on both SSDs in its root pool.  The machine
was a Dell R430, so the two SSDs were behind an LSI/whoever HBA (new
enough that it's an mpr(4) and not mps or mpt).

The first disk started reporting the exceedingly obscure:

    (da0:mpr0:0:0:0): SCSI sense: ILLEGAL REQUEST asc:74,79 (Security conflict in translated device)

(This error is so obscure that the only places I could find it were
the SCSI-ATA translator specification and the lists of SCSI sense
codes that copy the message directly from it.)  The other drive
started throwing more "normal", or at least interpretable,
uncorrectable read errors at the same time.

I immediately powered the machine off and, when I got into the data
center, moved the mostly-working drive to another server so I could
copy off whatever bits were still readable using
`dd conv=sync,noerror`.
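The invocation was of roughly this shape (the device node, output
path, and block size below are illustrative, not the actual ones):

    # Salvage whatever is readable: noerror keeps dd going past
    # read errors, and sync zero-pads the short reads so that
    # offsets in the copy stay aligned with the original disk.
    dd if=/dev/da1 of=/rescue/da1.img bs=64k conv=sync,noerror

(A smaller bs loses less data around each bad sector, at the cost of
a slower copy.)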
Mounting the copy, with zero-filled blocks in place of the errored
blocks, I could `zpool scrub` it; unsurprisingly the scrub would find
a bunch of errors, but otherwise complete "successfully".  I could
copy most of the data off, but attempting to read certain parts would
result in a panic about one second later -- fast enough that I could
not catch the panic message, but enough of a delay that the `cp`
command completed and returned to the shell prompt before the panic.
Attempting to destroy one of the snapshots that contained a lot of
the errored blocks would insta-panic.

This seems to me like a bug: `zpool scrub` correctly identified the
damaged parts of the disk, so ZFS knows that those regions of the
pool are bad in some way -- they should cause an error rather than a
panic!

I did manage to (mostly successfully) migrate the data and essential
functions of the old server to new drives and new-to-us hardware, so
I'm not looking for debugging help here, but I wanted to at least get
this issue into the archives as something that can happen.  Because
the data (network traces) is sensitive, I unfortunately can't provide
an image of the filesystem for debugging purposes.

-GAWollman
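P.S.  For anyone wanting to poke at a salvaged image the same way,
the setup is along these lines (a sketch only; the pool name "zroot"
and the paths are illustrative):

    # Attach the image file as a memory disk (prints e.g. md0).
    mdconfig -a -t vnode -f /rescue/da1.img
    # Import the pool under an altroot so its mountpoints don't
    # shadow the host's; -f because it was last used elsewhere.
    zpool import -f -R /mnt zroot
    # Scrub, then list the files with unrecoverable errors.
    zpool scrub zroot
    zpool status -v zroot

Reading back some of the damaged data is what produced the panics
described above.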