From nobody Tue Dec 23 09:31:43 2025
X-Original-To: current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4db8vl40qkz6L2yB
 for ; Tue, 23 Dec 2025 09:33:39 +0000 (UTC)
 (envelope-from Alexander@Leidinger.net)
Received: from mailgate.Leidinger.net (bastille.leidinger.net [89.238.82.207])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
 client-signature ECDSA (prime256v1) client-digest SHA256)
 (Client CN "mailgate.leidinger.net", Issuer "E7" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 4db8vl04Vnz3fnt
 for ; Tue, 23 Dec 2025 09:33:38 +0000 (UTC)
 (envelope-from Alexander@Leidinger.net)
Authentication-Results: mx1.freebsd.org; none
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help:
List-Post:
List-Subscribe:
List-Unsubscribe:
Sender: owner-freebsd-current@FreeBSD.org
MIME-Version: 1.0
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=leidinger.net;
 s=outgoing-alex; t=1766482364;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=wlMo8pqmU185QhCA/AwORucrk74SzhuKeCNsoyvB5N4=;
 b=nzIY0GNtmMBrpQWt5pgmw/rSy83241vfMZBhCzApd8lP5cQ1ttkRnuENdF05PX6zAsV1nx
 rp12qqGBh52tgCE0jwAhKYmH9EV6E8wIafP/71A/DpLZHz5LLr1k6H3A7NHDdbl9Tmw7SO
 iGpB9bWq21uxv1XGooIfEuiRmNVYZYVP3H7fmp7slSaS39aapSZNSQ9Hc3ggmr6rQVrd6t
 VHhHw/QUbejZs1FMlM///cUNo0yFReX6Bt1Wg11z9CEChFgbNFtw7m46YAwhLMHPLpzN62
 AzZLT4IXSdp3/cYVqQct4nKrFy7icW2DnB/r3nMrTc9bMV5DEbVViJWLn4Z1Ng==
Date: Tue, 23 Dec 2025 10:31:43 +0100
From: Alexander Leidinger
To: Warner Losh
Cc: Current
Subject: Re: Changes in cam/nvme causes issues?
In-Reply-To:
References: <198170948d34f4dc169e94934da82161@Leidinger.net> <89a92e0a926239e2c192dc0ff9c80d6e@Leidinger.net>
Message-ID:
Organization: No organization, this is a private message.
Content-Type: multipart/signed; protocol="application/pgp-signature"; boundary="=_c58046925ce793454f8ac90d7ad4be2e"; micalg=pgp-sha256
X-Spamd-Bar: ----
X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[]; ASN(0.00)[asn:34240, ipnet:89.238.64.0/18, country:DE]
X-Rspamd-Pre-Result: action=no action; module=replies; Message is reply to one we originated
X-Rspamd-Queue-Id: 4db8vl04Vnz3fnt

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)

--=_c58046925ce793454f8ac90d7ad4be2e
Content-Type: multipart/alternative; boundary="=_62b8689b2f5b3983b81d241231879e94"

--=_62b8689b2f5b3983b81d241231879e94
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2025-12-22 17:58, Warner Losh wrote:

> On Sun, Dec 21, 2025 at 8:37 AM Alexander Leidinger
> <Alexander@leidinger.net> wrote:
>
> On 2025-12-14 14:05, Warner Losh wrote:
>
> Let's do one issue at a time. There's too much missing info. Top
> posting since there's not a lot of context to this request.
>
> The disk died now completely, so the CRC errors are out of reach now.
>
> First, let's start with pciconf -l of the nvme drive. I have a strong
> idea, but need some data.
>
> While already provided privately with some other data, here for the
> public so that people are aware that currently there is an issue with
> such drives:
> nvme0@pci0:5:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa809 subvendor=0x144d subdevice=0xa801
> Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V
>
> Yeah, so far this is the only report I've received, and there's not
> enough data in it to reproduce it with any of the dozen NVMe drives
> that I have, or to spot a difference with what I know I check in the
> code. So if it's compiled into the kernel with cam also compiled into
> the kernel, I know it works.

CAM is in the kernel, nvme is loaded as a module (from 15-current):
---snip---
# kldstat | egrep '(nvm|cam)'
 2    1 0xffffffff811e3000    20db8 nvme.ko
---snip---

I will do a clean rebuild with the most recent 16-current and provide a
full dmesg if this still doesn't work.

Bye,
Alexander.

> Warner
>
> Bye,
> Alexander.
>
> Also, the disk report needs full logs with and without the settings
> that have uncorrectable in them. I'd expect that a shorter timeout
> would lead to different behavior, but maybe that error syndrome isn't
> one I've seen. It would also be helpful to know which of the times
> changes the behavior...
>
> Warner
>
> On Sun, Dec 14, 2025, 5:06 AM Alexander Leidinger
> <Alexander@leidinger.net> wrote:
> Hi Warner,
>
> I am trying to update a 15-current (as of 2025-11-27-110715) to a
> recent 16-current (as of 2025-12-13-132815). It fails to import a pool
> due to a missing nvme. I also have a broken HD in this system; I
> mention it to be on the safe side.
>
> This is from 15-current:
> ---snip---
>   NAME                               STATE     READ WRITE CKSUM
>   rpool                              DEGRADED     0     0     0
>     mirror-0                         DEGRADED     0     0     0
>       diskid/DISK-WD-WCC4N4KLEZT7p3  ONLINE       0     0     0
>       diskid/DISK-WD-WCC4N1DF9DA2p3  ONLINE       0     0     0
>       diskid/DISK-WD-WX52D625R0NTp3  ONLINE       0     0     0
>       diskid/DISK-WD-WCC4N1PYJ3F8p3  OFFLINE      0     0     0
>   logs
>     diskid/DISK-493504058890547p1    ONLINE       0     0     0
>   cache
>     diskid/DISK-493504058890547p2    ONLINE       0     0     0
>
>   NAME                               STATE     READ WRITE CKSUM
>   space                              DEGRADED     0     0     0
>     raidz2-0                         DEGRADED     0     0     0
>       diskid/DISK-WD-WCC4N4KLEZT7p4  ONLINE       0     0     0
>       diskid/DISK-WD-WCC4N1DF9DA2p4  ONLINE       0     0     0
>       diskid/DISK-WD-WX52D625R0NTp4  ONLINE       0     0     0
>       diskid/DISK-WD-WX52D625R2TPp4  ONLINE       0     0     0
>       diskid/DISK-WD-WCC4N1PYJ3F8p4  OFFLINE      0     0     0
>   logs
>     diskid/DISK-S649NL0T819360Vp2    ONLINE       0     0     0
>   cache
>     diskid/DISK-S649NL0T819360Vp3    ONLINE       0     0     0
> ---snip---
>
> The partitions marked offline are on the same HD (the broken one). The
> DISK-S649NL0T819360V device, used as log and cache in the second pool,
> is the one that causes the issue on 16-current.
>
> On 16-current I get "uncorrectable parity/CRC error" messages on boot
> from the broken disk. I used this to get rid of those errors:
> ---snip---
> # grep kern.cam /tmp/be_mount.MhLw/boot/loader.conf
> kern.cam.tur_timeout="60"
> kern.cam.inquiry_timeout="60"
> kern.cam.modesense_timeout="60"
> ---snip---
>
> But the second pool ("space") fails to get imported. When I import it
> via "zpool import -m space" it shows me that the log and cache devices
> (different partitions on the same hardware) are not available.
> This is the device in question as seen from 15-current:
> ---snip---
> nda0: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
> nda0: Serial Number S649NL0T819360V
> [1] nda0: nvme version 1.4
> nda0: 953869MB (1953525168 512 byte sectors)
> [1] GEOM: new disk nda0
> ...
> [1] pass6 at nvme0 bus 0 scbus6 target 0 lun 1
> pass6: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
> pass6: Serial Number S649NL0T819360V
> [1] pass6: nvme version 1.4
> ---snip---
>
> In case you need some info from the 15- or 16-current BE, which info do
> you need?
>
> Bye,
> Alexander.
>
> --
> http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF

--
http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF

--=_62b8689b2f5b3983b81d241231879e94
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

--=_62b8689b2f5b3983b81d241231879e94--

--=_c58046925ce793454f8ac90d7ad4be2e
Content-Type: application/pgp-signature; name=signature.asc
Content-Disposition: attachment; filename=signature.asc; size=833
Content-Description: OpenPGP digital signature

-----BEGIN PGP SIGNATURE-----

iQIyBAEBCAAdFiEER9UlYXp1PSd08nWXEg2wmwP42IYFAmlKYY4ACgkQEg2wmwP4
2IZ8BA/4npcxTpdRYZJ5/A0+vcT7jZvrg37cCg9EJaBb09v7AqUiaaPwC+7FjY6q
Ji4E2xBZyhXf1xC9wbDrDc2jki5xNUD14O5BISyCanARH5u2g7GP6s1blxAX1SzK
OmDG11tOMZIPrV7TycyNHh6pquaznRUxb5xbU9mc4OlGoO40OL+EiX5lUVzYDbCw
MjjfZb5MYI0QmACsxdNsdLMoftSej7KUm0vJJfF5Bs0LspLwj0uzpnqmw+dNhV9D
ROgM0co3MHnzHg3Kju0xeaXevKQ9JSioHzMFe92fg0KR6R3V9XzU69kf8mfx0t3l
fHxPY0K88KeRGDR+QoutysZrb1ssN4AkCrskvCpyG5W3aRleUwIAZPjNn48K71AR
tIPvchXnnArGhMhEMR4ZMnPRqaFJsGYoY75HzVeJxC2v5IE1XhV34/kZh9pmqyc/
lZQGoWkLHY+W/HJwa8aQUv6uPeTkmFIbm+1KZo3RGudLSPausWW0qT4lnK2dMznk
S9JDvxn3Za26FtUbmW7Es0Sq1NzznEzX+ZieNwB5oQ2Gm10LriFTg3GVHy1f5gzc
t4Rf6N5wfw2ucOu9EuAfMyVT3YqeNZy+BVWmNUt6mqpJrbboHU+91fX2X7hKVGMy
WJVV4sfKjF5YEPorPV8ZU2TessssGFjfUhZwPLleJu0cKKPbEw==
=ohgH
-----END PGP SIGNATURE-----

--=_c58046925ce793454f8ac90d7ad4be2e--