Date:      Mon, 22 Dec 2025 09:58:13 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Alexander Leidinger <Alexander@leidinger.net>
Cc:        Current <current@freebsd.org>
Subject:   Re: Changes in cam/nvme causes issues?
Message-ID:  <CANCZdfobDeZ6eZ7AgcDS7FFMzVhOU0KUFPYxwK6=s41nA=GB2Q@mail.gmail.com>
In-Reply-To: <89a92e0a926239e2c192dc0ff9c80d6e@Leidinger.net>
References:  <198170948d34f4dc169e94934da82161@Leidinger.net> <CANCZdfpRA%2B7YNV6Qm8M=wkKs1Kx_uez%2BEzAN_8W%2BAs_Amp5fAA@mail.gmail.com> <89a92e0a926239e2c192dc0ff9c80d6e@Leidinger.net>


[-- Attachment #1 --]
On Sun, Dec 21, 2025 at 8:37 AM Alexander Leidinger <Alexander@leidinger.net>
wrote:

> On 2025-12-14 14:05, Warner Losh wrote:
>
>> Let's do one issue at a time. There's too much missing info. Top posting
>> since there's not a lot of context to this request.
>
> The disk has now died completely, so the CRC errors can no longer be
> examined.
>
>> First, let's start with pciconf -l of the nvme drive. I have a strong
>> idea, but need some data.
>
> I already provided this privately along with some other data; here it is
> for the public, so that people are aware that there is currently an issue
> with such drives:
> nvme0@pci0:5:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d
> device=0xa809 subvendor=0x144d subdevice=0xa801
> Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V
>
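
For readers following along, the class= field in the pciconf line quoted
above can be read byte by byte. This is a minimal sketch and not part of the
original report: decode_pci_class and is_nvme_class are hypothetical helper
names, and the reading of 0x01 / 0x08 / 0x02 as mass storage / non-volatile
memory / NVM Express comes from the standard PCI class code assignments, not
from this thread.

```shell
#!/bin/sh
# Sketch: split the class= field of a `pciconf -l` line into its three bytes.
# decode_pci_class and is_nvme_class are hypothetical helpers for illustration.

decode_pci_class() {
    line=$1
    # Isolate the token that follows "class=" (for example 0x010802).
    class=${line#*class=}
    class=${class%% *}
    hex=${class#0x}
    # A PCI class code is three bytes: base class, subclass, programming interface.
    base=$(printf '%s' "$hex" | cut -c1-2)
    sub=$(printf '%s' "$hex" | cut -c3-4)
    progif=$(printf '%s' "$hex" | cut -c5-6)
    echo "base=0x$base sub=0x$sub progif=0x$progif"
}

is_nvme_class() {
    # Mass storage (0x01), non-volatile memory (0x08), NVM Express (0x02).
    case "$(decode_pci_class "$1")" in
        "base=0x01 sub=0x08 progif=0x02") return 0 ;;
        *) return 1 ;;
    esac
}

# The line from the report, joined back onto a single line.
line='nvme0@pci0:5:0:0: class=0x010802 rev=0x00 hdr=0x00 vendor=0x144d device=0xa809 subvendor=0x144d subdevice=0xa801'

decode_pci_class "$line"
if is_nvme_class "$line"; then
    echo "class code matches an NVMe controller"
fi
```

For the reported drive this prints base=0x01 sub=0x08 progif=0x02, so the
device identifies itself as an ordinary NVMe controller. The class code alone
says nothing about why it fails to attach on 16-current.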

Yeah, so far this is the only report I've received, and there's not enough
data in it to reproduce it with any of the dozen NVMe drives that I have,
or to spot a difference from what I know I check in the code. So if nvme is
compiled into the kernel, with cam also compiled into the kernel, I know it
works.

Warner


> Bye,
> Alexander.
>
>> Also, the disk report needs full logs, with and without the settings,
>> that have the "uncorrectable" messages in them. I'd expect that a shorter
>> timeout would lead to different behavior, but maybe that error syndrome
>> isn't one I've seen. It would also be helpful to know which of the
>> timeouts changes the behavior...
>>
>> Warner
>>
>> On Sun, Dec 14, 2025, 5:06 AM Alexander Leidinger <Alexander@leidinger.net>
>> wrote:
>
>>> Hi Warner,
>>>
>>> I am trying to update a 15-current system (as of 2025-11-27-110715) to a
>>> recent 16-current (as of 2025-12-13-132815). It fails to import a pool
>>> due to a missing NVMe device. This system also has a broken HD, which I
>>> mention to be on the safe side.
>>>
>>> This is from 15-current:
>>> ---snip---
>>>          NAME                               STATE     READ WRITE CKSUM
>>>          rpool                              DEGRADED     0     0     0
>>>            mirror-0                         DEGRADED     0     0     0
>>>              diskid/DISK-WD-WCC4N4KLEZT7p3  ONLINE       0     0     0
>>>              diskid/DISK-WD-WCC4N1DF9DA2p3  ONLINE       0     0     0
>>>              diskid/DISK-WD-WX52D625R0NTp3  ONLINE       0     0     0
>>>              diskid/DISK-WD-WCC4N1PYJ3F8p3  OFFLINE      0     0     0
>>>          logs
>>>            diskid/DISK-493504058890547p1    ONLINE       0     0     0
>>>          cache
>>>            diskid/DISK-493504058890547p2    ONLINE       0     0     0
>>>
>>>          NAME                               STATE     READ WRITE CKSUM
>>>          space                              DEGRADED     0     0     0
>>>            raidz2-0                         DEGRADED     0     0     0
>>>              diskid/DISK-WD-WCC4N4KLEZT7p4  ONLINE       0     0     0
>>>              diskid/DISK-WD-WCC4N1DF9DA2p4  ONLINE       0     0     0
>>>              diskid/DISK-WD-WX52D625R0NTp4  ONLINE       0     0     0
>>>              diskid/DISK-WD-WX52D625R2TPp4  ONLINE       0     0     0
>>>              diskid/DISK-WD-WCC4N1PYJ3F8p4  OFFLINE      0     0     0
>>>          logs
>>>            diskid/DISK-S649NL0T819360Vp2    ONLINE       0     0     0
>>>          cache
>>>            diskid/DISK-S649NL0T819360Vp3    ONLINE       0     0     0
>>> ---snip---
>
>>> The partitions marked OFFLINE are on the same HD (the broken one). The
>>> DISK-S649NL0T819360V device, used as log and cache in the second pool,
>>> causes the issue on 16-current.
>>>
>>> On 16-current I get "uncorrectable parity/CRC error" messages on boot
>>> from the broken disk. I used this to get rid of those errors:
>>> ---snip---
>>> # grep kern.cam /tmp/be_mount.MhLw/boot/loader.conf
>>> kern.cam.tur_timeout="60"
>>> kern.cam.inquiry_timeout="60"
>>> kern.cam.modesense_timeout="60"
>>> ---snip---
>
>>> But the second pool ("space") fails to import. When I import it via
>>> "zpool import -m space", it shows me that the log and cache devices
>>> (different partitions on the same hardware) are not available.
>>> This is the device in question as seen from 15-current:
>>> ---snip---
>>> nda0: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
>>> nda0: Serial Number S649NL0T819360V
>>> [1] nda0: nvme version 1.4
>>> nda0: 953869MB (1953525168 512 byte sectors)
>>> [1] GEOM: new disk nda0
>>> ...
>>> [1] pass6 at nvme0 bus 0 scbus6 target 0 lun 1
>>> pass6: <Samsung SSD 980 1TB 2B4QFXO7 S649NL0T819360V>
>>> pass6: Serial Number S649NL0T819360V
>>> [1] pass6: nvme version 1.4
>>> ---snip---
>
>>> If you need information from the 15- or 16-current BE, what do you
>>> need?
>>>
>>> Bye,
>>> Alexander.
>>>
>>> --
>>> http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
>>> http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
>>>
>
> --
> http://www.Leidinger.net Alexander@Leidinger.net: PGP 0x8F31830F9F2772BF
> http://www.FreeBSD.org    netchild@FreeBSD.org  : PGP 0x8F31830F9F2772BF
>
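
The question in the quoted text about which of the settings changes the
behavior could be approached by booting with one tunable set at a time. What
follows is a minimal sketch under stated assumptions: the three tunable names
and the value 60 are copied from the report, emit_variants is a hypothetical
helper name, and the script only prints candidate loader.conf fragments. It
does not modify /boot/loader.conf, and the one-at-a-time procedure is a
suggestion, not something prescribed in the thread.

```shell
#!/bin/sh
# Sketch: print one loader.conf fragment per tunable so each can be tried alone.
# Tunable names and the value are taken from the report; nothing is written to disk.

tunables="kern.cam.tur_timeout kern.cam.inquiry_timeout kern.cam.modesense_timeout"
value=60   # as used in the report; the unit is whatever the kernel defines

emit_variants() {
    for t in $tunables; do
        echo "# variant: only $t set"
        echo "$t=\"$value\""
        echo
    done
}

emit_variants
```

Booting once per fragment and saving the full boot log each time, plus one
boot with none of the settings, would give the with-and-without comparison
asked for in the thread.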


Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfobDeZ6eZ7AgcDS7FFMzVhOU0KUFPYxwK6=s41nA=GB2Q>