Date:      Thu, 8 Jun 2023 00:24:55 -0600
From:      Warner Losh <imp@bsdimp.com>
To:        Rebecca Cran <rebecca@bsdio.com>
Cc:        FreeBSD CURRENT <freebsd-current@freebsd.org>
Subject:   Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
Message-ID:  <CANCZdfrdRN%2BzGkk=V9Sk=uoZYgtEkRx9G5MKaJQ9tPMARDwjEA@mail.gmail.com>
In-Reply-To: <5b52fc08-fb5a-900e-b98c-817a4ab79846@bsdio.com>
References:  <5b52fc08-fb5a-900e-b98c-817a4ab79846@bsdio.com>

On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran <rebecca@bsdio.com> wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-00007
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>
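
For reference, the fields in the quoted dmesg lines can be decoded
mechanically. The following is a minimal sketch, not driver code. It
assumes the six values after "cdw=" are CDW10 through CDW15 of the
command, and that the status halfword uses the NVMe completion-entry
layout (phase tag in bit 0, SC in bits 8:1, SCT in bits 11:9, CRD in
bits 13:12, M in bit 14, DNR in bit 15). The function names are made up
for illustration.

```python
def decode_status(status: int) -> dict:
    """Split a 16-bit NVMe completion status halfword into its fields."""
    return {
        "p": status & 0x1,            # phase tag
        "sc": (status >> 1) & 0xFF,   # status code
        "sct": (status >> 9) & 0x7,   # status code type
        "crd": (status >> 12) & 0x3,  # command retry delay
        "m": (status >> 14) & 0x1,    # more information available
        "dnr": (status >> 15) & 0x1,  # do not retry
    }


def decode_rw(cdw10: int, cdw11: int, cdw12: int) -> tuple:
    """Return (starting LBA, block count) for an NVMe READ/WRITE command."""
    lba = (cdw11 << 32) | cdw10       # 64-bit LBA split across two dwords
    nblocks = (cdw12 & 0xFFFF) + 1    # NLB is a zero-based count
    return lba, nblocks


# Values taken from the log: cdw=98085b90 0 7 ...
print(decode_rw(0x98085B90, 0x0, 0x7))
# (00/04) is SCT 0 (generic), SC 0x04 (data transfer error), dnr:0
print(decode_status(0x04 << 1))
```

With the values from the log, the first call gives LBA 2550684560 and a
length of 8 blocks, which matches the "lba:2550684560 len:8" line.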

PCIe 3 or PCIe 4?

So the only documented reason for this error is if we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But in the prior paragraph it talks
about other types of aborts that need software intervention. If this is
a transient error, then maybe we should retry it as part of the data
recovery, unless the Do Not Retry bit is set, which it isn't (dnr:0 in
the log above). I wonder whether this is retried 5 times or not before
generating the error...
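
The retry idea above could be sketched roughly like this. It is a
hypothetical policy written only to illustrate the reasoning, not what
the nvme or nda drivers actually do. The function name and the
retry_limit parameter are assumptions.

```python
def should_retry(sct: int, sc: int, dnr: int,
                 retries_done: int, retry_limit: int) -> bool:
    """Decide whether a failed NVMe command is worth resubmitting."""
    if sct == 0 and sc == 0:
        return False                   # command succeeded, nothing to retry
    if dnr:
        return False                   # controller says a retry will not help
    return retries_done < retry_limit  # otherwise retry until the budget is spent


# The error from the log: generic status type, SC 0x04, dnr:0.
print(should_retry(sct=0, sc=0x04, dnr=0, retries_done=0, retry_limit=5))
```

Under this policy the logged error would be retried, since DNR is clear
and the retry budget is not yet used up.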

Warner
