Date: Thu, 8 Jun 2023 00:24:55 -0600
From: Warner Losh <imp@bsdimp.com>
To: Rebecca Cran <rebecca@bsdio.com>
Cc: FreeBSD CURRENT <freebsd-current@freebsd.org>
Subject: Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
Message-ID: <CANCZdfrdRN%2BzGkk=V9Sk=uoZYgtEkRx9G5MKaJQ9tPMARDwjEA@mail.gmail.com>
In-Reply-To: <5b52fc08-fb5a-900e-b98c-817a4ab79846@bsdio.com>
References: <5b52fc08-fb5a-900e-b98c-817a4ab79846@bsdio.com>
On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran <rebecca@bsdio.com> wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-00007
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

So the only documented reason for this error is that we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But the prior paragraph talks about other
types of aborts that need software intervention. If this is a transient
error, then maybe we should retry it as part of the data recovery, unless
the Do Not Retry bit is set, which it isn't. I wonder whether this is
retried 5 times before generating the error...
Warner
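For reference, here is a minimal sketch of how the fields printed in the dmesg line (`(00/04) crd:0 m:0 dnr:0`) relate to the retry question raised above. It assumes the 16-bit completion status word layout from the NVMe base specification (phase tag in bit 0, SC in bits 8:1, SCT in bits 11:9, CRD in bits 13:12, More in bit 14, Do Not Retry in bit 15). The names `NvmeStatus`, `decode_status`, `may_retry`, and the `retry_limit` parameter are hypothetical, and the retry policy is only an illustration, not what the FreeBSD nvme or nda code actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NvmeStatus:
    """Fields of an NVMe completion status word (hypothetical helper type)."""
    sct: int    # Status Code Type (0x0 = generic command status)
    sc: int     # Status Code (0x04 = Data Transfer Error when sct is 0x0)
    crd: int    # Command Retry Delay index
    more: bool  # More information available
    dnr: bool   # Do Not Retry


def decode_status(status_word: int) -> NvmeStatus:
    """Split a 16-bit status word into its fields; bit 0 is the phase tag."""
    return NvmeStatus(
        sc=(status_word >> 1) & 0xFF,
        sct=(status_word >> 9) & 0x7,
        crd=(status_word >> 12) & 0x3,
        more=bool((status_word >> 14) & 0x1),
        dnr=bool((status_word >> 15) & 0x1),
    )


def may_retry(status: NvmeStatus, attempts: int, retry_limit: int) -> bool:
    """Illustrative policy: retry a failed command unless DNR is set
    or the (hypothetical) retry limit has been reached."""
    failed = status.sct != 0 or status.sc != 0
    return failed and not status.dnr and attempts < retry_limit


# The error from the report: SCT 0x0, SC 0x04, crd 0, m 0, dnr 0.
status = decode_status(0x04 << 1)
print(status, may_retry(status, attempts=0, retry_limit=3))
```

Under this policy the reported error would be eligible for a retry, because DNR is clear; whether and how often the driver really retries it is the open question in the message.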
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfrdRN%2BzGkk=V9Sk=uoZYgtEkRx9G5MKaJQ9tPMARDwjEA>