Date: Mon, 4 Nov 2024 18:14:43 -0700
From: Warner Losh <imp@bsdimp.com>
To: Dave Cottlehuber <dch@freebsd.org>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: nvme device errors & zfs
Message-ID: <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>
In-Reply-To: <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>
References: <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>
On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:

> What's the best way to see error counters or states on an nvme
> device?

Sadly, I think dmesg | grep nvme and/or trolling through
/var/log/messages. NVMe drives don't generally keep good counters of
errors...

> I have a typical mirrored nvme zpool that reported enough errors
> in a burst last week that 1 drive dropped off the bus [1].
>
> After a reboot, it resilvered, I cleared the errors, and it seems
> fine according to repeated scrubs and a few days of use.
>
> I was unable to see any errors from the nvme drive itself, but
> as it's (just) in warranty for 2 more weeks I'd like to know
> if I should return it.
>
> I installed ports `sysutils/nvme-cli` and didn't see anything
> of note there either:
>
> $ doas nvme smart-log /dev/nvme1
> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
> <--- 0 cid: 0 status 0
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 39 C
> available_spare                     : 100%
> available_spare_threshold           : 10%
> percentage_used                     : 3%
> data_units_read                     : 121681067
> data_units_written                  : 86619659
> host_read_commands                  : 695211450
> host_write_commands                 : 2187823697
> controller_busy_time                : 2554
> power_cycles                        : 48
> power_on_hours                      : 6342
> unsafe_shutdowns                    : 38
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 0
> Critical Composite Temperature Time : 0

This suggests that the only 'badness' is 38 unsafe shutdowns (likely
power failures), since either there were a bunch all at once (maybe
when installing) or you've had power-off events every week...

There have been no reported media errors (or the drive hasn't done a
good job of remembering them, though most NVMe drives are better than
most other storage at that).

> Temperature Sensor 1                : 39 C
> Temperature Sensor 2                : 43 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0

There's been no time where the drive overheated either. That's good.

> [1]: zpool status
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zroot         DEGRADED     0     0     0
>           mirror-0    DEGRADED     0     0     0
>             gpt/zfs0  ONLINE       0     0     0
>             gpt/zfs1  FAULTED      0     0     0  too many errors

I'm not sure how to reconcile this in the face of the above. I'd have
to see the dmesg / messages logs for any non-boot messages for nvme /
nda. For bad drives at work, I typically see something like:

/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
/var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms

for drives that just 'hang', which would cause ZFS to drop them out.
I'd see if there's new firmware, or return the drive.
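A quick way to sweep the live log plus the rotated ones for those
messages is something like this (a rough sketch; the paths assume the
stock newsyslog rotation to .bz2 files):

    # scan current and rotated logs for nvme controller resets/timeouts
    ( cat /var/log/messages; bzcat /var/log/messages.*.bz2 ) | \
        grep -E 'nvme[0-9]+: (Resetting controller|Waiting for reset|controller ready)'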
I also see:

nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
(nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
(nda8:nvme8:0:0:1): CAM status: NVME Status Error
(nda8:nvme8:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5

when there's a media error. But the brand of NVMe drives we buy
reports this as an error:

c029.for002.ix# nvmecontrol logpage -p 2 nvme8
SMART/Health Information Log
============================
Critical Warning State:         0x04
 Available spare:               0
 Temperature:                   0
 Device reliability:            1
 Read only:                     0
 Volatile memory backup:        0
[[... but this says the drive has lost data ]]
Power cycles:                   106
Power on hours:                 30250
Unsafe shutdowns:               19
Media errors:                   3
No. error info log entries:     3
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

so there are 3 media errors. I can read the log page to find the LBA
too (I'm working on enhancing the errors we report for NVMe to include
the LBA of the first error too, but that's not there yet).

But since you don't have any media errors, I'd check history to see if
the nvme drives are resetting (either successfully or not). But I don't
know how to get that data from just the drive logs.

Warner
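P.S. If a drive ever does report media errors, the per-error details
(including the LBA) live in the NVMe Error Information log, which is
log page 1. On FreeBSD this sketch should dump it (I haven't checked
how much your particular drive actually fills in there):

    # read the Error Information log page from the second controller
    nvmecontrol logpage -p 1 nvme1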