Date:      Tue, 05 Nov 2024 09:14:59 +0000
From:      "Dave Cottlehuber" <dch@freebsd.org>
To:        "Warner Losh" <imp@bsdimp.com>
Cc:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: nvme device errors & zfs
Message-ID:  <ad8551cc-a595-454b-8645-89a16f60ab0f@app.fastmail.com>
In-Reply-To:  <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>
References:  <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com> <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>

On Tue, 5 Nov 2024, at 01:14, Warner Losh wrote:
> On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:
>> What's the best way to see error counters or states on an nvme
>> device?
>
> Sadly, I think dmesg | grep nvme and/or trolling through /var/log/messages.
> Nvme drives don't generally keep good counters of errors...

Thanks Warner,

Good to know. brb, writing all my critical data onto stone tablets
and vellum... mirrored vdevs from different batches & vendors...
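
For the archive: Warner's "grep and troll /var/log/messages" translates
to something like this here (untested; assumes the default newsyslog
rotation to .bz2):

$ grep -E 'nvme[0-9]+: (Resetting controller|controller ready)' /var/log/messages
$ bzgrep -E 'nvme[0-9]+: (Resetting controller|controller ready)' /var/log/messages.*.bz2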

These are Samsung 990 PROs, mainly chosen for low price at the time:

nda0: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12665P>
nda1: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12664X>

https://gist.github.com/dch/6523082557066f7d95af82000be74a8c is a
more readable version of the logs below.

I forgot to include the dmesg from just prior:

Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Resetting controller due to a timeout.
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: event="start"
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
... repeated x400

Then the vdev is yanked by ZFS:
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30294]: vdev I/O failure, zpool=zroot path=/dev/gpt/zfs1 offset=270336 size=8192 error=6
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30305]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30309]: vdev is removed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30313]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: controller ready did not become 0 within 20500 ms
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: event="timed_out"
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: READ sqid:3 cid:126 nsid:1 lba:667089368 len:2048
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:126 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:121 nsid:1 lba:2533579032 len:256
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:127 nsid:1 lba:2533579456 len:256
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=27c2f9d8 0 7ff 0 0 0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:127 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): CAM status: NVME Status Error
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): Error 5, Retries exhausted
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o

This also repeated a few times.

>> I have a typical mirrored nvme zpool, that reported enough errors
>> in a burst last week, that 1 drive dropped off the bus [1].
>>
>> After a reboot, it resilvered, I cleared the errors, and it seems
>> fine according to repeated scrubs and a few days of use.
>>
>> I was unable to see any errors from the nvme drive itself, but
>> as it's (just) in warranty for 2 more weeks I'd like to know
>> if I should return it.
>>
>> I installed ports `sysutils/nvme-cli` and didn't see anything
>> of note there either:
>>
>> $ doas nvme smart-log /dev/nvme1

[snip]

`nvmecontrol logpage -p 2 nvme1` returns the same info as the ports
nvme-cli tool, but it's in base already.

I will add this example to the manpage, as I tried logpage 0 & 1
without getting any info back.
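
If I'm reading the NVMe spec right (my assumption, not gospel), page 0
is reserved, which would explain one empty result, and page 1 is the
error-information log, which will also be empty here while
num_err_log_entries is 0. So the manpage example would be roughly:

$ nvmecontrol logpage -p 1 nvme1   # error information log
$ nvmecontrol logpage -p 2 nvme1   # SMART / health information
$ nvmecontrol logpage -p 3 nvme1   # firmware slot information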

>> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
>> <--- 0 cid: 0 status 0
>> Smart Log for NVME device:nvme1 namespace-id:ffffffff
>> critical_warning                    : 0
>> temperature                         : 39 C
>> available_spare                     : 100%
>> available_spare_threshold           : 10%
>> percentage_used                     : 3%
>> data_units_read                     : 121681067
>> data_units_written                  : 86619659
>> host_read_commands                  : 695211450
>> host_write_commands                 : 2187823697
>> controller_busy_time                : 2554
>> power_cycles                        : 48
>> power_on_hours                      : 6342
>> unsafe_shutdowns                    : 38
>> media_errors                        : 0
>> num_err_log_entries                 : 0
>> Warning Temperature Time            : 0
>> Critical Composite Temperature Time : 0
>
> This suggests that the only 'badness' is 38 unsafe shutdowns (likely
> power failures), since either there were a bunch all at once (maybe
> when installing) or you've had power off events every week...

Could those be panics? This box typically runs for a fortnight and then
gets CURRENT bumped. It also feels like more panics than I'd expect;
I don't know if these counters were zero at purchase, though.

> There's been no reported media errors (or the drive hasn't done a good
> job of remembering them, though most NVME is better than most for that).
>
>> Temperature Sensor 1                : 39 C
>> Temperature Sensor 2                : 43 C
>> Thermal Management T1 Trans Count   : 0
>> Thermal Management T2 Trans Count   : 0
>> Thermal Management T1 Total Time    : 0
>> Thermal Management T2 Total Time    : 0
>
> There's been no time where the drive overheated either. That's good.
>
>> [1]: zpool status
>> status: One or more devices are faulted in response to persistent errors.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>         repaired.
>>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
>> config:
>>
>>         NAME          STATE     READ WRITE CKSUM
>>         zroot         DEGRADED     0     0     0
>>           mirror-0    DEGRADED     0     0     0
>>             gpt/zfs0  ONLINE       0     0     0
>>             gpt/zfs1  FAULTED      0     0     0  too many errors
>
> I'm not sure how to reconcile this in the face of the above. I'd have
> to see the dmesg / messages logs for any non-boot messages for nvme /
> nda. For bad drives at work, I typically see something like:
>
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
> /var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms
>
> for drives that just 'hang' which would cause ZFS to drop them out.
> I'd see if there's new firmware or return the drive.

I will give this a go if the issue returns; it's the only one I've had
in a year so far.
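
In the meantime, checking the current firmware from base should be
something like this (untested; I believe the 0B2QJXG7 in the probe
lines above is already the firmware revision):

$ nvmecontrol identify nvme1 | grep -i firmware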

> I also see:
> nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
> nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
> nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
> nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
> (nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
> (nda8:nvme8:0:0:1): CAM status: NVME Status Error
> (nda8:nvme8:0:0:1): Error 5, Retries exhausted
> g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5
>
> when there's a media error. But the brand of NVMe drives we buy report
> this as an error:
>
> c029.for002.ix# nvmecontrol logpage -p 2 nvme8
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x04
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            1
>  Read only:                     0
>  Volatile memory backup:        0
> [[... but this says the drive has lost data ]]
> Power cycles:                   106
> Power on hours:                 30250
> Unsafe shutdowns:               19
> Media errors:                   3
> No. error info log entries:     3
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 0
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   0
> Total Time For Temperature 2:   0
>
> so there's 3 media errors. I can read the log page to find the LBA too
> (I'm working on enhancing the errors we report for NVMe to include LBA
> of first error too, but that's not there yet).
>
> But since you don't have any media errors, I'd check history to see if
> the nvme drives are resetting (either successfully or not). But I don't
> know how to get that data from just the drive logs.
>
> Warner

There's nothing at all prior to last week, and I assume the unsafe
shutdowns are unrelated panics while running CURRENT for a year.

I'll cross my fingers and hope this is a transient bus problem.

thanks!
Dave
———
O for a muse of fire, that would ascend the brightest heaven of invention!


