Date:      Mon, 04 Nov 2024 17:28:42 +0000
From:      "Dave Cottlehuber" <dch@FreeBSD.org>
To:        freebsd-fs <freebsd-fs@freebsd.org>
Subject:   nvme device errors & zfs
Message-ID:  <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>

What's the best way to see error counters or states on an nvme
device?

I have a typical mirrored nvme zpool that reported enough errors
in a burst last week that one drive dropped off the bus [1].

After a reboot, it resilvered, I cleared the errors, and it seems
fine according to repeated scrubs and a few days of use.

I was unable to see any errors from the nvme drive itself, but
as it's (just) in warranty for 2 more weeks I'd like to know
if I should return it.

I installed the `sysutils/nvme-cli` port and didn't see anything
of note there either:

$ doas nvme smart-log /dev/nvme1
0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
          : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
          : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
<--- 0 cid: 0 status 0
Smart Log for NVME device:nvme1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 39 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 3%
data_units_read                     : 121681067
data_units_written                  : 86619659
host_read_commands                  : 695211450
host_write_commands                 : 2187823697
controller_busy_time                : 2554
power_cycles                        : 48
power_on_hours                      : 6342
unsafe_shutdowns                    : 38
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 39 C
Temperature Sensor 2                : 43 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0
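
For anyone who wants to check this output mechanically, here is a minimal Python sketch, assuming the `name : value` text layout shown above. The function names and the choice of watched fields are my own for illustration, not part of nvme-cli, and a zero in all three fields does not prove the drive is healthy.

```python
import re

# Counters that normally stay at zero on a healthy drive.
# This selection is an assumption for the sketch, not an official list.
WATCHED = ("critical_warning", "media_errors", "num_err_log_entries")


def parse_smart_log(text):
    """Return {field: int} for every 'name : number' line in the output.

    Lines that do not start with a letter (the command dump at the top)
    and lines whose value is not numeric are skipped.
    """
    fields = {}
    for line in text.splitlines():
        m = re.match(r"^\s*([A-Za-z_][\w ]*?)\s*:\s*(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def nonzero_watched(fields):
    """Return the watched counters that are present and non-zero."""
    return {k: fields[k] for k in WATCHED if fields.get(k, 0) != 0}


if __name__ == "__main__":
    import sys

    # Reads smart-log text from stdin and prints any non-zero watched counters.
    print(nonzero_watched(parse_smart_log(sys.stdin.read())))
```

It can be fed by piping the output of `doas nvme smart-log /dev/nvme1` into the script. For the output posted above it reports nothing, since all three watched counters are 0.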
[1]: zpool status
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
config:

        NAME          STATE     READ WRITE CKSUM
        zroot         DEGRADED     0     0     0
          mirror-0    DEGRADED     0     0     0
            gpt/zfs0  ONLINE       0     0     0
            gpt/zfs1  FAULTED      0     0     0  too many errors

A+
Dave
---
O for a muse of fire, that would ascend the brightest heaven of invention!


