Date: Tue, 05 Nov 2024 09:14:59 +0000
From: "Dave Cottlehuber" <dch@freebsd.org>
To: "Warner Losh" <imp@bsdimp.com>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: nvme device errors & zfs
Message-ID: <ad8551cc-a595-454b-8645-89a16f60ab0f@app.fastmail.com>
In-Reply-To: <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>
References: <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>
 <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>
On Tue, 5 Nov 2024, at 01:14, Warner Losh wrote:
> On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:
>> What's the best way to see error counters or states on an nvme
>> device?
>
> Sadly, I think dmesg | grep nvme and/or trolling through /var/log/messages.
> Nvme drives don't generally keep good counters of errors...

Thanks Warner, good to know. brb, writing all my critical data onto stone
tablets and vellum... mirrored vdevs from different batches & vendors are
next on the list. These are Samsung 990 PROs, mainly chosen for low price
at the time:

nda0: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12665P>
nda1: <Samsung SSD 990 PRO 2TB 0B2QJXG7 S7DNNJ0WC12664X>

https://gist.github.com/dch/6523082557066f7d95af82000be74a8c is a more
readable version of the logs below. I forgot to include the earlier dmesg
output:

Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Resetting controller due to a timeout.
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: event="start"
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
Oct 31 16:11:05 wintermute kernel[9406]: nvme1: Waiting for reset to complete
... repeated x400

then the vdev is yanked by zfs:

Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30294]: vdev I/O failure, zpool=zroot path=/dev/gpt/zfs1 offset=270336 size=8192 error=6
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30305]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30309]: vdev is removed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute.skunkwerks.at ZFS[30313]: vdev state changed, pool_guid=16468628842577781486 vdev_guid=1873129772219000696
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: Waiting for reset to complete
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: controller ready did not become 0 within 20500 ms
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: event="timed_out"
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: READ sqid:3 cid:126 nsid:1 lba:667089368 len:2048
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:126 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:121 nsid:1 lba:2533579032 len:256
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:121 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: WRITE sqid:3 cid:127 nsid:1 lba:2533579456 len:256
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=27c2f9d8 0 7ff 0 0 0
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: ABORTED - BY REQUEST (00/07) crd:0 m:0 dnr:1 p:0 sqid:3 cid:127 cdw0:0
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): CAM status: NVME Status Error
Oct 31 16:11:26 wintermute kernel[9426]: (nda1:nvme1:0:0:1): Error 5, Retries exhausted
Oct 31 16:11:26 wintermute kernel[9426]: nvme1: failing outstanding i/o

this also repeated a few times.
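For anyone else chasing the same thing, a sketch of pulling these events
out of the logs from base tools only, assuming the stock syslog layout
where rotated messages files are bzip2 compressed, as in Warner's example:

  # kernel message buffer for the current boot
  $ dmesg | grep nvme1
  # live syslog file, then the rotated/compressed ones
  $ grep nvme1 /var/log/messages
  $ bzcat /var/log/messages.*.bz2 | grep nvme1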
>> I have a typical mirrored nvme zpool, that reported enough errors
>> in a burst last week, that 1 drive dropped off the bus [1].
>>
>> After a reboot, it resilvered, I cleared the errors, and it seems
>> fine according to repeated scrubs and a few days of use.
>>
>> I was unable to see any errors from the nvme drive itself, but
>> as it's (just) in warranty for 2 more weeks I'd like to know
>> if I should return it.
>>
>> I installed ports `sysutils/nvme-cli` and didn't see anything
>> of note there either:
>>
>> $ doas nvme smart-log /dev/nvme1

[snip]

`nvmecontrol logpage -p 2 nvme1` returns the same info as the ports
nvme-cli tool, but it's in base already. I will add this example to the
manpage, as I tried logpage 0 & 1 without getting any info back.

>> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
>> <--- 0 cid: 0 status 0
>> Smart Log for NVME device:nvme1 namespace-id:ffffffff
>> critical_warning                    : 0
>> temperature                         : 39 C
>> available_spare                     : 100%
>> available_spare_threshold           : 10%
>> percentage_used                     : 3%
>> data_units_read                     : 121681067
>> data_units_written                  : 86619659
>> host_read_commands                  : 695211450
>> host_write_commands                 : 2187823697
>> controller_busy_time                : 2554
>> power_cycles                        : 48
>> power_on_hours                      : 6342
>> unsafe_shutdowns                    : 38
>> media_errors                        : 0
>> num_err_log_entries                 : 0
>> Warning Temperature Time            : 0
>> Critical Composite Temperature Time : 0
>
> This suggests that the only 'badness' is 38 unsafe shutdowns (likely
> power failures), since either there were a bunch all at once (maybe
> when installing) or you've had power off events every week...

Could those be panics? This box typically runs for a fortnight and then
gets CURRENT bumped. That still feels like more panics than I'd expect,
though I don't know whether these counters were zero on purchase.

> There's been no reported media errors (or the drive hasn't done a good
> job of remembering them, though most NVME is better than most for that).
>
>> Temperature Sensor 1                : 39 C
>> Temperature Sensor 2                : 43 C
>> Thermal Management T1 Trans Count   : 0
>> Thermal Management T2 Trans Count   : 0
>> Thermal Management T1 Total Time    : 0
>> Thermal Management T2 Total Time    : 0
>
> There's been no time where the drive overheated either. That's good.
>
>> [1]: zpool status
>> status: One or more devices are faulted in response to persistent errors.
>>         Sufficient replicas exist for the pool to continue functioning in a
>>         degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the device
>>         repaired.
>>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
>> config:
>>
>>         NAME          STATE     READ WRITE CKSUM
>>         zroot         DEGRADED     0     0     0
>>           mirror-0    DEGRADED     0     0     0
>>             gpt/zfs0  ONLINE       0     0     0
>>             gpt/zfs1  FAULTED      0     0     0  too many errors
>
> I'm not sure how to reconcile this in the face of the above. I'd have to
> see the dmesg / messages logs for any non-boot messages for nvme / nda.
> For bad drives at work, I typically see something like:
>
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
> /var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
> /var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms
>
> for drives that just 'hang', which would cause ZFS to drop them out. I'd
> see if there's new firmware or return the drive.
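For reference, the running firmware revision is also visible from base;
roughly this, a sketch only tried on my own box, device name as above:

  # controller identify data includes the firmware revision
  $ doas nvmecontrol identify nvme1 | grep -i firmware
  # log page 3 is the Firmware Slot Information page
  $ doas nvmecontrol logpage -p 3 nvme1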
I will give this a go if the issue returns; it's the only incident I've
had in a year so far.

> I also see:
> nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
> nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
> nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
> nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
> (nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
> (nda8:nvme8:0:0:1): CAM status: NVME Status Error
> (nda8:nvme8:0:0:1): Error 5, Retries exhausted
> g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5
>
> when there's a media error. But the brand of NVMe drives we buy report
> this as an error:
>
> c029.for002.ix# nvmecontrol logpage -p 2 nvme8
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x04
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            1
>  Read only:                     0
>  Volatile memory backup:        0
> [[... but this says the drive has lost data ]]
> Power cycles:                   106
> Power on hours:                 30250
> Unsafe shutdowns:               19
> Media errors:                   3
> No. error info log entries:     3
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 0
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   0
> Total Time For Temperature 2:   0
>
> so there's 3 media errors. I can read the log page to find the LBA too
> (I'm working on enhancing the errors we report for NVMe to include the
> LBA of the first error too, but that's not there yet).
>
> But since you don't have any media errors, I'd check history to see if
> the nvme drives are resetting (either successfully or not). But I don't
> know how to get that data from just the drive logs.
>
> Warner

There's nothing at all in the logs prior to last week, and I assume the
unsafe shutdowns are from unrelated panics while running CURRENT for a
year.

I'll cross my fingers and hope this was a transient bus problem.

thanks!
Dave
———
O for a muse of fire, that would ascend the brightest heaven of invention!
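P.S. for anyone finding this in the archives: the per-command error log
Warner mentions is NVMe log page 1 (Error Information), alongside page 2
(SMART/Health), so the rough incantations from base are:

  # error information log (came back empty for me on the 990 PRO, as above)
  $ doas nvmecontrol logpage -p 1 nvme1
  # SMART / health information log
  $ doas nvmecontrol logpage -p 2 nvme1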