Date: Mon, 4 Nov 2024 18:14:43 -0700
From: Warner Losh <imp@bsdimp.com>
To: Dave Cottlehuber <dch@freebsd.org>
Cc: freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: nvme device errors & zfs
Message-ID: <CANCZdfpPmVtt0wMWAYzhq4R0nkt39dg3S2-zVCCQcw+TSugkEg@mail.gmail.com>
In-Reply-To: <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>
References: <3293802b-3785-4715-8a6b-0802afb6f908@app.fastmail.com>
On Mon, Nov 4, 2024 at 10:31 AM Dave Cottlehuber <dch@freebsd.org> wrote:

> What's the best way to see error counters or states on an nvme
> device?

Sadly, I think dmesg | grep nvme and/or trolling through
/var/log/messages. NVMe drives don't generally keep good counters of
errors...

> I have a typical mirrored nvme zpool that reported enough errors
> in a burst last week that 1 drive dropped off the bus [1].
>
> After a reboot, it resilvered, I cleared the errors, and it seems
> fine according to repeated scrubs and a few days of use.
>
> I was unable to see any errors from the nvme drive itself, but
> as it's (just) in warranty for 2 more weeks I'd like to know
> if I should return it.
>
> I installed ports `sysutils/nvme-cli` and didn't see anything
> of note there either:
>
> $ doas nvme smart-log /dev/nvme1
> 0xc0484e41: opc: 0x2 fuse: 0 cid 0 nsid:0xffffffff cmd2: 0 cmd3: 0
>           : cdw10: 0x7f0002 cdw11: 0 cdw12: 0 cdw13: 0
>           : cdw14: 0 cdw15: 0 len: 0x200 is_read: 0
> <--- 0 cid: 0 status 0
> Smart Log for NVME device:nvme1 namespace-id:ffffffff
> critical_warning                    : 0
> temperature                         : 39 C
> available_spare                     : 100%
> available_spare_threshold           : 10%
> percentage_used                     : 3%
> data_units_read                     : 121681067
> data_units_written                  : 86619659
> host_read_commands                  : 695211450
> host_write_commands                 : 2187823697
> controller_busy_time                : 2554
> power_cycles                        : 48
> power_on_hours                      : 6342
> unsafe_shutdowns                    : 38
> media_errors                        : 0
> num_err_log_entries                 : 0
> Warning Temperature Time            : 0
> Critical Composite Temperature Time : 0

This suggests that the only 'badness' is 38 unsafe shutdowns (likely
power failures), since either there were a bunch all at once (maybe
when installing) or you've had power-off events every week...

There have been no reported media errors (or the drive hasn't done a
good job of remembering them, though most NVMe drives are better than
most other storage at that).

> Temperature Sensor 1                : 39 C
> Temperature Sensor 2                : 43 C
> Thermal Management T1 Trans Count   : 0
> Thermal Management T2 Trans Count   : 0
> Thermal Management T1 Total Time    : 0
> Thermal Management T2 Total Time    : 0

There's been no time where the drive overheated either. That's good.

> [1]: zpool status
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>   scan: scrub repaired 0B in 00:17:59 with 0 errors on Thu Oct 31 16:24:36 2024
> config:
>
>         NAME          STATE     READ WRITE CKSUM
>         zroot         DEGRADED     0     0     0
>           mirror-0    DEGRADED     0     0     0
>             gpt/zfs0  ONLINE       0     0     0
>             gpt/zfs1  FAULTED      0     0     0  too many errors

I'm not sure how to reconcile this in the face of the above. I'd have
to see the dmesg / messages logs for any non-boot messages for nvme /
nda. For bad drives at work, I typically see something like:

/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Resetting controller due to a timeout.
/var/log/messages.0.bz2:Nov  3 02:48:54 c001 kernel: nvme2: Waiting for reset to complete
/var/log/messages.0.bz2:Nov  3 02:49:05 c001 kernel: nvme2: controller ready did not become 0 within 10500 ms

for drives that just 'hang', which would cause ZFS to drop them out.
I'd see if there's new firmware, or return the drive.
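A quick way to sweep the live log plus the rotated ones for those
messages is something like this (a rough sketch; the paths assume the
stock newsyslog rotation to .bz2 files):

    # scan current and rotated logs for nvme controller resets/timeouts
    ( cat /var/log/messages; bzcat /var/log/messages.*.bz2 ) | \
        grep -E 'nvme[0-9]+: (Resetting controller|Waiting for reset|controller ready)'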
I also see:

nvme8: READ sqid:3 cid:117 nsid:1 lba:1875786352 len:1024
nvme8: nsid:0x1 rsvd2:0 rsvd3:0 mptr:0 prp1:0x40defd000 prp2:0x1395a2400
nvme8: cdw10: 0x6fce3a70 cdw11:0 cdw12:0x3ff cdw13:0 cdw14:0 cdw15:0
nvme8: UNRECOVERED READ ERROR (02/81) crd:0 m:1 dnr:1 p:1 sqid:3 cid:117 cdw0:0
(nda8:nvme8:0:0:1): READ. NCB: opc=2 fuse=0 nsid=1 prp1=0 prp2=0 cdw=6fce3a70 0 3ff 0 0 0
(nda8:nvme8:0:0:1): CAM status: NVME Status Error
(nda8:nvme8:0:0:1): Error 5, Retries exhausted
g_vfs_done():nda8p8[READ(offset=960402063360, length=1048576)]error = 5

when there's a media error. But the brand of NVMe drives we buy
reports this as an error:

c029.for002.ix# nvmecontrol logpage -p 2 nvme8
SMART/Health Information Log
============================
Critical Warning State:         0x04
 Available spare:               0
 Temperature:                   0
 Device reliability:            1
 Read only:                     0
 Volatile memory backup:        0
[[... but this says the drive has lost data ]]
Power cycles:                   106
Power on hours:                 30250
Unsafe shutdowns:               19
Media errors:                   3
No. error info log entries:     3
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

so there are 3 media errors. I can read the log page to find the LBA
too (I'm working on enhancing the errors we report for NVMe to include
the LBA of the first error too, but that's not there yet).

But since you don't have any media errors, I'd check history to see if
the nvme drives are resetting (either successfully or not). But I don't
know how to get that data from just the drive logs.

Warner
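P.S. If a drive ever does report media errors, the per-error details
(including the LBA) live in the NVMe Error Information log, which is
log page 1. On FreeBSD this sketch should dump it (I haven't checked
how much your particular drive actually fills in there):

    # read the Error Information log page from the second controller
    nvmecontrol logpage -p 1 nvme1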