Date: Mon, 12 Feb 2024 18:56:34 -0800
From: Maxim Sobolev <sobomax@freebsd.org>
To: Don Lewis <truckman@freebsd.org>
Cc: FreeBSD current <freebsd-current@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject: Re: nvme controller reset failures on recent -CURRENT
Message-ID: <CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw@mail.gmail.com>
In-Reply-To: <tkrat.edddc2469f43baf6@FreeBSD.org>
References: <tkrat.edddc2469f43baf6@FreeBSD.org>

It might be overheating. Today's NVMe drives are notoriously flaky if you
run them without a proper heat sink attached.

-Max

On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:

> I just upgraded my package build machine to:
>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> from:
>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> and I've had two nvme-triggered panics in the last day.
>
> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> dump, probably because the nvme device has gone away and I get an error
> about not having a dump device.  It looks like a low-memory panic
> because free memory is low and zfs is calling malloc().
>
> This shows up in the log leading up to the panic:
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>
> The device looks healthy to me:
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x00
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            0
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    312 K, 38.85 C, 101.93 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                3
> Data units (512,000 byte) read: 5761183
> Data units written:             29911502
> Host read commands:             471921188
> Host write commands:            605394753
> Controller busy time (minutes): 32359
> Power cycles:                   110
> Power on hours:                 19297
> Unsafe shutdowns:               14
> Media errors:                   0
> No. error info log entries:     0
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 5231
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   41213
> Total Time For Temperature 2:   0
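
For anyone who wants to weigh the overheating suggestion against the quoted SMART/Health log, here is a minimal Python sketch that redoes the arithmetic behind a few of its fields. The numbers are copied from the log above. The function names and the "any non-zero over-temperature time" check are illustrative assumptions, not part of nvmecontrol or any FreeBSD tool, and the log is only a snapshot taken after the fact, so it cannot rule out a brief thermal event.

```python
# Values copied from the SMART/Health Information Log quoted in the message.
TEMPERATURE_K = 312
DATA_UNITS_READ = 5761183
DATA_UNITS_WRITTEN = 29911502
WARNING_TEMP_COMPOSITE_TIME = 0
ERROR_TEMP_COMPOSITE_TIME = 0

# The log labels one data unit as 512,000 bytes.
BYTES_PER_DATA_UNIT = 512_000


def kelvin_to_celsius(kelvin):
    """Convert a temperature in kelvin to degrees Celsius."""
    return kelvin - 273.15


def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature in kelvin to degrees Fahrenheit."""
    return kelvin_to_celsius(kelvin) * 9 / 5 + 32


def data_units_to_terabytes(units):
    """Convert a data-unit counter to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def over_temperature_recorded(warning_time, error_time):
    """Hypothetical check: has the drive logged any time above its
    warning or critical composite temperature thresholds?"""
    return warning_time > 0 or error_time > 0


if __name__ == "__main__":
    print(f"Temperature: {kelvin_to_celsius(TEMPERATURE_K):.2f} C, "
          f"{kelvin_to_fahrenheit(TEMPERATURE_K):.2f} F")
    print(f"Read:    {data_units_to_terabytes(DATA_UNITS_READ):.2f} TB")
    print(f"Written: {data_units_to_terabytes(DATA_UNITS_WRITTEN):.2f} TB")
    print("Over-temperature time recorded:",
          over_temperature_recorded(WARNING_TEMP_COMPOSITE_TIME,
                                    ERROR_TEMP_COMPOSITE_TIME))
```

With the values from the log this reproduces the reported 38.85 C and 101.93 F, gives roughly 2.95 TB read and 15.31 TB written, and reports no recorded over-temperature time. The Temperature 1 transition counters in the log are non-zero, but they are left out of the check because their interpretation is not stated in the thread.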
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw>