Date:      Mon, 12 Feb 2024 18:56:34 -0800
From:      Maxim Sobolev <sobomax@freebsd.org>
To:        Don Lewis <truckman@freebsd.org>
Cc:        FreeBSD current <freebsd-current@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: nvme controller reset failures on recent -CURRENT
Message-ID:  <CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw@mail.gmail.com>
In-Reply-To: <tkrat.edddc2469f43baf6@FreeBSD.org>
References:  <tkrat.edddc2469f43baf6@FreeBSD.org>


Might be overheating. Today's NVMe drives are notoriously flaky if you
run them without a proper heat sink attached.

-Max



On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:

> I just upgraded my package build machine to:
>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> from:
>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> and I've had two nvme-triggered panics in the last day.
>
> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> dump, probably because the nvme device has gone away and I get an error
> about not having a dump device.  It looks like a low-memory panic
> because free memory is low and zfs is calling malloc().
>
> This shows up in the log leading up to the panic:
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>
> The device looks healthy to me:
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x00
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            0
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    312 K, 38.85 C, 101.93 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                3
> Data units (512,000 byte) read: 5761183
> Data units written:             29911502
> Host read commands:             471921188
> Host write commands:            605394753
> Controller busy time (minutes): 32359
> Power cycles:                   110
> Power on hours:                 19297
> Unsafe shutdowns:               14
> Media errors:                   0
> No. error info log entries:     0
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 5231
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   41213
> Total Time For Temperature 2:   0
>
>
>
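
One way to check the overheating theory against the quoted SMART log is to read its thermal fields directly. The following is a minimal sketch, not part of the original exchange: it assumes the log text has been saved as a string in the layout shown above, and the helper names (`parse_fields`, `kelvin_from`) are hypothetical. It also assumes the two "Temp Composite Time" counters are in minutes, as the NVMe specification defines them.

```python
import re

# Thermal fields copied from the SMART/Health log quoted above.
SMART_LOG = """\
Temperature:                    312 K, 38.85 C, 101.93 F
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 5231
Temperature 2 Transition Count: 0
"""


def parse_fields(text):
    """Split 'name: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def kelvin_from(temperature_field):
    """Extract the integer Kelvin reading from e.g. '312 K, 38.85 C, 101.93 F'."""
    match = re.match(r"(\d+)\s*K", temperature_field)
    if match is None:
        raise ValueError(f"no Kelvin value in {temperature_field!r}")
    return int(match.group(1))


fields = parse_fields(SMART_LOG)
kelvin = kelvin_from(fields["Temperature"])
celsius = kelvin - 273.15
fahrenheit = celsius * 9 / 5 + 32

# Assumed unit: minutes spent at or above the drive's warning/critical thresholds.
minutes_over_warning = int(fields["Warning Temp Composite Time"])
minutes_over_critical = int(fields["Error Temp Composite Time"])

print(f"{kelvin} K = {celsius:.2f} C = {fahrenheit:.2f} F")
print(f"minutes above warning threshold: {minutes_over_warning}")
print(f"minutes above critical threshold: {minutes_over_critical}")
```

With the values quoted above, both composite-time counters are zero, meaning the drive has not logged any time at or above its own warning or critical temperature thresholds. The temperature line is only the reading at the moment the log was taken, so it says nothing by itself about the state of the drive when the resets occurred.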




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw>