Date:      Mon, 12 Feb 2024 18:56:34 -0800
From:      Maxim Sobolev <sobomax@freebsd.org>
To:        Don Lewis <truckman@freebsd.org>
Cc:        FreeBSD current <freebsd-current@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: nvme controller reset failures on recent -CURRENT
Message-ID:  <CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw@mail.gmail.com>
In-Reply-To: <tkrat.edddc2469f43baf6@FreeBSD.org>
References:  <tkrat.edddc2469f43baf6@FreeBSD.org>


[-- Attachment #1 --]
Might be overheating. Today's NVMe drives are notoriously flaky if you
run them without a proper heat sink attached.

-Max



On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:

> I just upgraded my package build machine to:
>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> from:
>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> and I've had two nvme-triggered panics in the last day.
>
> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> dump, probably because the nvme device has gone away and I get an error
> about not having a dump device.  It looks like a low-memory panic
> because free memory is low and zfs is calling malloc().
>
> This shows up in the log leading up to the panic:
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>
> The device looks healthy to me:
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x00
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            0
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    312 K, 38.85 C, 101.93 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                3
> Data units (512,000 byte) read: 5761183
> Data units written:             29911502
> Host read commands:             471921188
> Host write commands:            605394753
> Controller busy time (minutes): 32359
> Power cycles:                   110
> Power on hours:                 19297
> Unsafe shutdowns:               14
> Media errors:                   0
> No. error info log entries:     0
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 5231
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   41213
> Total Time For Temperature 2:   0
>
>
>

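For anyone wanting to sanity-check the raw numbers in the SMART log quoted
above, here is a minimal Python sketch of the unit conversions. The function
names are made up for illustration and are not part of nvmecontrol or any
library; the only inputs are values copied from the quoted log, and the
512,000-byte data unit is taken from the log's own column label.

```python
# Hypothetical helpers for reading the quoted SMART/Health log by hand.
# Names are illustrative only; values come from the log in this thread.

DATA_UNIT_BYTES = 512_000  # per the "(512,000 byte)" label in the log


def kelvin_to_celsius(kelvin: float) -> float:
    """Convert the reported temperature from Kelvin to Celsius."""
    return kelvin - 273.15


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert Celsius to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def data_units_to_bytes(units: int) -> int:
    """Convert a 'Data units' counter to bytes."""
    return units * DATA_UNIT_BYTES


if __name__ == "__main__":
    temp_c = kelvin_to_celsius(312)
    print(f"Temperature: {temp_c:.2f} C, {celsius_to_fahrenheit(temp_c):.2f} F")
    print(f"Read:    {data_units_to_bytes(5761183) / 1e12:.2f} TB")
    print(f"Written: {data_units_to_bytes(29911502) / 1e12:.2f} TB")
```

Note that this only reflects the temperature at the moment the log page was
read, so it neither confirms nor rules out overheating under load.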