FreeBSD Mail Archives

Date:      Mon, 12 Feb 2024 20:15:31 -0800 (PST)
From:      Don Lewis <truckman@FreeBSD.org>
To:        Maxim Sobolev <sobomax@freebsd.org>
Cc:        FreeBSD current <freebsd-current@freebsd.org>,  John Baldwin <jhb@freebsd.org>
Subject:   Re: nvme controller reset failures on recent -CURRENT
Message-ID:  <tkrat.76b39844cd6da514@FreeBSD.org>
References:  <tkrat.edddc2469f43baf6@FreeBSD.org> <CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw@mail.gmail.com>



On 12 Feb, Maxim Sobolev wrote:
> Might be overheating. Today's nvme drives are notoriously flaky if you
> run them without a proper heat sink attached.

I don't think it is a thermal problem.  According to the drive health
page, the device temperature has never reached Temperature 2, whatever
that is.  The room temperature is around 65F.  The system was stable
last summer when the room temperature spent a lot of time in the 80-85F
range.  The device temperature depends a lot on the I/O rate, and the
last panic happened when the I/O rate had been below 40 tps for quite a
while.
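
A minimal sketch of the check described above, assuming the health page
has been saved as text (for example from `nvmecontrol logpage -p 2 nvme0`).
The helper names `parse_health` and `thermal_summary` are hypothetical and
not part of any tool; the sample text is an excerpt of the output quoted
below. It reads the counters that should stay at zero if the drive never
reached Temperature 2 and never logged warning or error composite
temperature time.

```python
# Excerpt of the SMART/Health output quoted in this thread.
SAMPLE = """\
Critical Warning State:         0x00
 Temperature:                   0
Temperature:                    312 K, 38.85 C, 101.93 F
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 5231
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   41213
Total Time For Temperature 2:   0
"""


def parse_health(text):
    """Map top-level 'name: value' lines to strings (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if not line or line[0].isspace():
            continue  # indented lines are the critical-warning bit breakdown
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def thermal_summary(fields):
    """Pull out the fields that bear on the overheating theory."""
    kelvin = int(fields["Temperature"].split()[0])
    return {
        "celsius": round(kelvin - 273.15, 2),
        "reached_temperature_2": (
            int(fields["Temperature 2 Transition Count"]) > 0
            or int(fields["Total Time For Temperature 2"]) > 0
        ),
        "warning_time": int(fields["Warning Temp Composite Time"]),
        "error_time": int(fields["Error Temp Composite Time"]),
    }


summary = thermal_summary(parse_health(SAMPLE))
print(summary)
```

This only reports what the drive itself has logged; it says nothing about
temperature spikes shorter than the drive's own sampling.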

> On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
> 
>> I just upgraded my package build machine to:
>>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> from:
>>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> and I've had two nvme-triggered panics in the last day.
>>
>> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> dump, probably because the nvme device has gone away and I get an error
>> about not having a dump device.  It looks like a low-memory panic
>> because free memory is low and zfs is calling malloc().
>>
>> This shows up in the log leading up to the panic:
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>>
>> The device looks healthy to me:
>> SMART/Health Information Log
>> ============================
>> Critical Warning State:         0x00
>>  Available spare:               0
>>  Temperature:                   0
>>  Device reliability:            0
>>  Read only:                     0
>>  Volatile memory backup:        0
>> Temperature:                    312 K, 38.85 C, 101.93 F
>> Available spare:                100
>> Available spare threshold:      10
>> Percentage used:                3
>> Data units (512,000 byte) read: 5761183
>> Data units written:             29911502
>> Host read commands:             471921188
>> Host write commands:            605394753
>> Controller busy time (minutes): 32359
>> Power cycles:                   110
>> Power on hours:                 19297
>> Unsafe shutdowns:               14
>> Media errors:                   0
>> No. error info log entries:     0
>> Warning Temp Composite Time:    0
>> Error Temp Composite Time:      0
>> Temperature 1 Transition Count: 5231
>> Temperature 2 Transition Count: 0
>> Total Time For Temperature 1:   41213
>> Total Time For Temperature 2:   0
>>
>>
>>
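
For scale, the data unit counters in the quoted health page can be turned
into bytes. This is a small sketch that assumes the 512,000-byte unit size
printed in the output header applies to both the read and the written
counter; the function name `data_units_to_tb` is made up for illustration.

```python
# Unit size as printed by the health page: "Data units (512,000 byte)".
DATA_UNIT_BYTES = 512_000


def data_units_to_tb(units):
    """Convert a data unit counter to decimal terabytes (10**12 bytes)."""
    return units * DATA_UNIT_BYTES / 1e12


read_tb = data_units_to_tb(5761183)       # "Data units ... read"
written_tb = data_units_to_tb(29911502)   # "Data units written"
print(f"read: {read_tb:.2f} TB, written: {written_tb:.2f} TB")
# prints "read: 2.95 TB, written: 15.31 TB"
```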


Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?tkrat.76b39844cd6da514>
