Date:      Mon, 12 Feb 2024 22:03:29 -0500
From:      Mark Johnston <markj@freebsd.org>
To:        Don Lewis <truckman@freebsd.org>
Cc:        FreeBSD current <freebsd-current@freebsd.org>, John Baldwin <jhb@freebsd.org>
Subject:   Re: nvme controller reset failures on recent -CURRENT
Message-ID:  <ZcrcAbxh9Ii8hicC@nuc>
In-Reply-To: <tkrat.edddc2469f43baf6@FreeBSD.org>
References:  <tkrat.edddc2469f43baf6@FreeBSD.org>

On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote:
> I just upgraded my package build machine to:
>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> from:
>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> and I've had two nvme-triggered panics in the last day.
> 
> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> dump, probably because the nvme device has gone away and I get an error
> about not having a dump device.  It looks like a low-memory panic
> because free memory is low and zfs is calling malloc().
> 
> This shows up in the log leading up to the panic:
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.

Are you by chance using the drive mentioned here? https://github.com/openzfs/zfs/discussions/14793

I was bitten by that and ended up replacing the drive with a different
model.  The crash manifested exactly as you describe, though I didn't
have L2ARC or swap enabled on it.

> The device looks healthy to me:
> SMART/Health Information Log
> ============================
> Critical Warning State:         0x00
>  Available spare:               0
>  Temperature:                   0
>  Device reliability:            0
>  Read only:                     0
>  Volatile memory backup:        0
> Temperature:                    312 K, 38.85 C, 101.93 F
> Available spare:                100
> Available spare threshold:      10
> Percentage used:                3
> Data units (512,000 byte) read: 5761183
> Data units written:             29911502
> Host read commands:             471921188
> Host write commands:            605394753
> Controller busy time (minutes): 32359
> Power cycles:                   110
> Power on hours:                 19297
> Unsafe shutdowns:               14
> Media errors:                   0
> No. error info log entries:     0
> Warning Temp Composite Time:    0
> Error Temp Composite Time:      0
> Temperature 1 Transition Count: 5231
> Temperature 2 Transition Count: 0
> Total Time For Temperature 1:   41213
> Total Time For Temperature 2:   0
> 
> 

