Date: Tue, 13 Feb 2024 05:12:51 -0800 (PST)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Warner Losh <imp@bsdimp.com>
cc: Maxim Sobolev <sobomax@freebsd.org>, 
    FreeBSD current <freebsd-current@freebsd.org>, 
    John Baldwin <jhb@freebsd.org>

On 12 Feb, Warner Losh wrote:
> On Mon, Feb 12, 2024 at 9:15 PM Don Lewis <truckman@freebsd.org> wrote:
> 
>> On 12 Feb, Maxim Sobolev wrote:
>> > Might be overheating. Today's NVMe drives are notoriously flaky if you
>> > run them without a proper heat sink attached.
>>
>> I don't think it is a thermal problem.  According to the drive health
>> page, the device temperature has never reached Temperature 2, whatever
>> that is.  The room temperature is around 65F.  The system was stable
>> last summer when the room temperature spent a lot of time in the 80-85F
>> range.  The device temperature depends a lot on the I/O rate, and the
>> last panic happened when the I/O rate had been below 40tps for quite a
>> while.
>>
> 
> It did reach temperature 1, though. That's the 'Warning this drive is too
> hot' temperature. It has spent 41213 minutes of your 19297 hours of up
> time, or an average of 2 minutes per hour. That's too much. Temperature
> 2 is critical error: we are about to shut down completely due to it
> being too hot. It's only a couple degrees below hardware power off
> due to temperature in many drives. Some really cheap ones don't really
> implement it at all. On my card with the bad heat sink, Warning temp is
> 70C while critical is 75C while IIRC thermal shutdown is 78C or 80C.
> 
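A quick check of that ratio, as a Python sketch. The two figures are copied from the SMART/Health log quoted further down. The unit of the counter is an assumption: the paragraph above reads it as minutes, but nvmecontrol prints no unit, and I believe the NVMe spec defines the thermal management total time fields in seconds, so both readings are shown.

```python
# Figures copied from the SMART/Health log quoted further down.
temp1_total = 41213      # "Total Time For Temperature 1" (unit not printed)
power_on_hours = 19297   # "Power on hours"

# Reading the counter as minutes, as the quoted paragraph does.
minutes_per_hour = temp1_total / power_on_hours
fraction_if_minutes = temp1_total / (power_on_hours * 60)

# Reading the same counter as seconds (assumption, see above).
fraction_if_seconds = temp1_total / (power_on_hours * 3600)

print(f"{minutes_per_hour:.2f} per power-on hour")
print(f"{fraction_if_minutes:.2%} of power-on time if the unit is minutes")
print(f"{fraction_if_seconds:.3%} of power-on time if the unit is seconds")
```

Either way the drive spent only a small share of its power-on time in that state, but the two readings differ by a factor of 60.
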
> I don't think we report these values in nvmecontrol identify. But you can
> do a raw dump with -x and look at bytes 266:267 for warning and 268:269
> for critical.
> 
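For anyone decoding those bytes by hand, here is a minimal Python sketch. It assumes the Identify Controller data is already available as a bytes object (turning the -x hex dump into bytes is not shown, since I have not checked its exact layout), and that the two fields are little-endian 16-bit values in Kelvin, as the NVMe spec lays out its multi-byte fields. The function names and the sample thresholds are made up for illustration.

```python
import struct


def composite_temp_thresholds(identify: bytes):
    """Return (warning, critical) composite temperature thresholds in Kelvin."""
    if len(identify) < 270:
        raise ValueError("need at least 270 bytes of Identify Controller data")
    # Two consecutive little-endian 16-bit fields at byte offsets 266 and 268.
    warning_k, critical_k = struct.unpack_from("<HH", identify, 266)
    return warning_k, critical_k


def kelvin_to_celsius(kelvin):
    # Same offset the health page uses: it shows 312 K as 38.85 C.
    return kelvin - 273.15


# Synthetic buffer with made-up thresholds (343 K and 348 K), not real data.
buf = bytearray(4096)
struct.pack_into("<HH", buf, 266, 343, 348)

w, c = composite_temp_thresholds(bytes(buf))
print(f"warning:  {w} K = {kelvin_to_celsius(w):.2f} C")
print(f"critical: {c} K = {kelvin_to_celsius(c):.2f} C")
```
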
> In contrast, of the few dozen drives that I have, all of which have been
> abused in various ways, only one has any heat issues, and that one is
> an engineering special / sample with what I think is a damaged heat
> sink. If your card has no heat sink, this could well be what's going on.
> 
> This panic means "the nvme card lost its mind and stopped talking
> to the host". Its status registers read 0xff's, which means that the card
> isn't decoding bus signals. Usually this means that the firmware on the
> card has faulted and rebooted. If the card is overheating, then this could
> well be what's happening.
> 
> There's a tiny chance that this could be something more exotic,
> but my money is on hardware gone bad after 2 years of service. I don't think
> this is 'wear out' of the NAND (it's only 15TB written, but it could be if
> this drive is really really crappy NAND: first generation QLC maybe, but it
> seems too new). It might also be a connector problem that's developed over
> time.
> There might be a few other things too, but I don't think this is a U.2 drive
> with funky cables.
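
The 15TB figure can be reproduced from the health log quoted further down. A short Python sketch, assuming each data unit is 512,000 bytes as the "Data units (512,000 byte) read" label says, and that the written counter uses the same unit:

```python
DATA_UNIT_BYTES = 512_000  # per the "(512,000 byte)" label in the health log

# Counters copied from the SMART/Health log quoted further down.
data_units_read = 5_761_183
data_units_written = 29_911_502

bytes_read = data_units_read * DATA_UNIT_BYTES
bytes_written = data_units_written * DATA_UNIT_BYTES

print(f"read:    {bytes_read / 1e12:.2f} TB")
print(f"written: {bytes_written / 1e12:.2f} TB ({bytes_written / 2**40:.2f} TiB)")
```

That comes to a little over 15 TB written, against roughly 3 TB read.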

The system was probably idle the majority of those two years of power on
time.

It's one of these:
https://www.techpowerup.com/ssd-specs/intel-660p-512-gb.d437
I've seen comments that these generally don't need cooling.

I just ordered a heatsink with some nice big fins, but it will take a
week or more to arrive.

> 
>> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
>> >
>> >> I just upgraded my package build machine to:
>> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> >> from:
>> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> >> and I've had two nvme-triggered panics in the last day.
>> >>
>> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
>> >> dump, probably because the nvme device has gone away and I get an error
>> >> about not having a dump device.  It looks like a low-memory panic
>> >> because free memory is low and zfs is calling malloc().
>> >>
>> >> This shows up in the log leading up to the panic:
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>> >>
>> >> The device looks healthy to me:
>> >> SMART/Health Information Log
>> >> ============================
>> >> Critical Warning State:         0x00
>> >>  Available spare:               0
>> >>  Temperature:                   0
>> >>  Device reliability:            0
>> >>  Read only:                     0
>> >>  Volatile memory backup:        0
>> >> Temperature:                    312 K, 38.85 C, 101.93 F
>> >> Available spare:                100
>> >> Available spare threshold:      10
>> >> Percentage used:                3
>> >> Data units (512,000 byte) read: 5761183
>> >> Data units written:             29911502
>> >> Host read commands:             471921188
>> >> Host write commands:            605394753
>> >> Controller busy time (minutes): 32359
>> >> Power cycles:                   110
>> >> Power on hours:                 19297
>> >> Unsafe shutdowns:               14
>> >> Media errors:                   0
>> >> No. error info log entries:     0
>> >> Warning Temp Composite Time:    0
>> >> Error Temp Composite Time:      0
>> >> Temperature 1 Transition Count: 5231
>> >> Temperature 2 Transition Count: 0
>> >> Total Time For Temperature 1:   41213
>> >> Total Time For Temperature 2:   0
>> >>