From nobody Tue Feb 13 05:31:46 2024
X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4TYqhG66gnz59t3t
	for <freebsd-current@mlmmj.nyi.freebsd.org>; Tue, 13 Feb 2024 05:31:58 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Received: from mail-ed1-x52c.google.com (mail-ed1-x52c.google.com [IPv6:2a00:1450:4864:20::52c])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4TYqhG19tSz4MCG
	for <freebsd-current@freebsd.org>; Tue, 13 Feb 2024 05:31:58 +0000 (UTC)
	(envelope-from wlosh@bsdimp.com)
Authentication-Results: mx1.freebsd.org;
	none
Received: by mail-ed1-x52c.google.com with SMTP id 4fb4d7f45d1cf-55f0b2c79cdso5316519a12.3
        for <freebsd-current@freebsd.org>; Mon, 12 Feb 2024 21:31:58 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bsdimp-com.20230601.gappssmtp.com; s=20230601; t=1707802315; x=1708407115; darn=freebsd.org;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=EZ2/UX9FY6GAHdvcf41d4eoOarN0qHc7DamnD8KcfWs=;
        b=FWIal0ShMuakMuI7RVkkPGhExFGwsLXj3ZyyxhJbjVRVzV5AvUncCX4qx2LIIqIL5L
         V1LKjqt35VmuXoSxDc2tM1IaUJLUFBYszQkgBuHhkSXHYm3l/f/imW6nnqcJkQVmmZYJ
         jnV9AaDOQGFlINMsYaa9zmQ+9EqF4QLyPXKHrE4DEF2LbYqUUyOpJ11DUQrG44mX2hWf
         VUgXggIM5oSPOIJkbh1XIgrqxej4rl6tIFwFQ8ewVF3/j7Kb0NhPbX+SDGFZGegunoun
         N5zvc5tsuAK12CpIzdklCHnsTBeGp1wlvopiyACr0CDXO6n64bGvZsZjlBk0wkIJWjq4
         zXww==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1707802315; x=1708407115;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=EZ2/UX9FY6GAHdvcf41d4eoOarN0qHc7DamnD8KcfWs=;
        b=CJ/HAoHkuQUPXTKyYFFDT19MX63jskGd8j/Ea1+cSu6txQ9EbBxovQspAtADup6WHw
         XhsP8j7qGmYjx1JCBMgtLb1OPqG3GZkJxI/EX3zIl5d1XKLjbGtmjjpURvXWU5liojZX
         OPHdESZRsyDHadLE74x67kQunkkmbBHKaahUq8h5pHLpkIIizoEhG+sCagnqisu34hUI
         I1IpnAfyOXQOraw9+tNbNvv0VZ+WbhCjrCUbg2PkhkBpOWiP78TApC5LI5ZhTktNTFq1
         +V1KPiLOpEiY+ApyjWmIRSkQnFojWqfdRX5/j4N82mCnhlAbiEm3aHRThzvBCFBtHf3o
         YzFQ==
X-Forwarded-Encrypted: i=1; AJvYcCXKbxLfH8g3iUIBH+fm1SyKQ0hd/hv3ivU0d3EKYY0KwxBqvkfMoZVMJi0qDc5tuSGld8zJBTxLWhmgFDhHzTBAYfuTluS9GLg4RUE=
X-Gm-Message-State: AOJu0YzyRj1u1gaPjBu3YYzI1xWYAw4nJnnX5+XlHlOnJS87fIjfO8Oj
	DHDrwf87niojfw9NCpaxss9p6eDW8pNk3sC8JiCjpABS6w45I//1fafj2ZXcv+1UcHpoAVINV3n
	793MyL6x5Fb/bocRzjirikwKn6ZC+CxYnmCd+3Q==
X-Google-Smtp-Source: AGHT+IH24MSqMu9foMNXPi2k3TJBgXDWP9q5OQfbLJru0JzjcqimhHeFx/jif9Bru9nyp8flz1uR6Wo1M7w28HMQ87o=
X-Received: by 2002:aa7:d9cc:0:b0:561:ef01:3fa3 with SMTP id
 v12-20020aa7d9cc000000b00561ef013fa3mr629314eds.39.1707802315317; Mon, 12 Feb
 2024 21:31:55 -0800 (PST)
List-Id: Discussions about the use of FreeBSD-current <freebsd-current.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-current
List-Help: <mailto:freebsd-current+help@freebsd.org>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Subscribe: <mailto:freebsd-current+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-current+unsubscribe@freebsd.org>
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
References: <tkrat.edddc2469f43baf6@FreeBSD.org> <CAH7qZfunD154VYPD1vh_GNtOMM-quX=S00iQGvrpbhaegpXRnw@mail.gmail.com>
 <tkrat.76b39844cd6da514@FreeBSD.org>
In-Reply-To: <tkrat.76b39844cd6da514@FreeBSD.org>
From: Warner Losh <imp@bsdimp.com>
Date: Mon, 12 Feb 2024 22:31:46 -0700
Message-ID: <CANCZdfrKeHJg5Tt-3cUq9hBgwwNqF4qnOWyFpF=TUjMdANOMfg@mail.gmail.com>
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Don Lewis <truckman@freebsd.org>
Cc: Maxim Sobolev <sobomax@freebsd.org>, FreeBSD current <freebsd-current@freebsd.org>, 
	John Baldwin <jhb@freebsd.org>
Content-Type: multipart/alternative; boundary="00000000000059f3b006113cb621"
X-Spamd-Result: default: False [-4.00 / 15.00];
	REPLY(-4.00)[];
	ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]
X-Rspamd-Queue-Id: 4TYqhG19tSz4MCG
X-Spamd-Bar: ----
X-Rspamd-Pre-Result: action=no action;
	module=replies;
	Message is reply to one we originated

--00000000000059f3b006113cb621
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Feb 12, 2024 at 9:15 PM Don Lewis <truckman@freebsd.org> wrote:

> On 12 Feb, Maxim Sobolev wrote:
> > Might be an overheating. Today's nvme drives are notoriously flaky if
> > you run them without proper heat sink attached to it.
>
> I don't think it is a thermal problem.  According to the drive health
> page, the device temperature has never reached Temperature 2, whatever
> that is.  The room temperature is around 65F.  The system was stable
> last summer when the room temperature spent a lot of time in the 80-85F
> range.  The device temperature depends a lot on the I/O rate, and the
> last panic happened when the I/O rate had been below 40tps for quite a
> while.
>

It did reach Temperature 1, though. That's the 'Warning: this drive is too
hot' temperature. The drive has spent 41213 minutes of its 19297 power-on
hours there, an average of about 2 minutes per hour. That's too much.
Temperature 2 is the critical threshold: the drive is about to shut down
completely because it is too hot. In many drives it's only a couple of
degrees below the hardware thermal power-off. Some really cheap ones don't
really implement it at all. On my card with the bad heat sink, the warning
temp is 70C, critical is 75C, and IIRC thermal shutdown is 78C or 80C.
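
As a rough check of that arithmetic, here is a minimal Python sketch. The
function names are made up for illustration, and it assumes the
Temperature 1 counter is in minutes, as read above; if the drive reports
it in seconds, the same ratio is in seconds per hour.

```python
def time_per_power_on_hour(total_time, power_on_hours):
    # Average time at Temperature 1 per hour the drive was powered on.
    return total_time / power_on_hours


def fraction_of_uptime(total_minutes, power_on_hours):
    # Share of powered-on time spent at Temperature 1 (minutes assumed).
    return total_minutes / (power_on_hours * 60)


# Values taken from the SMART/Health log quoted further down.
print(round(time_per_power_on_hour(41213, 19297), 2))  # 2.14
print(round(fraction_of_uptime(41213, 19297) * 100, 1))  # 3.6 (percent)
```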

I don't think we report these values in nvmecontrol identify. But you can
do a raw dump with -x and look at bytes 266:267 for the warning threshold
and 268:269 for the critical one.
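
To make those offsets concrete, a hedged Python sketch of decoding the two
fields. It assumes you already have the identify-controller data as a bytes
object (turning the -x hex dump into bytes is left out), and that both
fields are little-endian 16-bit values in Kelvin, which is how the NVMe
spec defines WCTEMP and CCTEMP. The function names are hypothetical.

```python
def kelvin_field(identify_data, offset):
    # Two-byte little-endian field, e.g. bytes 266:267 for offset 266.
    return int.from_bytes(identify_data[offset:offset + 2], "little")


def thresholds_celsius(identify_data):
    # Returns (warning, critical) in degrees C.
    # A raw value of 0 means the drive does not report that threshold.
    return (
        kelvin_field(identify_data, 266) - 273.15,
        kelvin_field(identify_data, 268) - 273.15,
    )
```

With raw values of 343 K and 348 K this gives roughly 69.85 and 74.85,
which is the 70C / 75C pair mentioned above.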

In contrast, of the few dozen drives that I have, all of which have been
abused in various ways, only one has any heat issues, and that one is an
engineering special / sample with what I think is a damaged heat sink. If
your card has no heat sink, this could well be what's going on.

This panic means "the nvme card lost its mind and stopped talking
to the host". Its status registers read 0xff's, which means that the card
isn't decoding bus signals. Usually this means that the firmware on the
card has faulted and rebooted. If the card is overheating, then this could
well be what's happening.

There's a tiny chance that this could be something more exotic, but my
money is on hardware gone bad after 2 years of service. I don't think this
is 'wear out' of the NAND (it's only 15TB written, but it could be if this
drive is really really crappy NAND: first generation QLC maybe, but it
seems too new). It might also be a connector problem that's developed over
time. There might be a few other things too, but I don't think this is a
U.2 drive with funky cables.
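
For reference, the 15TB figure comes from the "Data units written" counter
in the quoted log. A small Python sketch of the conversion, assuming the
512,000-byte unit shown in the log header and decimal terabytes (the
function name is just for illustration):

```python
def data_units_to_tb(data_units):
    # Each data unit is 512,000 bytes per the log header; TB is 10**12.
    return data_units * 512_000 / 10**12


print(round(data_units_to_tb(29911502), 1))  # 15.3
```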

Warner


> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
> >
> >> I just upgraded my package build machine to:
> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> >> from:
> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> >> and I've had two nvme-triggered panics in the last day.
> >>
> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> >> dump, probably because the nvme device has gone away and I get an error
> >> about not having a dump device.  It looks like a low-memory panic
> >> because free memory is low and zfs is calling malloc().
> >>
> >> This shows up in the log leading up to the panic:
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
> >>
> >> The device looks healthy to me:
> >> SMART/Health Information Log
> >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D
> >> Critical Warning State:         0x00
> >>  Available spare:               0
> >>  Temperature:                   0
> >>  Device reliability:            0
> >>  Read only:                     0
> >>  Volatile memory backup:        0
> >> Temperature:                    312 K, 38.85 C, 101.93 F
> >> Available spare:                100
> >> Available spare threshold:      10
> >> Percentage used:                3
> >> Data units (512,000 byte) read: 5761183
> >> Data units written:             29911502
> >> Host read commands:             471921188
> >> Host write commands:            605394753
> >> Controller busy time (minutes): 32359
> >> Power cycles:                   110
> >> Power on hours:                 19297
> >> Unsafe shutdowns:               14
> >> Media errors:                   0
> >> No. error info log entries:     0
> >> Warning Temp Composite Time:    0
> >> Error Temp Composite Time:      0
> >> Temperature 1 Transition Count: 5231
> >> Temperature 2 Transition Count: 0
> >> Total Time For Temperature 1:   41213
> >> Total Time For Temperature 2:   0
> >>
> >>
> >>
>
>
>

--00000000000059f3b006113cb621--