From nobody Tue Feb 13 05:31:46 2024
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
Sender: owner-freebsd-current@freebsd.org
MIME-Version: 1.0
From: Warner Losh
Date: Mon, 12 Feb 2024 22:31:46 -0700
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Don Lewis
Cc: Maxim Sobolev, FreeBSD current, John Baldwin
Content-Type: multipart/alternative; boundary="00000000000059f3b006113cb621"

--00000000000059f3b006113cb621
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Feb 12, 2024 at 9:15 PM Don Lewis wrote:

> On 12 Feb, Maxim Sobolev wrote:
> > Might be an
overheating. Today's nvme drives are notoriously flaky if you
> > run them without proper heat sink attached to it.
>
> I don't think it is a thermal problem.  According to the drive health
> page, the device temperature has never reached Temperature 2, whatever
> that is.  The room temperature is around 65F.  The system was stable
> last summer when the room temperature spent a lot of time in the 80-85F
> range.  The device temperature depends a lot on the I/O rate, and the
> last panic happened when the I/O rate had been below 40tps for quite a
> while.
>

It did reach Temperature 1, though. That's the 'Warning: this drive is
too hot' temperature. The drive has spent 41213 minutes there out of your
19297 hours of uptime, an average of about 2 minutes per hour. That's too
much. Temperature 2 is the critical threshold: the drive is about to shut
down completely because it is too hot. In many drives it's only a couple
of degrees below the hardware power-off temperature. Some really cheap
ones don't implement it at all. On my card with the bad heat sink, the
warning temperature is 70C, critical is 75C, and IIRC thermal shutdown is
78C or 80C. I don't think we report these values in nvmecontrol identify,
but you can do a raw dump with -x and look at bytes 266:267 for warning
and 268:269 for critical.

In contrast, I have a few dozen drives, all of which have been abused in
various ways, and only one of them has any heat issues. That one is an
engineering special / sample with what I think is a damaged heat sink. If
your card has no heat sink, this could well be what's going on.

This panic means "the nvme card lost its mind and stopped talking to the
host". Its status registers read 0xff's, which means that the card isn't
decoding bus signals. Usually this means that the firmware on the card has
faulted and rebooted. If the card is overheating, then this could well be
what's happening.
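
For anyone who wants to check these numbers, here is a rough sketch of the
arithmetic and byte offsets mentioned above. It assumes the raw Identify
Controller data has already been captured into a bytes object (for example
by parsing the -x hex dump), that the two thresholds at bytes 266:267 and
268:269 are little-endian values in Kelvin, and that a controller register
is read as a 32-bit value. The function names and the sample buffer are
made up for illustration; they are not part of nvmecontrol or the driver.

```python
import struct

ALL_ONES_32 = 0xFFFFFFFF  # assumed 32-bit register width


def thermal_thresholds(identify: bytes) -> dict:
    """Decode the warning and critical temperature thresholds.

    Assumes `identify` is the raw Identify Controller data, with the
    warning threshold at bytes 266:267 and the critical threshold at
    bytes 268:269, both little-endian Kelvin. A value of 0 is treated
    as "not reported".
    """
    if len(identify) < 270:
        raise ValueError("identify buffer too short")
    warning_k, critical_k = struct.unpack_from("<HH", identify, 266)

    def to_celsius(kelvin):
        return None if kelvin == 0 else round(kelvin - 273.15, 2)

    return {
        "warning_c": to_celsius(warning_k),
        "critical_c": to_celsius(critical_k),
    }


def minutes_per_hour(total_minutes: int, power_on_hours: int) -> float:
    """Average time per powered-on hour spent above a threshold."""
    return total_minutes / power_on_hours


def controller_gone(register_value: int) -> bool:
    """True when a register read comes back as all 0xff bytes."""
    return register_value == ALL_ONES_32


# Synthetic buffer with hypothetical thresholds of 343 K and 348 K.
sample = bytearray(4096)
struct.pack_into("<HH", sample, 266, 343, 348)
print(thermal_thresholds(bytes(sample)))

# Figures from the SMART/Health log quoted in this thread.
print(round(minutes_per_hour(41213, 19297), 2))
```

The ratio uses the "Total Time For Temperature 1" and "Power on hours"
fields from the quoted health log, read the same way as in the text above.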
There's a tiny chance that this could be something more exotic, but my
money is on hardware gone bad after 2 years of service. I don't think this
is 'wear out' of the NAND: it's only 15TB written. It could be if this
drive has really, really crappy NAND (first generation QLC, maybe), but it
seems too new for that. It might also be a connector problem that's
developed over time. There might be a few other things too, but I don't
think this is a U.2 drive with funky cables.

Warner

> > On Mon, Feb 12, 2024, 4:28 PM Don Lewis wrote:
> >
> >> I just upgraded my package build machine to:
> >>   FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
> >> from:
> >>   FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
> >> and I've had two nvme-triggered panics in the last day.
> >>
> >> nvme is being used for swap and L2ARC.  I'm not able to get a crash
> >> dump, probably because the nvme device has gone away and I get an error
> >> about not having a dump device.  It looks like a low-memory panic
> >> because free memory is low and zfs is calling malloc().
> >>
> >> This shows up in the log leading up to the panic:
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
> >> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
> >> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
> >> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
> >> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
> >>
> >> The device looks healthy to me:
> >> SMART/Health Information Log
> >> ============================
> >> Critical Warning State:         0x00
> >>  Available spare:               0
> >>  Temperature:                   0
> >>  Device reliability:            0
> >>  Read only:                     0
> >>  Volatile memory backup:        0
> >> Temperature:                    312 K, 38.85 C, 101.93 F
> >> Available spare:                100
> >> Available spare threshold:      10
> >> Percentage used:                3
> >> Data units (512,000 byte) read: 5761183
> >> Data units written:             29911502
> >> Host read commands:             471921188
> >> Host write commands:            605394753
> >> Controller busy time (minutes): 32359
> >> Power cycles:                   110
> >> Power on hours:                 19297
> >> Unsafe shutdowns:               14
> >> Media errors:                   0
> >> No. error info log entries:     0
> >> Warning Temp Composite Time:    0
> >> Error Temp Composite Time:      0
> >> Temperature 1 Transition Count: 5231
> >> Temperature 2 Transition Count: 0
> >> Total Time For Temperature 1:   41213
> >> Total Time For Temperature 2:   0
> >>
> >>
> >>
> >
>

--00000000000059f3b006113cb621
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable




--00000000000059f3b006113cb621--