From: Maxim Sobolev
Date: Mon, 12 Feb 2024 18:56:34 -0800
Subject: Re: nvme controller reset failures on recent -CURRENT
To: Don Lewis
Cc: FreeBSD current, John Baldwin
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

Might be overheating. Today's nvme drives are notoriously flaky if you run them without a proper heat sink attached.

-Max


On Mon, Feb 12, 2024, 4:28 PM Don Lewis <truckman@freebsd.org> wrote:
I just upgraded my package build machine to:
  FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
from:
  FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
and I've had two nvme-triggered panics in the last day.

nvme is being used for swap and L2ARC.  I'm not able to get a crash
dump, probably because the nvme device has gone away and I get an error
about not having a dump device.  It looks like a low-memory panic
because free memory is low and zfs is calling malloc().

This shows up in the log leading up to the panic:
Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
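
For anyone comparing such excerpts across boots, here is a minimal Python sketch that tallies how often each message occurred, folding syslogd's "last message repeated N times" lines back into the preceding message. The log text is the excerpt above with the wrapped lines rejoined; the names `LOG`, `LINE_RE`, `REPEAT_RE` and `count_messages` are made up for illustration and are not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Excerpt from the report above, with wrapped lines rejoined.
LOG = """\
Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
"""

# timestamp, host, tag, message
LINE_RE = re.compile(r"^(\w{3} +\d+ [\d:]{8}) (\S+) (\S+): (.*)$")
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")


def count_messages(log_text):
    """Count messages, crediting 'repeated N times' to the previous message.

    Assumption: a syslogd repeat line always refers to the line directly
    before it in the excerpt.
    """
    counts = Counter()
    previous = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        _, _, tag, message = match.groups()
        repeat = REPEAT_RE.fullmatch(message)
        if tag == "syslogd" and repeat and previous is not None:
            counts[previous] += int(repeat.group(1))
        else:
            counts[message] += 1
            previous = message
    return counts


if __name__ == "__main__":
    for message, n in count_messages(LOG).most_common():
        print(f"{n:3d}  {message}")
```

All of these events carry the same one-second timestamp, so the tally only shows which message dominated, not how long the reset attempt took.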

The device looks healthy to me:
SMART/Health Information Log
============================
Critical Warning State:         0x00
 Available spare:               0
 Temperature:                   0
 Device reliability:            0
 Read only:                     0
 Volatile memory backup:        0
Temperature:                    312 K, 38.85 C, 101.93 F
Available spare:                100
Available spare threshold:      10
Percentage used:                3
Data units (512,000 byte) read: 5761183
Data units written:             29911502
Host read commands:             471921188
Host write commands:            605394753
Controller busy time (minutes): 32359
Power cycles:                   110
Power on hours:                 19297
Unsafe shutdowns:               14
Media errors:                   0
No. error info log entries:     0
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 5231
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   41213
Total Time For Temperature 2:   0
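
To put the numbers above into more familiar units, here is a small Python sketch under stated assumptions. The constants are copied from the log; the 512,000-byte unit size comes from the log's own label; treating "Total Time For Temperature 1" as seconds is my reading of the NVMe SMART log definition and should be checked against the spec. All function and variable names are illustrative only.

```python
# Values copied from the SMART/Health log quoted above.
TEMP_KELVIN = 312
DATA_UNITS_READ = 5761183
DATA_UNITS_WRITTEN = 29911502
TEMP1_TOTAL_TIME = 41213       # assumption: reported in seconds
POWER_ON_HOURS = 19297

BYTES_PER_DATA_UNIT = 512_000  # from the "(512,000 byte)" label in the log


def kelvin_to_celsius(kelvin):
    return kelvin - 273.15


def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


def data_units_to_tb(units):
    """Convert data units to decimal terabytes (10**12 bytes)."""
    return units * BYTES_PER_DATA_UNIT / 1e12


celsius = kelvin_to_celsius(TEMP_KELVIN)
temp1_hours = TEMP1_TOTAL_TIME / 3600
temp1_share = 100 * temp1_hours / POWER_ON_HOURS

if __name__ == "__main__":
    print(f"temperature: {celsius:.2f} C, {celsius_to_fahrenheit(celsius):.2f} F")
    print(f"read:        {data_units_to_tb(DATA_UNITS_READ):.2f} TB")
    print(f"written:     {data_units_to_tb(DATA_UNITS_WRITTEN):.2f} TB")
    print(f"time in temperature 1 state: {temp1_hours:.1f} h "
          f"({temp1_share:.3f}% of power-on time)")
```

The composite temperature is a snapshot taken when the log page was read, so it says little about the temperature under load. The nonzero "Temperature 1" counters may be more relevant to the overheating theory than the single reading, if they mean what I assume they do.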

