From: Warner Losh
Date: Thu, 7 Dec 2023 16:59:08 -0700
Subject: Re: nvme timeout issues with hardware and bhyve vm's
To: Tomoaki AOKI
Cc: freebsd-current@freebsd.org
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current
References: <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org> <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org> <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org> <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
In-Reply-To: <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>

On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright <pete@nomadlogic.org> wrote:
>
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> > >     the messages i posted in the start of the thread are from the VM itself
> > >     (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
> > >     such issues.
> > >
> > >     Based on your comment about the improvements in 14 I'll focus my efforts
> > >     on my workstation, it seemed to happen regularly so hopefully i can find
> > >     a repro case.
> > >
> > > Let me know if you see similar messages in stable/14. I think I've fixed
> > > all the issues with timeouts, though you shouldn't ever see them in a vm
> > > setup unless something else weird is going on.
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> > Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> > --
> > Pete Wright
> > pete@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or

Yes. Most drives' firmware resets when it overheats. There might be something
that the PCI code can do when this happens to retrain the link, reprogram the
config registers, etc.

>  *Unstable physical connection (bad contact)

Yeah, a hot-plug controller is required for this, but this will be bouncing.

Warner

