Date:      Thu, 7 Dec 2023 16:59:08 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
Cc:        freebsd-current@freebsd.org
Subject:   Re: nvme timeout issues with hardware and bhyve vm's
Message-ID:  <CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw@mail.gmail.com>
In-Reply-To: <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
References:  <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org> <CANCZdfrQTd3F-j81HsamUCJG4DyUk_-yPOtbZY4Q926_ihatsQ@mail.gmail.com> <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org> <CANCZdfrwzmZ=iHj_vm2nsi72ceRQ81KY5DjiuML3udEaWTBanA@mail.gmail.com> <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org> <CANCZdfpgw_sm4couYx9%2Bcgp-q_2jmPC2Q7TSeD9Yb3VYoiDQhQ@mail.gmail.com> <ec08484d-b49f-4aa3-adf4-b96570083b9c@nomadlogic.org> <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>


On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright <pete@nomadlogic.org> wrote:
>
> >
> >
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> >
> > >
> > >     the messages i posted in the start of the thread are from the VM
> > >     itself (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE)
> > >     showed no such issues.
> > >
> > >     Based on your comment about the improvements in 14 I'll focus my
> > >     efforts on my workstation, it seemed to happen regularly so
> > >     hopefully i can find a repro case.
> > >
> > >
> > > Let me know if you see similar messages in stable/14. I think I've
> > > fixed all the issues with timeouts, though you shouldn't ever see
> > > them in a vm setup unless something else weird is going on.
> > >
> >
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> > Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> >
> > --
> > Pete Wright
> > pete@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or
>

Yes. Most drives' firmware resets when it overheats. There might be
something that the PCI code can do when this happens to retrain the link,
reprogram the config registers, etc.


>  *Unstable physical connection (bad contact)
>

Yeah, a hot plug controller is required for this, but this will be bouncing.

Warner
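
For anyone following the thread who wants to see how often the reset
message quoted above shows up in saved console or dmesg output, here is a
minimal sketch. It assumes the log is plain text and that the message
wording matches the line Pete quoted; the wording may differ between
FreeBSD versions, and the function name count_nvme_resets is purely
illustrative, not part of any FreeBSD tool.

```python
import re

# Pattern based on the message quoted in this thread. The exact wording is
# an assumption and may differ between FreeBSD versions.
RESET_RE = re.compile(r"\b(nvme\d+): Resetting controller due to a timeout")


def count_nvme_resets(log_text):
    """Return a dict mapping controller name (e.g. 'nvme0') to reset count."""
    counts = {}
    for line in log_text.splitlines():
        # search() rather than match() so syslog-style prefixes are tolerated
        match = RESET_RE.search(line)
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts


sample = """\
nvme0: Resetting controller due to a timeout and possible hot unplug.
some unrelated line
kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
"""

print(count_nvme_resets(sample))
```

A count that keeps growing on one controller would fit either of the
causes discussed above (a drive resetting itself or a bouncing
connection), but it does not distinguish between them.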



