Date:      Thu, 7 Dec 2023 16:59:08 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
Cc:        freebsd-current@freebsd.org
Subject:   Re: nvme timeout issues with hardware and bhyve vm's
Message-ID:  <CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw@mail.gmail.com>
In-Reply-To: <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
References:  <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org> <CANCZdfrQTd3F-j81HsamUCJG4DyUk_-yPOtbZY4Q926_ihatsQ@mail.gmail.com> <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org> <CANCZdfrwzmZ=iHj_vm2nsi72ceRQ81KY5DjiuML3udEaWTBanA@mail.gmail.com> <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org> <CANCZdfpgw_sm4couYx9%2Bcgp-q_2jmPC2Q7TSeD9Yb3VYoiDQhQ@mail.gmail.com> <ec08484d-b49f-4aa3-adf4-b96570083b9c@nomadlogic.org> <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>

[-- Attachment #1 --]
On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright <pete@nomadlogic.org> wrote:
>
> >
> >
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> >
> > >
> > >     the messages i posted in the start of the thread are from the VM
> > >     itself (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE)
> > >     showed no such issues.
> > >
> > >     Based on your comment about the improvements in 14 I'll focus my
> > >     efforts on my workstation, it seemed to happen regularly so
> > >     hopefully i can find a repro case.
> > >
> > >
> > > Let me know if you see similar messages in stable/14. I think I've
> > > fixed all the issues with timeouts, though you shouldn't ever see them
> > > in a vm setup unless something else weird is going on.
> > >
> >
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> >   Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> >
> > --
> > Pete Wright
> > pete@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or
>

Yes. Most drives' firmware resets the drive when it overheats. There might
be something the PCI code can do when this happens: retrain the link,
reprogram the config registers, etc.
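
To check the overheating theory after the fact, the SMART / Health
Information log page (log page 02h in the NVMe specification) is the place
to look; `nvmecontrol logpage -p 2 nvme0` prints it in readable form. The
Python sketch below is only an illustration of which fields matter, not
part of any FreeBSD tool. The function name `decode_thermal_fields` and the
synthetic page are made up for the example, and the byte offsets are taken
from the NVMe base specification's layout for this page, so verify them
against the revision your drive implements.

```python
import struct

# NVMe reports the composite temperature in whole kelvins;
# subtracting 273 is the usual integer conversion to Celsius.
KELVIN_OFFSET = 273


def decode_thermal_fields(page: bytes) -> dict:
    """Pull the temperature-related fields out of a raw SMART/Health log page.

    Hypothetical helper: `page` is assumed to be the raw log page 02h
    contents, however they were obtained.
    """
    if len(page) < 200:
        raise ValueError("SMART/Health log page is too short")

    critical_warning = page[0]                            # byte 0
    (temp_kelvin,) = struct.unpack_from("<H", page, 1)    # bytes 2:1
    # bytes 195:192 and 199:196: minutes spent above the warning and
    # critical composite temperature thresholds
    warn_minutes, crit_minutes = struct.unpack_from("<II", page, 192)

    return {
        # bit 1 of Critical Warning: temperature outside its thresholds
        "temperature_warning": bool(critical_warning & 0x02),
        "composite_temp_c": temp_kelvin - KELVIN_OFFSET,
        "warning_temp_minutes": warn_minutes,
        "critical_temp_minutes": crit_minutes,
    }


# Synthetic page for demonstration only, not data from a real drive.
page = bytearray(512)
page[0] = 0x02
struct.pack_into("<H", page, 1, 358)
struct.pack_into("<II", page, 192, 12, 3)
print(decode_thermal_fields(bytes(page)))
```

Non-zero warning or critical temperature minutes that keep growing between
lockups would point toward a thermal cause rather than a driver problem.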


>  *Unstable physical connection (bad contact)
>

Yeah, a hot plug controller is required for this, but the connection will be
bouncing.

Warner

Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw>