Date: Thu, 7 Dec 2023 16:59:08 -0700
From: Warner Losh <imp@bsdimp.com>
To: Tomoaki AOKI <junchoon@dec.sakura.ne.jp>
Cc: freebsd-current@freebsd.org
Subject: Re: nvme timeout issues with hardware and bhyve vm's
Message-ID: <CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw@mail.gmail.com>
In-Reply-To: <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
References: <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org>
 <CANCZdfrQTd3F-j81HsamUCJG4DyUk_-yPOtbZY4Q926_ihatsQ@mail.gmail.com>
 <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org>
 <CANCZdfrwzmZ=iHj_vm2nsi72ceRQ81KY5DjiuML3udEaWTBanA@mail.gmail.com>
 <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org>
 <CANCZdfpgw_sm4couYx9%2Bcgp-q_2jmPC2Q7TSeD9Yb3VYoiDQhQ@mail.gmail.com>
 <ec08484d-b49f-4aa3-adf4-b96570083b9c@nomadlogic.org>
 <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>

On Thu, Dec 7, 2023 at 4:09 PM Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote:

> On Thu, 7 Dec 2023 14:38:37 -0800
> Pete Wright <pete@nomadlogic.org> wrote:
>
> > On 10/13/23 7:34 PM, Warner Losh wrote:
> > >
> > >     the messages i posted in the start of the thread are from the VM itself
> > >     (13.2-RELEASE).  The zpool on the hypervisor (13.2-RELEASE) showed no
> > >     such issues.
> > >
> > >     Based on your comment about the improvements in 14 I'll focus my efforts
> > >     on my workstation, it seemed to happen regularly so hopefully i can find
> > >     a repro case.
> > >
> > > Let me know if you see similar messages in stable/14. I think I've fixed
> > > all the issues with timeouts, though you shouldn't ever see them in a vm
> > > setup unless something else weird is going on.
> >
> > Hi Warner, just resurfacing this thread because I've had a few lockups
> > on my workstation running 14.0-STABLE.  I was able to capture a photo of
> > the hang and this seems to be the most important line:
> >
> > nvme0: Resetting controller due to a timeout and possible hot unplug.
> >
> > When I scan the device after reboot I don't see any errors, but if there
> > is a particular thing I should check via nvmecontrol please let me know.
> > Also, since it mentions possible hot unplug I wonder if this is
> > hardware/firmware related to my system?
> >
> > Anyway, haven't found a repro case yet but it has locked up a few times
> > the past two weeks.
> >
> > -pete
> >
> > --
> > Pete Wright
> > pete@nomadlogic.org
>
> If I myself encounter this kind of problem ON BARE METAL HARDWARE,
> I would usually suspect
>
>  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or

Yes. Most drives' firmware resets when it overheats. There might be something
that the pci code can do when this happens to retrain the link, reprogram the
config registers, etc.

>  *Unstable physical connection (bad contact)

Yeah, a hot plug controller is required for this, but this will be bouncing.

Warner
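
On the overheating theory and the question of what to check via nvmecontrol: one candidate is the SMART/health log page (log page 2), which `nvmecontrol logpage -p 2 nvme0` prints. The following is only a minimal sketch of pulling the composite temperature out of that output. The function names, the sample text, the exact layout of the "Temperature:" line, and the 70 C limit are all assumptions made for illustration, not something stated in this thread, and the output format may differ between releases.

```python
import re
import subprocess

# Hypothetical sample; the real layout of nvmecontrol's health page
# is assumed here and may differ between FreeBSD releases.
SAMPLE_LOG = """\
SMART/Health Information Log
============================
Critical Warning State:         0x00
Temperature:                    345 K, 71.85 C, 161.33 F
Available spare:                100
"""


def read_health_log(dev="nvme0"):
    """Return the text of log page 2 (SMART/health) for one controller."""
    result = subprocess.run(
        ["nvmecontrol", "logpage", "-p", "2", dev],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def composite_temp_kelvin(log_text):
    """Extract the composite temperature in kelvin, or None if absent."""
    match = re.search(r"^\s*Temperature:\s+(\d+)\s*K", log_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_hot(log_text, limit_celsius=70.0):
    """True if the reported temperature is above limit_celsius.

    The default limit is an arbitrary placeholder, not a vendor figure.
    """
    kelvin = composite_temp_kelvin(log_text)
    return kelvin is not None and (kelvin - 273.15) > limit_celsius
```

On a live system this would be used as `is_hot(read_health_log("nvme0"))`, ideally sampled periodically so that a reading taken shortly before a hang is available, since the temperature after a reboot says little about the state at the time of the reset.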