Date:      Thu, 7 Dec 2023 17:33:02 -0800
From:      Bakul Shah <bakul@iitbombay.org>
To:        Maxim Sobolev <sobomax@freebsd.org>
Cc:        Warner Losh <imp@bsdimp.com>, Tomoaki AOKI <junchoon@dec.sakura.ne.jp>, FreeBSD Current <freebsd-current@freebsd.org>
Subject:   Re: nvme timeout issues with hardware and bhyve vm's
Message-ID:  <BA104206-C41C-4A36-A0B1-D5735C2FCAAC@iitbombay.org>
In-Reply-To: <CAH7qZfuC8WHUpSvsT2tQo-9txcWkTg84GXbGHR5uBXtQaFw1aQ@mail.gmail.com>
References:  <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org> <CANCZdfrQTd3F-j81HsamUCJG4DyUk_-yPOtbZY4Q926_ihatsQ@mail.gmail.com> <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org> <CANCZdfrwzmZ=iHj_vm2nsi72ceRQ81KY5DjiuML3udEaWTBanA@mail.gmail.com> <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org> <CANCZdfpgw_sm4couYx9%2Bcgp-q_2jmPC2Q7TSeD9Yb3VYoiDQhQ@mail.gmail.com> <ec08484d-b49f-4aa3-adf4-b96570083b9c@nomadlogic.org> <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp> <CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw@mail.gmail.com> <10FD2FC6-1F39-4F7D-8BA8-976ADC0AE37A@iitbombay.org> <CAH7qZfuC8WHUpSvsT2tQo-9txcWkTg84GXbGHR5uBXtQaFw1aQ@mail.gmail.com>


Thanks.

It may be worth checking the temperature periodically and warning the
user if it is too high (70°C+ or something). Even for devices that
allow internal throttling, a user might wish to know whether the device
needs a (better) heatsink.
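
A rough sketch of such a periodic check, in Python. This is only an
illustration under stated assumptions, not existing FreeBSD code: the
names check_temp, monitor and read_kelvin are made up here, and
read_kelvin stands for whatever returns the drive's composite
temperature (NVMe health data reports it in Kelvin), for example a
wrapper around the health log page that nvmecontrol can print. The 70 C
threshold is just the figure mentioned above.

```python
import time
from typing import Callable, Optional

WARN_C = 70.0  # assumed warning threshold (the "70 C+ or something" figure)


def kelvin_to_celsius(kelvin: float) -> float:
    """NVMe health data reports composite temperature in Kelvin."""
    return kelvin - 273.15


def check_temp(temp_c: float, warn_c: float = WARN_C) -> Optional[str]:
    """Return a warning string if temp_c is at or above warn_c, else None."""
    if temp_c >= warn_c:
        return (f"nvme temperature {temp_c:.1f} C >= {warn_c:.1f} C: "
                "check heatsink/airflow")
    return None


def monitor(read_kelvin: Callable[[], float],
            interval_s: float = 10.0,
            iterations: Optional[int] = None,
            warn: Callable[[str], None] = print) -> None:
    """Poll read_kelvin (a hypothetical reader supplied by the caller)
    every interval_s seconds and pass any warning to warn().
    iterations=None means poll forever."""
    n = 0
    while iterations is None or n < iterations:
        msg = check_temp(kelvin_to_celsius(read_kelvin()))
        if msg:
            warn(msg)
        n += 1
        if iterations is None or n < iterations:
            time.sleep(interval_s)
```

The reader is passed in so the polling logic can be exercised without
hardware, e.g. monitor(lambda: 350.0, interval_s=0, iterations=1).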

> On Dec 7, 2023, at 5:02 PM, Maxim Sobolev <sobomax@freebsd.org> wrote:
>
> How quickly it heats up depends on lots of factors. Usually those
> devices burn some 3-7 watts per stick at 100% load, so maybe this
> would give you some idea. At least some of them support several
> toggleable performance modes, which use throttling internally to
> limit power consumption to a certain level (man nvmecontrol). That
> recently helped me make a system stable that would otherwise hang
> with timeouts after reaching 70-75C, until I got the chance to take
> it apart and attach heatsinks to the NVMe drives. Once the
> temperature dropped to <= 50C the drives became 100% stable.
>
> -Max
>
> On Thu, Dec 7, 2023, 4:07 PM Bakul Shah <bakul@iitbombay.org> wrote:
>> On Dec 7, 2023, at 3:59 PM, Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> >  *Overheating caused hang of NVMe controller or PCI bridge on SSD, or
>> >
>> > Yes. Most drives' firmware resets when it overheats. There might be
>> > something that the PCI code can do when this happens to retrain the
>> > link, reprogram the config registers, etc.
>>
>> How quickly can the device heat up? Can it be queried frequently
>> enough to act, by throttling I/O, before it overheats?




