Date: Thu, 7 Dec 2023 17:33:02 -0800
From: Bakul Shah <bakul@iitbombay.org>
To: Maxim Sobolev <sobomax@freebsd.org>
Cc: Warner Losh <imp@bsdimp.com>, Tomoaki AOKI <junchoon@dec.sakura.ne.jp>, FreeBSD Current <freebsd-current@freebsd.org>
Subject: Re: nvme timeout issues with hardware and bhyve vm's
Message-ID: <BA104206-C41C-4A36-A0B1-D5735C2FCAAC@iitbombay.org>
In-Reply-To: <CAH7qZfuC8WHUpSvsT2tQo-9txcWkTg84GXbGHR5uBXtQaFw1aQ@mail.gmail.com>
References: <90d3e532-8ea7-4eea-8e31-8c363285a156@nomadlogic.org>
 <CANCZdfrQTd3F-j81HsamUCJG4DyUk_-yPOtbZY4Q926_ihatsQ@mail.gmail.com>
 <0ad493d5-1c1e-4370-977a-118f46ebd677@nomadlogic.org>
 <CANCZdfrwzmZ=iHj_vm2nsi72ceRQ81KY5DjiuML3udEaWTBanA@mail.gmail.com>
 <0c4f8149-89dd-4635-a5ed-4766fffd2553@nomadlogic.org>
 <CANCZdfpgw_sm4couYx9%2Bcgp-q_2jmPC2Q7TSeD9Yb3VYoiDQhQ@mail.gmail.com>
 <ec08484d-b49f-4aa3-adf4-b96570083b9c@nomadlogic.org>
 <20231208080929.cfd9fca421fea81d89d2380b@dec.sakura.ne.jp>
 <CANCZdfpN=GwJQT%2BrK=TMUj6niajw-0C=957gV655s2FqJ79nKw@mail.gmail.com>
 <10FD2FC6-1F39-4F7D-8BA8-976ADC0AE37A@iitbombay.org>
 <CAH7qZfuC8WHUpSvsT2tQo-9txcWkTg84GXbGHR5uBXtQaFw1aQ@mail.gmail.com>
Thanks.

It may be worth checking the temperature periodically and warning the user if it is too high (70°C or above, say). Even for devices that throttle internally, a user might wish to know whether the device needs a (better) heatsink.

> On Dec 7, 2023, at 5:02 PM, Maxim Sobolev <sobomax@freebsd.org> wrote:
>
> How quickly it heats up depends on lots of factors. Usually those devices burn some 3-7 watts per stick at 100% load, so maybe this would give you some idea. At least some of them support several toggleable performance modes, which use throttling internally to limit power consumption to a certain level (man nvmecontrol). It helped me recently to make a system stable, which otherwise would hang with timeouts after reaching 70-75C, until I got the chance to take it apart and attach heatsinks to the NVMes. Once the temperature dropped to <= 50C the drives became 100% stable.
>
> -Max
>
> On Thu, Dec 7, 2023, 4:07 PM Bakul Shah <bakul@iitbombay.org> wrote:
>> On Dec 7, 2023, at 3:59 PM, Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >
>> > *Overheating caused hang of NVMe controller or PCI bridge on SSD, or
>> >
>> > Yes. Most drives' firmware resets when it overheats. There might be something
>> > that the pci code can do when this happens to retrain the link, reprogram the
>> > config registers, etc.
>>
>> How quickly can the device heat up? Can it be queried frequently
>> enough to act before it overheats by throttling io?
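The periodic check suggested above could be sketched roughly as follows. This is only an illustration, not anything proposed in the thread: it assumes the SMART/health log page printed by `nvmecontrol logpage -p 2 <device>` contains a composite temperature line of the form `Temperature: 310 K, 36.85 C, 98.33 F`, and the function names, the device name `nvme0`, and the 70 °C default are all hypothetical choices. Check the output on your own system before relying on the parsing.

```python
import re
import subprocess

# Threshold taken from the "70C+ or something" suggestion; tune per device.
WARN_CELSIUS = 70.0

# ASSUMPTION: the composite temperature line looks like
#   Temperature:                    310 K, 36.85 C, 98.33 F
# Per-sensor lines ("Temperature Sensor 1: ...") are deliberately not matched.
TEMP_RE = re.compile(r"^\s*Temperature:\s+(\d+)\s*K", re.MULTILINE)


def parse_temp_celsius(logpage_text):
    """Return the composite temperature in Celsius, or None if not found."""
    match = TEMP_RE.search(logpage_text)
    if match is None:
        return None
    return int(match.group(1)) - 273.15


def temperature_warning(logpage_text, limit=WARN_CELSIUS):
    """Return a warning string if the temperature is at or above limit."""
    temp = parse_temp_celsius(logpage_text)
    if temp is None or temp < limit:
        return None
    return "WARNING: NVMe temperature %.1f C is at or above %.1f C" % (temp, limit)


def read_health_log(device="nvme0"):
    """Run nvmecontrol and return its text output (needs root on FreeBSD)."""
    result = subprocess.run(
        ["nvmecontrol", "logpage", "-p", "2", device],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like `print(temperature_warning(read_health_log("nvme0")))` run from cron every minute or so would be one way to use it. Whether a polling interval like that is frequent enough to act before a drive overheats is exactly the open question in the thread.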
