Date: Fri, 14 Jan 2022 11:53:06 -0700 From: Pavel Timofeev <timp87@gmail.com> To: Alexander Motin <mavbsd@gmail.com> Cc: Warner Losh <imp@bsdimp.com>, Chuck Tuffli <ctuffli@gmail.com>, freebsd-current <freebsd-current@freebsd.org> Subject: Re: Dell Latitude 7400 - nvme0: Missing interrupt Message-ID: <CAAoTqfstW3Q%2BKbG%2BuuOz848Wvg2t4hUjjGaMNBstNHb7uUkiGg@mail.gmail.com> In-Reply-To: <CAAoTqfv6SrdWbHZWzYtexr60gy5VOm8t3oYyo58T-5Lid=Ojkw@mail.gmail.com> References: <CAAoTqftwbxBbf96surAWrwqKLDrHAePQnbv_wksPGbKPXedsyw@mail.gmail.com> <CAKAYmM%2B9asxahYJpJB5R%2Byn-Eosmw3xsdOFBTMeyAyDvFkOB1Q@mail.gmail.com> <CAAoTqfuxDCG%2B8agV81NpLdPKJdBOBz0X-FYiApFibx2bjf9Zgg@mail.gmail.com> <CAAoTqfsfvGyhFduoQ3U%2B-Pdcwpp1OVk3aWpDc1Y=WYtgG_a_tA@mail.gmail.com> <CANCZdfocFNXSzax6E=Rhku4iQSu-MGLv7YBifppKcgCnV8eTbw@mail.gmail.com> <CAAoTqfvzxLHF_KGLQQBQmR8cVHPb5RP36Nn7zd0-wJ=DEJNYVQ@mail.gmail.com> <CANCZdfpmSV19bhbyFuNbKtca4uA3x34c4_C1hjX4%2BE=i3Pfamw@mail.gmail.com> <CAAoTqfvp7FX0Hcaqj4o0VrMsdHRW0pFiTOtcOW19fS%2BWps06Zg@mail.gmail.com> <CANCZdfqAHTEBJ=SchTDshKEfOLo2cNGrhRJqk-Lt7_KyrKxKnQ@mail.gmail.com> <CAAoTqfvmZCd9eVZZ93N=qrQvrD%2ByysTqpEmE3nqf8xEWiGnaMQ@mail.gmail.com> <CANCZdfq_oWdvTn_ZQXsfv2qR3sUFXB88VW6aff3JEC41XVrdzg@mail.gmail.com> <CAAoTqfvcVVs3uQL-kNxqKpAzYifr1_K0f=Yh=hE5MPW%2Bn5AgRA@mail.gmail.com> <CANCZdfqnkYuEr5bYeTWR=VVXkHF_RCEgaAw-7OBRC_XP=mk7fA@mail.gmail.com> <CANCZdfrouhaehS86BDu3t5PHuYkiujaZqnkvV4yd6RjKQ9VAWA@mail.gmail.com> <4fa413f4-f167-ce1d-ce2f-a2a05a34dc32@gmail.com> <CAAoTqfv6SrdWbHZWzYtexr60gy5VOm8t3oYyo58T-5Lid=Ojkw@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
[-- Attachment #1 --] вс, 17 окт. 2021 г. в 17:52, Pavel Timofeev <timp87@gmail.com>: > > > вс, 17 окт. 2021 г. в 11:19, Alexander Motin <mavbsd@gmail.com>: > >> It may be a noise, but comparing logs I see in reboot case also >> "acpi_ec0: not getting interrupts, switched to polled mode". I am >> thinking whether the problem may be caused not by SSD, but by some >> resource conflict/misconfiguration in the system. Pavel, can you >> compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any >> differences? Are you running the latest BIOS? >> >> On 12.10.2021 15:56, Warner Losh wrote: >> > One piece of data that would be good to have: >> > >> > nvmecontrol identify nvme0 >> > >> > There's an optional feature that none of my drives have, but that the >> Linux >> > driver does oddly >> > weird things when enabled. The output of that command will help me >> > understand if that may >> > be in play. Maybe we need to do oddly weird things too :) >> > >> > Warner >> > >> > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote: >> > >> >> >> >> >> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com> >> wrote: >> >> >> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>: >> >>> >> >>>> >> >>>> >> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> >> wrote: >> >>>> >> >>>>> >> >>>>> >> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>: >> >>>>> >> >>>>>> >> >>>>>> >> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com> >> >>>>>> wrote: >> >>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>: >> >>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com >> > >> >>>>>>>> wrote: >> >>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Warner Losh <imp@bsdimp.com>: >> >>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev < >> timp87@gmail.com> >> >>>>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Pavel Timofeev <timp87@gmail.com>: >> >>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Chuck Tuffli <ctuffli@gmail.com>: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < >> >>>>>>>>>>> timp87@gmail.com> wrote: >> >>>>>>>>>>>>>> >> >>>>>>>>>>>>>> Hello >> >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the >> latest >> >>>>>>>>>>>>> 14.0-CURRENT >> >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it. >> >>>>>>>>>>>>>> Despite other things the weird one which concerns me is >> >>>>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>>>> message I get sometimes on the console. >> >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop, >> >>>>>>>>>>> i. e. not >> >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least >> I >> >>>>>>>>>>> haven't >> >>>>>>>>>>>>> seen >> >>>>>>>>>>>>>> them for now in such cases. >> >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage >> of >> >>>>>>>>>>>>>> nvmecontrol(8) quickly. >> >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to >> return >> >>>>>>>>>>> the output. >> >>>>>>>>>>>>> ... >> >>>>>>>>>>>>>> dmesg when power cycled - >> >>>>>>>>>>>>>> >> >>>>>>>>>>> >> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ >> >>>>>>>>>>>>>> dmesg when rebooted - >> >>>>>>>>>>>>>> >> >>>>>>>>>>> >> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log >> messages >> >>>>>>>>>>> in the >> >>>>>>>>>>>>> failing case. Something like: >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> --chuck >> >>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I >> >>>>>>>>>>> missing some >> >>>>>>>>>>>> configuration for that? >> >>>>>>>>>>>> >> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>>>>>>> >> >>>>>>>>>>> >> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >> >>>>>>>>>>>> 0.0 on pci6 >> >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported) >> >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X >> >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 >> >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, >> >>>>>>>>>>> MPSMAX 0 >> >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3 >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >> >>>>>>>>>>>> GEOM: new disk nvd0 >> >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> Ah, sorry, provided wrong output. >> >>>>>>>>>>> Here is what you requested: >> >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> >> mem >> >>>>>>>>>>> >> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> >>>>>>>>>>> at device >> >>>>>>>>>>> 0.0 on pci6 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to >> allocate 5 >> >>>>>>>>>>> MSI-X >> >>>>>>>>>>> vectors (17 supported) >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for >> >>>>>>>>>>> MSI-X >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: >> MQES >> >>>>>>>>>>> 1023, CQR, >> >>>>>>>>>>> TO 20 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: >> DSTRD >> >>>>>>>>>>> 0, NSSRS, >> >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: >> 1.3 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix >> >>>>>>>>>>> 512GB> NVMe >> >>>>>>>>>>> namespace >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 >> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 >> 512 >> >>>>>>>>>>> byte >> >>>>>>>>>>> sectors) >> >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and >> >>>>>>>>>> hw.cam.nda.nvd_compat=1 >> >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd >> >>>>>>>>>> was? Or does >> >>>>>>>>>> it work? >> >>>>>>>>>> >> >>>>>>>>>> Something weird is going on in the interrupt assignment, I >> think, >> >>>>>>>>>> but I >> >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first. >> >>>>>>>>>> >> >>>>>>>>>> Warner >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead >> >>>>>>>>> of hw.cam.nda.nvd_compat? >> >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now. >> >>>>>>>>> >> >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I >> still >> >>>>>>>>> see >> >>>>>>>>> nvme0: Missing interrupt >> >>>>>>>>> and now also >> >>>>>>>>> Root mount waiting for: CAM >> >>>>>>>>> messages besides those >> >>>>>>>>> >> >>>>>>>> >> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by >> default >> >>>>>>>> these >> >>>>>>>> days. >> >>>>>>>> >> >>>>>>>> I'll take a look on monday starting at the differences in >> interrupt >> >>>>>>>> assignment that >> >>>>>>>> are apparent when you cold boot vs reboot. >> >>>>>>>> >> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also >> >>>>>>>> didn't really >> >>>>>>>> expect it to be. >> >>>>>>>> >> >>>>>>>> Warner >> >>>>>>>> >> >>>>>>>> >> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even >> >>>>>>> worse now. >> >>>>>>> So clean poweron works as before. >> >>>>>>> But if rebooted nvme drive refuses to work, while before the code >> >>>>>>> upgrade it was just complaining about missing interrupts. >> >>>>>>> >> >>>>>>> currently dmesg show this: >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace >> >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> >> >>>>>> >> >>>>>> Why is this showing up twice? Or is everything above this line left >> >>>>>> over from the first, working boot? >> >>>>>> >> >>>>>> >> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 >> >>>>>>> nvme0: timeout with nothing complete, resetting >> >>>>>>> nvme0: Resetting controller due to a timeout. >> >>>>>>> nvme0: RECOVERY_WAITING >> >>>>>>> nvme0: resetting controller >> >>>>>>> nvme0: aborting outstanding admin command >> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> >>>>>>> cdw11:00000000 >> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>>>> nvme0: nvme_identify_controller failed! >> >>>>>>> nvme0: waiting >> >>>>>>> >> >>>>>> >> >>>>>> Clearly something bad is going on with the drive here... We looked >> >>>>>> into the completion queues when we didn't get an interrupt and >> there was >> >>>>>> nothing complete there.... >> >>>>>> >> >>>>>> The only thing I can think of is that this means there's a phase >> error >> >>>>>> between the drive and the system. I recently removed a second >> reset and >> >>>>>> made it an option NVME_2X_RESET. Can you see if adding >> >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this? >> >>>>>> >> >>>>>> Warner >> >>>>>> >> >>>>>> >> >>>>>>> nvme0: <Generic NVMe Device> mem >> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>>>> 0.0 on pci6 >> >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 >> >>>>>>> nvme0: timeout with nothing complete, resetting >> >>>>>>> nvme0: Resetting controller due to a timeout. >> >>>>>>> nvme0: RECOVERY_WAITING >> >>>>>>> nvme0: resetting controller >> >>>>>>> nvme0: aborting outstanding admin command >> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> >>>>>>> cdw11:00000000 >> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>>>> nvme0: nvme_identify_controller failed! >> >>>>>>> nvme0: waiting >> >>>>>>> >> >>>>>>> >> >>>>> >> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's >> >>>>> like: >> >>>>> nvme0: <Generic NVMe Device> mem >> >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >> at device >> >>>>> 0.0 on pci6 >> >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423 >> >>>>> nvme0: timeout with nothing complete, resetting >> >>>>> nvme0: Resetting controller due to a timeout. >> >>>>> nvme0: RECOVERY_WAITING >> >>>>> nvme0: resetting controller >> >>>>> nvme0: aborting outstanding admin command >> >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >> cdw11:00000000 >> >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >> >>>>> nvme0: nvme_identify_controller failed! >> >>>>> nvme0: waiting >> >>>>> >> >>>>> Well, neither Windows not Linux have any problems with the device. I >> >>>>> understand they may be hiding it or workaround somehow. >> >>>>> >> >>>> >> >>>> Yea, I'm trying to figure out why your machine is different than >> mine, >> >>>> and what Windows or Linux do that is different. It may be dodgy >> hardware, >> >>>> but others have no trouble... >> >>>> >> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back >> in a >> >>>>> while. >> >>>>> >> >>>> >> >>>> Thanks. If it helps, that tells me something. If it doesn't, that >> tells >> >>>> me something else. >> >>>> >> >>>> I suspect that it is somewhere else in the system, tbh, but I need to >> >>>> find it systematically. >> >>>> >> >>>> Warner >> >>>> >> >>> >> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't >> changed >> >>> anything. I. e it didn't help. >> >>> >> >> >> >> While it would have been nice to have this be the fix, I'm not that >> >> surprised either. >> >> It was the biggest change of late, apart from the big re-arrangement >> that >> >> I'd done. >> >> >> >> So the other changes have moved from the occasional missing interrupt >> >> message >> >> (which the old code would get when a command wasn't completed in the >> >> timeout >> >> period, but that we found to be done when we scanned the completion >> queue. >> >> Now >> >> the device is detected fine (as before), but then doesn't do I/O at all >> >> (including not >> >> completing the identify command!) and is worse. This is unexpected and >> I'm >> >> trying >> >> understand what happens on reboot that 'changes'the working state and >> why >> >> the >> >> new code behaves so differently. >> >> >> >> Warner >> >> >> > >> >> -- >> Alexander Motin >> > > > Thanks for the reply. > It's using the latest firmware. This is the first thing I do in such case. > > > Attaching devinfo and lspci output. > These are diffs showing the difference between clean boot and a reboot: > > $ diff -u devinfo.ok devinfo.nok > --- devinfo.ok 2021-10-17 17:48:07.964346000 -0600 > +++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600 > @@ -214,10 +214,6 @@ > nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c > subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0 > handle=\_SB_.PCI0.RP13.PXSX > Interrupt request lines: > 0x85 > - 0x86 > - 0x87 > - 0x88 > - 0x89 > pcib7 memory window: > 0xcc100000-0xcc103fff > 0xcc104000-0xcc104fff > > $ diff -u lspci.ok lspci.nok > --- lspci.ok 2021-10-17 17:48:15.894470000 -0600 > +++ lspci.nok 2021-10-17 17:48:15.341379000 -0600 > @@ -132,7 +132,7 @@ > Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA > PME(D0+,D1-,D2-,D3hot+,D3cold+) > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- > Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+ > - Address: 00000000fee06000 Data: 0033 > + Address: 00000000fee06000 Data: 0034 > Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00 > DevCap: MaxPayload 128 bytes, PhantFunc 0 > ExtTag- RBE- FLReset+ > > Hi, I hope everyone is doing well. So several BIOS updates passed, now the BIOS version 1.15.1, but it works the same. At least on CURRENT built several days ago (main-n252414-0e8181c0123). What is interesting iwn(4) and iwlwifi(4) work the same way - only full power cycle makes wifi functional, simple reboot brakes it in most cases. Does anybody have any idea? [-- Attachment #2 --] <div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">вс, 17 окт. 2021 г. в 17:52, Pavel Timofeev <<a href="mailto:timp87@gmail.com">timp87@gmail.com</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">вс, 17 окт. 2021 г. в 11:19, Alexander Motin <<a href="mailto:mavbsd@gmail.com" target="_blank">mavbsd@gmail.com</a>>:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It may be a noise, but comparing logs I see in reboot case also<br> "acpi_ec0: not getting interrupts, switched to polled mode". I am<br> thinking whether the problem may be caused not by SSD, but by some<br> resource conflict/misconfiguration in the system. Pavel, can you<br> compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any<br> differences? Are you running the latest BIOS?<br> <br> On 12.10.2021 15:56, Warner Losh wrote:<br> > One piece of data that would be good to have:<br> > <br> > nvmecontrol identify nvme0<br> > <br> > There's an optional feature that none of my drives have, but that the Linux<br> > driver does oddly<br> > weird things when enabled. The output of that command will help me<br> > understand if that may<br> > be in play. Maybe we need to do oddly weird things too :)<br> > <br> > Warner<br> > <br> > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank">imp@bsdimp.com</a>> wrote:<br> > <br> >><br> >><br> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>> wrote:<br> >><br> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank">imp@bsdimp.com</a>>:<br> >>><br> >>>><br> >>>><br> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>> wrote:<br> >>>><br> >>>>><br> >>>>><br> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank">imp@bsdimp.com</a>>:<br> >>>>><br> >>>>>><br> >>>>>><br> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>><br> >>>>>> wrote:<br> >>>>>><br> >>>>>>><br> >>>>>>><br> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank">imp@bsdimp.com</a>>:<br> >>>>>>><br> >>>>>>>><br> >>>>>>>><br> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>><br> >>>>>>>> wrote:<br> >>>>>>>><br> >>>>>>>>><br> >>>>>>>>><br> >>>>>>>>> Warner Losh <<a href="mailto:imp@bsdimp.com" target="_blank">imp@bsdimp.com</a>>:<br> >>>>>>>>><br> >>>>>>>>>><br> >>>>>>>>>><br> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>><br> >>>>>>>>>> wrote:<br> >>>>>>>>>><br> >>>>>>>>>>> Pavel Timofeev <<a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>>:<br> >>>>>>>>>>><br> >>>>>>>>>>>><br> >>>>>>>>>>>> Chuck Tuffli <<a href="mailto:ctuffli@gmail.com" target="_blank">ctuffli@gmail.com</a>>:<br> >>>>>>>>>>>><br> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <<br> >>>>>>>>>>> <a href="mailto:timp87@gmail.com" target="_blank">timp87@gmail.com</a>> wrote:<br> >>>>>>>>>>>>>><br> >>>>>>>>>>>>>> Hello<br> >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the latest<br> >>>>>>>>>>>>> 14.0-CURRENT<br> >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it.<br> >>>>>>>>>>>>>> Despite other things the weird one which concerns me is<br> >>>>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>>>> message I get sometimes on the console.<br> >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop,<br> >>>>>>>>>>> i. e. not<br> >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least I<br> >>>>>>>>>>> haven't<br> >>>>>>>>>>>>> seen<br> >>>>>>>>>>>>>> them for now in such cases.<br> >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage of<br> >>>>>>>>>>>>>> nvmecontrol(8) quickly.<br> >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to return<br> >>>>>>>>>>> the output.<br> >>>>>>>>>>>>> ...<br> >>>>>>>>>>>>>> dmesg when power cycled -<br> >>>>>>>>>>>>>><br> >>>>>>>>>>> <a href="https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ" rel="noreferrer" target="_blank">https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ</a><br> >>>>>>>>>>>>>> dmesg when rebooted -<br> >>>>>>>>>>>>>><br> >>>>>>>>>>> <a href="https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh" rel="noreferrer" target="_blank">https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh</a><br> >>>>>>>>>>>>><br> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log messages<br> >>>>>>>>>>> in the<br> >>>>>>>>>>>>> failing case. Something like:<br> >>>>>>>>>>>>><br> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages<br> >>>>>>>>>>>>><br> >>>>>>>>>>>>> --chuck<br> >>>>>>>>>>>>><br> >>>>>>>>>>>><br> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I<br> >>>>>>>>>>> missing some<br> >>>>>>>>>>>> configuration for that?<br> >>>>>>>>>>>><br> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages<br> >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem<br> >>>>>>>>>>>><br> >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device<br> >>>>>>>>>>>> 0.0 on pci6<br> >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported)<br> >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X<br> >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20<br> >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,<br> >>>>>>>>>>> MPSMAX 0<br> >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace<br> >>>>>>>>>>>> GEOM: new disk nvd0<br> >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)<br> >>>>>>>>>>>><br> >>>>>>>>>>><br> >>>>>>>>>>><br> >>>>>>>>>>> Ah, sorry, provided wrong output.<br> >>>>>>>>>>> Here is what you requested:<br> >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> mem<br> >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff<br> >>>>>>>>>>> at device<br> >>>>>>>>>>> 0.0 on pci6<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5<br> >>>>>>>>>>> MSI-X<br> >>>>>>>>>>> vectors (17 supported)<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for<br> >>>>>>>>>>> MSI-X<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES<br> >>>>>>>>>>> 1023, CQR,<br> >>>>>>>>>>> TO 20<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD<br> >>>>>>>>>>> 0, NSSRS,<br> >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix<br> >>>>>>>>>>> 512GB> NVMe<br> >>>>>>>>>>> namespace<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0<br> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512<br> >>>>>>>>>>> byte<br> >>>>>>>>>>> sectors)<br> >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt<br> >>>>>>>>>>><br> >>>>>>>>>><br> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and<br> >>>>>>>>>> hw.cam.nda.nvd_compat=1<br> >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd<br> >>>>>>>>>> was? Or does<br> >>>>>>>>>> it work?<br> >>>>>>>>>><br> >>>>>>>>>> Something weird is going on in the interrupt assignment, I think,<br> >>>>>>>>>> but I<br> >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.<br> >>>>>>>>>><br> >>>>>>>>>> Warner<br> >>>>>>>>>><br> >>>>>>>>><br> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead<br> >>>>>>>>> of hw.cam.nda.nvd_compat?<br> >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.<br> >>>>>>>>><br> >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I still<br> >>>>>>>>> see<br> >>>>>>>>> nvme0: Missing interrupt<br> >>>>>>>>> and now also<br> >>>>>>>>> Root mount waiting for: CAM<br> >>>>>>>>> messages besides those<br> >>>>>>>>><br> >>>>>>>><br> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by default<br> >>>>>>>> these<br> >>>>>>>> days.<br> >>>>>>>><br> >>>>>>>> I'll take a look on monday starting at the differences in interrupt<br> >>>>>>>> assignment that<br> >>>>>>>> are apparent when you cold boot vs reboot.<br> >>>>>>>><br> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also<br> >>>>>>>> didn't really<br> >>>>>>>> expect it to be.<br> >>>>>>>><br> >>>>>>>> Warner<br> >>>>>>>><br> >>>>>>>><br> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even<br> >>>>>>> worse now.<br> >>>>>>> So clean poweron works as before.<br> >>>>>>> But if rebooted nvme drive refuses to work, while before the code<br> >>>>>>> upgrade it was just complaining about missing interrupts.<br> >>>>>>><br> >>>>>>> currently dmesg show this:<br> >>>>>>> nvme0: <Generic NVMe Device> mem<br> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device<br> >>>>>>> 0.0 on pci6<br> >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace<br> >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)<br> >>>>>>> nvme0: <Generic NVMe Device> mem<br> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device<br> >>>>>>> 0.0 on pci6<br> >>>>>>><br> >>>>>><br> >>>>>> Why is this showing up twice? Or is everything above this line left<br> >>>>>> over from the first, working boot?<br> >>>>>><br> >>>>>><br> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288<br> >>>>>>> nvme0: timeout with nothing complete, resetting<br> >>>>>>> nvme0: Resetting controller due to a timeout.<br> >>>>>>> nvme0: RECOVERY_WAITING<br> >>>>>>> nvme0: resetting controller<br> >>>>>>> nvme0: aborting outstanding admin command<br> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001<br> >>>>>>> cdw11:00000000<br> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0<br> >>>>>>> nvme0: nvme_identify_controller failed!<br> >>>>>>> nvme0: waiting<br> >>>>>>><br> >>>>>><br> >>>>>> Clearly something bad is going on with the drive here... We looked<br> >>>>>> into the completion queues when we didn't get an interrupt and there was<br> >>>>>> nothing complete there....<br> >>>>>><br> >>>>>> The only thing I can think of is that this means there's a phase error<br> >>>>>> between the drive and the system. I recently removed a second reset and<br> >>>>>> made it an option NVME_2X_RESET. Can you see if adding<br> >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?<br> >>>>>><br> >>>>>> Warner<br> >>>>>><br> >>>>>><br> >>>>>>> nvme0: <Generic NVMe Device> mem<br> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device<br> >>>>>>> 0.0 on pci6<br> >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561<br> >>>>>>> nvme0: timeout with nothing complete, resetting<br> >>>>>>> nvme0: Resetting controller due to a timeout.<br> >>>>>>> nvme0: RECOVERY_WAITING<br> >>>>>>> nvme0: resetting controller<br> >>>>>>> nvme0: aborting outstanding admin command<br> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001<br> >>>>>>> cdw11:00000000<br> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0<br> >>>>>>> nvme0: nvme_identify_controller failed!<br> >>>>>>> nvme0: waiting<br> >>>>>>><br> >>>>>>><br> >>>>><br> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's<br> >>>>> like:<br> >>>>> nvme0: <Generic NVMe Device> mem<br> >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device<br> >>>>> 0.0 on pci6<br> >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423<br> >>>>> nvme0: timeout with nothing complete, resetting<br> >>>>> nvme0: Resetting controller due to a timeout.<br> >>>>> nvme0: RECOVERY_WAITING<br> >>>>> nvme0: resetting controller<br> >>>>> nvme0: aborting outstanding admin command<br> >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000<br> >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0<br> >>>>> nvme0: nvme_identify_controller failed!<br> >>>>> nvme0: waiting<br> >>>>><br> >>>>> Well, neither Windows not Linux have any problems with the device. I<br> >>>>> understand they may be hiding it or workaround somehow.<br> >>>>><br> >>>><br> >>>> Yea, I'm trying to figure out why your machine is different than mine,<br> >>>> and what Windows or Linux do that is different. It may be dodgy hardware,<br> >>>> but others have no trouble...<br> >>>><br> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back in a<br> >>>>> while.<br> >>>>><br> >>>><br> >>>> Thanks. If it helps, that tells me something. If it doesn't, that tells<br> >>>> me something else.<br> >>>><br> >>>> I suspect that it is somewhere else in the system, tbh, but I need to<br> >>>> find it systematically.<br> >>>><br> >>>> Warner<br> >>>><br> >>><br> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed<br> >>> anything. I. e it didn't help.<br> >>><br> >><br> >> While it would have been nice to have this be the fix, I'm not that<br> >> surprised either.<br> >> It was the biggest change of late, apart from the big re-arrangement that<br> >> I'd done.<br> >><br> >> So the other changes have moved from the occasional missing interrupt<br> >> message<br> >> (which the old code would get when a command wasn't completed in the<br> >> timeout<br> >> period, but that we found to be done when we scanned the completion queue.<br> >> Now<br> >> the device is detected fine (as before), but then doesn't do I/O at all<br> >> (including not<br> >> completing the identify command!) and is worse. This is unexpected and I'm<br> >> trying<br> >> understand what happens on reboot that 'changes'the working state and why<br> >> the<br> >> new code behaves so differently.<br> >><br> >> Warner<br> >><br> > <br> <br> -- <br> Alexander Motin<br></blockquote><div><br></div><div><br></div><div>Thanks for the reply.</div><div>It's using the latest firmware. This is the first thing I do in such case.</div><div><br></div><div><br></div><div>Attaching devinfo and lspci output.</div><div>These are diffs showing the difference between clean boot and a reboot:<br><br></div><div>$ diff -u devinfo.ok devinfo.nok <br>--- devinfo.ok 2021-10-17 17:48:07.964346000 -0600<br>+++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600<br>@@ -214,10 +214,6 @@<br> nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0 handle=\_SB_.PCI0.RP13.PXSX<br> Interrupt request lines:<br> 0x85<br>- 0x86<br>- 0x87<br>- 0x88<br>- 0x89<br> pcib7 memory window:<br> 0xcc100000-0xcc103fff<br> 0xcc104000-0xcc104fff<br></div><div><br></div><div>$ diff -u lspci.ok lspci.nok <br>--- lspci.ok 2021-10-17 17:48:15.894470000 -0600<br>+++ lspci.nok 2021-10-17 17:48:15.341379000 -0600<br>@@ -132,7 +132,7 @@<br> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)<br> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-<br> Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+<br>- Address: 00000000fee06000 Data: 0033<br>+ Address: 00000000fee06000 Data: 0034<br> Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00<br> DevCap: MaxPayload 128 bytes, PhantFunc 0<br> ExtTag- RBE- FLReset+<br></div><div><br></div></div></div></blockquote><div><br></div><div>Hi,</div><div>I hope everyone is doing well.</div><div>So several BIOS updates passed, now the BIOS version 1.15.1, but it works the same. At least on CURRENT built several days ago (main-n252414-0e8181c0123).</div><div>What is interesting iwn(4) and iwlwifi(4) work the same way - only full power cycle makes wifi functional, simple reboot brakes it in most cases.</div><div>Does anybody have any idea?</div></div></div>
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAAoTqfstW3Q%2BKbG%2BuuOz848Wvg2t4hUjjGaMNBstNHb7uUkiGg>
