From nobody Mon Oct 11 05:00:40 2021 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 7FA8017EBAAA for ; Mon, 11 Oct 2021 05:00:58 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: from mail-ua1-x934.google.com (mail-ua1-x934.google.com [IPv6:2607:f8b0:4864:20::934]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4HSRT62nqBz4dN3 for ; Mon, 11 Oct 2021 05:00:58 +0000 (UTC) (envelope-from wlosh@bsdimp.com) Received: by mail-ua1-x934.google.com with SMTP id q13so11361274uaq.2 for ; Sun, 10 Oct 2021 22:00:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bsdimp-com.20210112.gappssmtp.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=03zEoFrzc1//R3vlsBCnn7qprHVlcenWP/hvWy8YEI4=; b=OS/gbMNqCKNVOAuRlipLkuMghJj/UZ0MGJa4J1yiQTazR8x+pl5KoqyYn/9spWIONP fAVQ2zIvrh0XEU1tp7C99xv5IAo8h4dTDyn6yBxl/LPFjDWUGA1ZuzCYOw8SavYCBchM vbPLW9g98H7aq2VDwohjZx2gn9hO2i6HNKU3U3RRLPCOXf8yoGbOFPZlkayra0oGchfh izC3ObX2jO/t7F05NQVn9duzLAl9cxUU/vaI2Xi0V6Gsp3IU+oIMg1WZ03ThRtpfruK7 bOaJS4iRhh4B1hWFx3vxeQ21uCmnbHIAwDjht5fFraNPPkefHmMmmQ6uCqki24V1WTHB Tajg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=03zEoFrzc1//R3vlsBCnn7qprHVlcenWP/hvWy8YEI4=; b=eiu/eguPiwJ+affbusDVH1ApGfCAX0m+I/lHWQRUhfYSFHq1fWZZRzR4zXxjKbP5V1 CnXvqaKpREJ1PA5igMoK6zltztD988WtVNJCL8paxI78DK/ZyjUZyYbh1haWGO1o4mOg V9MvkgtLmEvmzoB2V2tPcU5KCz6L6etUzLNlAurOm+CS4mNFyFUJIAIPR4sqRcWzXkFY XRW+Dl3jRLrImdxgW/E4wW4Nvv0BTa3Bq+2OXAda5pjCMDBdd8Ir16WNXwk8sp506me3 KxWJRPJmJIlLQ3E5gnvgCBI0gdZ719CXl5ejv8vB9UG20C6/mQ2lWhRfy4e1L0fHiuNb lMAQ== X-Gm-Message-State: AOAM531DB2xnwKW6rcm84Mw3oGr5Cw8a97kbjsI8ipGI37nvkUYEolg1 G+qN0wpKbIK2U9EgrlOAchLJJU/ESZUkbniox7+k0g== X-Google-Smtp-Source: ABdhPJzvN89+5WDWcxgXhAZrTZnUpnYTUK4TuGbrQKUAwO3Fk4jZk3ScsFVmc6s811w+EVjKGppZLwoOF6b/xKaAvRc= X-Received: by 2002:ab0:55c9:: with SMTP id w9mr12609700uaa.77.1633928452107; Sun, 10 Oct 2021 22:00:52 -0700 (PDT) List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 References: In-Reply-To: From: Warner Losh Date: Sun, 10 Oct 2021 23:00:40 -0600 Message-ID: Subject: Re: Dell Latitude 7400 - nvme0: Missing interrupt To: Pavel Timofeev Cc: Chuck Tuffli , freebsd-current Content-Type: multipart/alternative; boundary="000000000000fa0e4205ce0c9cba" X-Rspamd-Queue-Id: 4HSRT62nqBz4dN3 X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; none X-Spamd-Result: default: False [-4.00 / 15.00]; REPLY(-4.00)[] X-ThisMailContainsUnwantedMimeParts: Y --000000000000fa0e4205ce0c9cba Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev wrote: > =D1=81=D0=B1, 9 =D0=BE=D0=BA=D1=82. 2021 =D0=B3. =D0=B2 14:59, Warner Los= h : > >> >> >> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev wrote: >> >>> >>> >>> =D0=BF=D1=82, 8 =D0=BE=D0=BA=D1=82. 2021 =D0=B3. =D0=B2 14:49, Warner L= osh : >>> >>>> >>>> >>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev wrote= : >>>> >>>>> >>>>> >>>>> =D1=81=D0=B1, 21 =D0=B0=D0=B2=D0=B3. 2021 =D0=B3. =D0=B2 15:22, Warne= r Losh : >>>>> >>>>>> >>>>>> >>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> Warner Losh : >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Pavel Timofeev : >>>>>>>>> >>>>>>>>> > >>>>>>>>> > Chuck Tuffli : >>>>>>>>> > >>>>>>>>> >> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < >>>>>>>>> timp87@gmail.com> wrote: >>>>>>>>> >> > >>>>>>>>> >> > Hello >>>>>>>>> >> > I've got a Dell Latitude 7400 and tried installing the lates= t >>>>>>>>> >> 14.0-CURRENT >>>>>>>>> >> > (main-n248636-d20e9e02db3) on it. >>>>>>>>> >> > Despite other things the weird one which concerns me is >>>>>>>>> >> > nvme0: Missing interrupt >>>>>>>>> >> > message I get sometimes on the console. >>>>>>>>> >> > It seems like I get it only after the reboot of the laptop, >>>>>>>>> i. e. not >>>>>>>>> >> > getting that message if I power cycle the laptop, at least I >>>>>>>>> haven't >>>>>>>>> >> seen >>>>>>>>> >> > them for now in such cases. >>>>>>>>> >> > So when the laptop is rebooted I can't even take advantage o= f >>>>>>>>> >> > nvmecontrol(8) quickly. >>>>>>>>> >> > Well, it still works, but it takes tens of seconds to return >>>>>>>>> the output. >>>>>>>>> >> ... >>>>>>>>> >> > dmesg when power cycled - >>>>>>>>> >> > >>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ >>>>>>>>> >> > dmesg when rebooted - >>>>>>>>> >> > >>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh >>>>>>>>> >> >>>>>>>>> >> I'm sort of curious about the time stamps for the log messages >>>>>>>>> in the >>>>>>>>> >> failing case. Something like: >>>>>>>>> >> >>>>>>>>> >> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>> >> >>>>>>>>> >> --chuck >>>>>>>>> >> >>>>>>>>> > >>>>>>>>> > Well, I can't see timestamps in the verbose boot log. Am I >>>>>>>>> missing some >>>>>>>>> > configuration for that? >>>>>>>>> > >>>>>>>>> > $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>> > nvme0: mem >>>>>>>>> > >>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff= at device >>>>>>>>> > 0.0 on pci6 >>>>>>>>> > nvme0: attempting to allocate 5 MSI-X vectors (17 supported) >>>>>>>>> > nvme0: using IRQs 133-137 for MSI-X >>>>>>>>> > nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 >>>>>>>>> > nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, >>>>>>>>> MPSMAX 0 >>>>>>>>> > nvme0: Version: 0x00010300: 1.3 >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvme0: Missing interrupt >>>>>>>>> > nvd0: NVMe namespace >>>>>>>>> > GEOM: new disk nvd0 >>>>>>>>> > nvd0: 488386MB (1000215216 512 byte sectors) >>>>>>>>> > >>>>>>>>> >>>>>>>>> >>>>>>>>> Ah, sorry, provided wrong output. >>>>>>>>> Here is what you requested: >>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: mem >>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >>>>>>>>> at device >>>>>>>>> 0.0 on pci6 >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5 >>>>>>>>> MSI-X >>>>>>>>> vectors (17 supported) >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for >>>>>>>>> MSI-X >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES >>>>>>>>> 1023, CQR, >>>>>>>>> TO 20 >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD >>>>>>>>> 0, NSSRS, >>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3 >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: >>>>>>>>> NVMe >>>>>>>>> namespace >>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 >>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512 >>>>>>>>> byte >>>>>>>>> sectors) >>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt >>>>>>>>> >>>>>>>> >>>>>>>> What happens if you set hw.nvme.use_nvd=3D0 and >>>>>>>> hw.cam.nda.nvd_compat=3D1 >>>>>>>> in the boot loader and reboot? Same thing except nda where nvd was= ? >>>>>>>> Or does >>>>>>>> it work? >>>>>>>> >>>>>>>> Something weird is going on in the interrupt assignment, I think, >>>>>>>> but I >>>>>>>> wanted to get any nvd vs nda issues out of the way first. >>>>>>>> >>>>>>>> Warner >>>>>>>> >>>>>>> >>>>>>> Do you mean kern.cam.nda.nvd_compat instead of hw.cam.nda.nvd_compa= t? >>>>>>> kern.cam.nda.nvd_compat is 1 by default now. >>>>>>> >>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I still s= ee >>>>>>> nvme0: Missing interrupt >>>>>>> and now also >>>>>>> Root mount waiting for: CAM >>>>>>> messages besides those >>>>>>> >>>>>> >>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=3D1 by defau= lt >>>>>> these >>>>>> days. >>>>>> >>>>>> I'll take a look on monday starting at the differences in interrupt >>>>>> assignment that >>>>>> are apparent when you cold boot vs reboot. >>>>>> >>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also >>>>>> didn't really >>>>>> expect it to be. >>>>>> >>>>>> Warner >>>>>> >>>>>> >>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even >>>>> worse now. >>>>> So clean poweron works as before. >>>>> But if rebooted nvme drive refuses to work, while before the code >>>>> upgrade it was just complaining about missing interrupts. >>>>> >>>>> currently dmesg show this: >>>>> nvme0: mem >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at = device >>>>> 0.0 on pci6 >>>>> nvd0: NVMe namespace >>>>> nvd0: 488386MB (1000215216 512 byte sectors) >>>>> nvme0: mem >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at = device >>>>> 0.0 on pci6 >>>>> >>>> >>>> Why is this showing up twice? Or is everything above this line left >>>> over from the first, working boot? >>>> >>>> >>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 >>>>> nvme0: timeout with nothing complete, resetting >>>>> nvme0: Resetting controller due to a timeout. >>>>> nvme0: RECOVERY_WAITING >>>>> nvme0: resetting controller >>>>> nvme0: aborting outstanding admin command >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:000000= 00 >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>> nvme0: nvme_identify_controller failed! >>>>> nvme0: waiting >>>>> >>>> >>>> Clearly something bad is going on with the drive here... We looked int= o >>>> the completion queues when we didn't get an interrupt and there was no= thing >>>> complete there.... >>>> >>>> The only thing I can think of is that this means there's a phase error >>>> between the drive and the system. I recently removed a second reset an= d >>>> made it an option NVME_2X_RESET. Can you see if adding >>>> 'options NVME_2X_RESET' to your kernel config fixes this? >>>> >>>> Warner >>>> >>>> >>>>> nvme0: mem >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at = device >>>>> 0.0 on pci6 >>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 >>>>> nvme0: timeout with nothing complete, resetting >>>>> nvme0: Resetting controller due to a timeout. >>>>> nvme0: RECOVERY_WAITING >>>>> nvme0: resetting controller >>>>> nvme0: aborting outstanding admin command >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:000000= 00 >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>> nvme0: nvme_identify_controller failed! >>>>> nvme0: waiting >>>>> >>>>> >>> >>> Sorry, it's showing twice due to multiple reboots. For one boot it's >>> like: >>> nvme0: mem >>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at de= vice >>> 0.0 on pci6 >>> nvme0: RECOVERY_START 9633303481 vs 9365971423 >>> nvme0: timeout with nothing complete, resetting >>> nvme0: Resetting controller due to a timeout. >>> nvme0: RECOVERY_WAITING >>> nvme0: resetting controller >>> nvme0: aborting outstanding admin command >>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000 >>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>> nvme0: nvme_identify_controller failed! >>> nvme0: waiting >>> >>> Well, neither Windows not Linux have any problems with the device. I >>> understand they may be hiding it or workaround somehow. >>> >> >> Yea, I'm trying to figure out why your machine is different than mine, >> and what Windows or Linux do that is different. It may be dodgy hardware= , >> but others have no trouble... >> >> I'll try setting NVME_2X_RESET in the kernel config and report back in a >>> while. >>> >> >> Thanks. If it helps, that tells me something. If it doesn't, that tells >> me something else. >> >> I suspect that it is somewhere else in the system, tbh, but I need to >> find it systematically. >> >> Warner >> > > Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed > anything. I. e it didn't help. > While it would have been nice to have this be the fix, I'm not that surprised either. It was the biggest change of late, apart from the big re-arrangement that I'd done. So the other changes have moved from the occasional missing interrupt message (which the old code would get when a command wasn't completed in the timeou= t period, but that we found to be done when we scanned the completion queue. Now the device is detected fine (as before), but then doesn't do I/O at all (including not completing the identify command!) and is worse. This is unexpected and I'm trying understand what happens on reboot that 'changes'the working state and why the new code behaves so differently. Warner --000000000000fa0e4205ce0c9cba--