From nobody Sun Oct 17 17:25:42 2021 X-Original-To: freebsd-current@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 0775017F3FCA for ; Sun, 17 Oct 2021 17:26:02 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-qt1-x830.google.com (mail-qt1-x830.google.com [IPv6:2607:f8b0:4864:20::830]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4HXRk124rGz4RCW for ; Sun, 17 Oct 2021 17:26:01 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: by mail-qt1-x830.google.com with SMTP id o12so13355355qtq.7 for ; Sun, 17 Oct 2021 10:26:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=sender:subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=yQcH7AjWknzl3HrYHTt3ctq0oqEdc5hLcePYzH993+g=; b=nyYVczny8gNpo7Kc4CW7YVH44vAug19ayWqCDsQ720czrLEzNL9XTWmSScHb8s3zoN cBjU+KRED9+W8RLZtl8LfVhY7des6xkAn5FMBLDM5dyWyf4Ij5NrJ288lwfLYGhAapvp bIn/Nki9e5GrWG6HuF4Jzuw5Nz4wAk6aFAL0fOcE2+/9PTjRw7iNPRCM+ILtxqOBCaUH QMUQVPwwy7tG4Iwjsn86xcKfX8HeFTI8g7LlTL7HUPbnPM/5iFKxrNHcyadqYNHn6hyE odOlw14H+GutH7uTEW20j6GyYgLHXJpn6nLc2gtOK0R1yPwBz1IgCOwENLq9e8vGJvDz 6LuA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:sender:subject:to:cc:references:from:message-id :date:user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=yQcH7AjWknzl3HrYHTt3ctq0oqEdc5hLcePYzH993+g=; b=Z4H5giRXL1MAmmTX9ztAy/mPtnnPkxWqnHvzh3MCflUnMOE5WYJIbA6fJJ6MblrIFw wuavJnMAjmWvoJBrrBwrvRDGJE37FUq5Nesx9Xj8ef83yEiZA3w3/vlBjk+rejn6jtoP IS9JJiWEBhmcE/jHk0N0CHiVmu/4FrtRVDBOXcHlIUYuTzQkBFUof1t6sid1OhvrdYpB F1uD0D0fZyteI83D/GymcWuRuDEcmFKoxRqBJikO3u/3KokDKAOVHiDmBF15NPVzhCE9 k9+ucwKriEZrRiM3SRZmJ9WHZVcFEPS9WIc9XlaANJaj/+5DcPOOh3OCzh3d4ssDq1w7 TpXQ== X-Gm-Message-State: AOAM531h4qfSXi9j/SA+ltzp2YpVSD2v3py7Hxe9tREn5sHKl+0ZseLd e9u3nY0E20F36BfO9fCW0DNEQU4SWMHTR4Tw X-Google-Smtp-Source: ABdhPJx/Pg93wtuBR2aQSL3zCWs7t0QmcwyBxFzlrhhONnrDqWsD6/djFx1UXevNb87BnAbwkj0I4Q== X-Received: by 2002:ac8:5794:: with SMTP id v20mr9157844qta.243.1634491554553; Sun, 17 Oct 2021 10:25:54 -0700 (PDT) Received: from spectre.mavhome.dp.ua ([8.46.75.2]) by smtp.gmail.com with ESMTPSA id c5sm5375637qkm.10.2021.10.17.10.25.48 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 17 Oct 2021 10:25:54 -0700 (PDT) Subject: Re: Dell Latitude 7400 - nvme0: Missing interrupt To: Warner Losh , Pavel Timofeev Cc: Chuck Tuffli , freebsd-current References: From: Alexander Motin Message-ID: <3712e2a7-1180-9918-3af2-67531a9ead9f@FreeBSD.org> Date: Sun, 17 Oct 2021 13:25:42 -0400 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 List-Id: Discussions about the use of FreeBSD-current List-Archive: https://lists.freebsd.org/archives/freebsd-current List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-current@freebsd.org MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4HXRk124rGz4RCW X-Spamd-Bar: - Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b=nyYVczny; dmarc=none; spf=pass (mx1.freebsd.org: domain of mavbsd@gmail.com designates 2607:f8b0:4864:20::830 as permitted sender) smtp.mailfrom=mavbsd@gmail.com X-Spamd-Result: default: False [-1.20 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36:c]; RCVD_COUNT_THREE(0.00)[3]; TO_DN_ALL(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; NEURAL_HAM_SHORT(-1.00)[-1.000]; FREEMAIL_TO(0.00)[bsdimp.com,gmail.com]; FORGED_SENDER(0.30)[mav@FreeBSD.org,mavbsd@gmail.com]; RECEIVED_SPAMHAUS_PBL(0.00)[8.46.75.2:received]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[gmail.com]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; MID_RHS_MATCH_FROM(0.00)[]; FROM_NEQ_ENVFROM(0.00)[mav@FreeBSD.org,mavbsd@gmail.com]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; FROM_HAS_DN(0.00)[]; RCPT_COUNT_THREE(0.00)[4]; MIME_GOOD(-0.10)[text/plain]; PREVIOUSLY_DELIVERED(0.00)[freebsd-current@freebsd.org]; DMARC_NA(0.00)[FreeBSD.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_SPAM_LONG(1.00)[1.000]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::830:from]; FREEMAIL_CC(0.00)[gmail.com,freebsd.org]; RCVD_TLS_ALL(0.00)[] X-ThisMailContainsUnwantedMimeParts: N It may be a noise, but comparing logs I see in reboot case also "acpi_ec0: not getting interrupts, switched to polled mode". I am thinking whether the problem may be caused not by SSD, but by some resource conflict/misconfiguration in the system. Pavel, can you compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any differences? Are you running the latest BIOS? On 12.10.2021 15:56, Warner Losh wrote: > One piece of data that would be good to have: > > nvmecontrol identify nvme0 > > There's an optional feature that none of my drives have, but that the Linux > driver does oddly > weird things when enabled. The output of that command will help me > understand if that may > be in play. Maybe we need to do oddly weird things too :) > > Warner > > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh wrote: > >> >> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev wrote: >> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh : >>> >>>> >>>> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev wrote: >>>> >>>>> >>>>> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh : >>>>> >>>>>> >>>>>> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh : >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> Warner Losh : >>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Pavel Timofeev : >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Chuck Tuffli : >>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev < >>>>>>>>>>> timp87@gmail.com> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the latest >>>>>>>>>>>>> 14.0-CURRENT >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it. >>>>>>>>>>>>>> Despite other things the weird one which concerns me is >>>>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>>>> message I get sometimes on the console. >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop, >>>>>>>>>>> i. e. not >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least I >>>>>>>>>>> haven't >>>>>>>>>>>>> seen >>>>>>>>>>>>>> them for now in such cases. >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage of >>>>>>>>>>>>>> nvmecontrol(8) quickly. >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to return >>>>>>>>>>> the output. >>>>>>>>>>>>> ... >>>>>>>>>>>>>> dmesg when power cycled - >>>>>>>>>>>>>> >>>>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ >>>>>>>>>>>>>> dmesg when rebooted - >>>>>>>>>>>>>> >>>>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh >>>>>>>>>>>>> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log messages >>>>>>>>>>> in the >>>>>>>>>>>>> failing case. Something like: >>>>>>>>>>>>> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>>>> >>>>>>>>>>>>> --chuck >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I >>>>>>>>>>> missing some >>>>>>>>>>>> configuration for that? >>>>>>>>>>>> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>>> nvme0: mem >>>>>>>>>>>> >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>>>>>>> 0.0 on pci6 >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported) >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20 >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, >>>>>>>>>>> MPSMAX 0 >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3 >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvme0: Missing interrupt >>>>>>>>>>>> nvd0: NVMe namespace >>>>>>>>>>>> GEOM: new disk nvd0 >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Ah, sorry, provided wrong output. >>>>>>>>>>> Here is what you requested: >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: mem >>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff >>>>>>>>>>> at device >>>>>>>>>>> 0.0 on pci6 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5 >>>>>>>>>>> MSI-X >>>>>>>>>>> vectors (17 supported) >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for >>>>>>>>>>> MSI-X >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES >>>>>>>>>>> 1023, CQR, >>>>>>>>>>> TO 20 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD >>>>>>>>>>> 0, NSSRS, >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: >>>>>>>>>> 512GB> NVMe >>>>>>>>>>> namespace >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0 >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512 >>>>>>>>>>> byte >>>>>>>>>>> sectors) >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and >>>>>>>>>> hw.cam.nda.nvd_compat=1 >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd >>>>>>>>>> was? Or does >>>>>>>>>> it work? >>>>>>>>>> >>>>>>>>>> Something weird is going on in the interrupt assignment, I think, >>>>>>>>>> but I >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first. >>>>>>>>>> >>>>>>>>>> Warner >>>>>>>>>> >>>>>>>>> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead >>>>>>>>> of hw.cam.nda.nvd_compat? >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now. >>>>>>>>> >>>>>>>>> So I tried to set hw.nvme.use_nvd to 1 as suggested, but I still >>>>>>>>> see >>>>>>>>> nvme0: Missing interrupt >>>>>>>>> and now also >>>>>>>>> Root mount waiting for: CAM >>>>>>>>> messages besides those >>>>>>>>> >>>>>>>> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by default >>>>>>>> these >>>>>>>> days. >>>>>>>> >>>>>>>> I'll take a look on monday starting at the differences in interrupt >>>>>>>> assignment that >>>>>>>> are apparent when you cold boot vs reboot. >>>>>>>> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also >>>>>>>> didn't really >>>>>>>> expect it to be. >>>>>>>> >>>>>>>> Warner >>>>>>>> >>>>>>>> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even >>>>>>> worse now. >>>>>>> So clean poweron works as before. >>>>>>> But if rebooted nvme drive refuses to work, while before the code >>>>>>> upgrade it was just complaining about missing interrupts. >>>>>>> >>>>>>> currently dmesg show this: >>>>>>> nvme0: mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> nvd0: NVMe namespace >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors) >>>>>>> nvme0: mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> >>>>>> >>>>>> Why is this showing up twice? Or is everything above this line left >>>>>> over from the first, working boot? >>>>>> >>>>>> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288 >>>>>>> nvme0: timeout with nothing complete, resetting >>>>>>> nvme0: Resetting controller due to a timeout. >>>>>>> nvme0: RECOVERY_WAITING >>>>>>> nvme0: resetting controller >>>>>>> nvme0: aborting outstanding admin command >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >>>>>>> cdw11:00000000 >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>>>> nvme0: nvme_identify_controller failed! >>>>>>> nvme0: waiting >>>>>>> >>>>>> >>>>>> Clearly something bad is going on with the drive here... We looked >>>>>> into the completion queues when we didn't get an interrupt and there was >>>>>> nothing complete there.... >>>>>> >>>>>> The only thing I can think of is that this means there's a phase error >>>>>> between the drive and the system. I recently removed a second reset and >>>>>> made it an option NVME_2X_RESET. Can you see if adding >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this? >>>>>> >>>>>> Warner >>>>>> >>>>>> >>>>>>> nvme0: mem >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>>>> 0.0 on pci6 >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561 >>>>>>> nvme0: timeout with nothing complete, resetting >>>>>>> nvme0: Resetting controller due to a timeout. >>>>>>> nvme0: RECOVERY_WAITING >>>>>>> nvme0: resetting controller >>>>>>> nvme0: aborting outstanding admin command >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 >>>>>>> cdw11:00000000 >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>>>> nvme0: nvme_identify_controller failed! >>>>>>> nvme0: waiting >>>>>>> >>>>>>> >>>>> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's >>>>> like: >>>>> nvme0: mem >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device >>>>> 0.0 on pci6 >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423 >>>>> nvme0: timeout with nothing complete, resetting >>>>> nvme0: Resetting controller due to a timeout. >>>>> nvme0: RECOVERY_WAITING >>>>> nvme0: resetting controller >>>>> nvme0: aborting outstanding admin command >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000 >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0 >>>>> nvme0: nvme_identify_controller failed! >>>>> nvme0: waiting >>>>> >>>>> Well, neither Windows not Linux have any problems with the device. I >>>>> understand they may be hiding it or workaround somehow. >>>>> >>>> >>>> Yea, I'm trying to figure out why your machine is different than mine, >>>> and what Windows or Linux do that is different. It may be dodgy hardware, >>>> but others have no trouble... >>>> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back in a >>>>> while. >>>>> >>>> >>>> Thanks. If it helps, that tells me something. If it doesn't, that tells >>>> me something else. >>>> >>>> I suspect that it is somewhere else in the system, tbh, but I need to >>>> find it systematically. >>>> >>>> Warner >>>> >>> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed >>> anything. I. e it didn't help. >>> >> >> While it would have been nice to have this be the fix, I'm not that >> surprised either. >> It was the biggest change of late, apart from the big re-arrangement that >> I'd done. >> >> So the other changes have moved from the occasional missing interrupt >> message >> (which the old code would get when a command wasn't completed in the >> timeout >> period, but that we found to be done when we scanned the completion queue. >> Now >> the device is detected fine (as before), but then doesn't do I/O at all >> (including not >> completing the identify command!) and is worse. This is unexpected and I'm >> trying >> understand what happens on reboot that 'changes'the working state and why >> the >> new code behaves so differently. >> >> Warner >> > -- Alexander Motin