From: Warner Losh
Date: Tue, 3 Sep 2019 19:19:02 -0600
Subject: Re: svn commit: r351747 - head/sys/dev/nvme
To: Ravi Pokala
Cc: Warner Losh, src-committers, svn-src-all, svn-src-head
On Tue, Sep 3, 2019 at 2:38 PM Ravi Pokala wrote:

> Hi Warner,
>
> +	/*
> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
> +	 * delete the hardware I/O queues, and then shutdown. This properly
> +	 * flushes any metadata the drive may have stored so it can survive
> +	 * having its power removed and prevents the unsafe shutdown count from
> +	 * incrementing. Once we delete the qpairs, we have to disable them
> +	 * before shutting down. The delay is out of paranoia in
> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
> +	 * pending I/O that the delay copes with).
> +	 */
> +	nvme_ctrlr_delete_qpairs(ctrlr);
> +	nvme_ctrlr_disable_qpairs(ctrlr);
> +	DELAY(100*1000);
> +	nvme_ctrlr_shutdown(ctrlr);
>
> This seems backwards to me. From that section:
>
> The host should perform the following actions in sequence for a normal
> shutdown:
>
> 1. Stop submitting any new I/O commands to the controller and allow any
>    outstanding commands to complete;
>
> 2. If the controller implements I/O queues, then the host should delete
>    all I/O Submission Queues, using the Delete I/O Submission Queue
>    command. A result of the successful completion of the Delete I/O
>    Submission Queue command is that any remaining commands outstanding
>    are aborted;
>
> 3. If the controller implements I/O queues, then the host should delete
>    all I/O Completion Queues, using the Delete I/O Completion Queue
>    command; and
>
> 4. The host should set the Shutdown Notification (CC.SHN) field to 01b
>    to indicate a normal shutdown operation. The controller indicates
>    when shutdown processing is completed by updating the Shutdown
>    Status (CSTS.SHST) field to 10b.
>
> You are calling nvme_ctrlr_delete_qpairs() -- which implements steps (2)
> and (3) -- before you are calling nvme_ctrlr_disable_qpairs() -- which
> implements step (1).
>
> Or am I missing something?

The system stops all I/O and makes sure all I/O is complete before starting
to suspend, so step (1) is already done by the time these run. Done in the
reverse order, I don't have the queues available to delete.

Warner
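Putting that answer together with the spec text, the suspend path lines up
with section 7.6.2 as follows. This is a minimal sketch reusing the functions
this commit adds (declarations assumed from nvme_private.h); the wrapper name
suspend_sequence_sketch is made up for illustration, and step (1) is assumed
to have been completed by the upper layers before device_suspend runs:

/*
 * Sketch of how nvme_ctrlr_suspend() maps onto NVMe 1.4, section 7.6.2.
 * Step (1) -- stop new I/O and drain outstanding commands -- is handled
 * by the layers above before the device_suspend methods are called.
 */
static void
suspend_sequence_sketch(struct nvme_controller *ctrlr)
{
	/* Steps (2) and (3): delete the I/O Submission and Completion
	 * Queues in hardware; any straggling commands are aborted. */
	nvme_ctrlr_delete_qpairs(ctrlr);

	/* Tear down the driver's software queue state afterwards; doing
	 * this first would leave no usable admin queue on which to send
	 * the delete commands. */
	nvme_ctrlr_disable_qpairs(ctrlr);

	/* Paranoia delay mirroring nvme_ctrlr_hw_reset(). */
	DELAY(100 * 1000);

	/* Step (4): set CC.SHN = 01b and wait for CSTS.SHST = 10b. */
	nvme_ctrlr_shutdown(ctrlr);
}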
> When we suspend, we need to properly shut down the NVMe controller. The
> controller may go into D3 state (or may have the power removed), and
> to properly flush the metadata to non-volatile RAM, we must complete a
> normal shutdown. This consists of deleting the I/O queues and setting
> the shutdown bit. We have to do some extra stuff to make sure we
> reset the software state of the queues as well.
>
> On resume, we have to reset the card twice, for reasons described in
> the attach function. Once we've done that, we can restart the card. If
> any of this fails, we'll fail the NVMe card, just like we do when a
> reset fails.
>
> Set is_resetting for the duration of the suspend / resume. This keeps
> the reset taskqueue from running a concurrent reset, and is also
> needed to prevent any hw completions from queueing more I/O to the
> card. Pass the resetting flag to nvme_ctrlr_start(); it doesn't need to
> get that from the global state of the ctrlr. Wait for any pending reset
> to finish. All queued I/O will get sent to the hardware as part of
> nvme_ctrlr_start(), though the upper layers shouldn't send any
> down. Disabling the qpairs is the other failsafe to ensure all I/O is
> queued.
>
> Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
> confusion with all the other destroy functions. It just removes the
> queues in hardware, while the other _destroy_ functions tear down
> driver data structures.
>
> Split parts of the hardware reset function up so that I can do part of
> the reset in suspend. Split out the software disabling of the qpairs
> into nvme_ctrlr_disable_qpairs.
>
> Finally, fix a couple of spelling errors in comments related to this.
>
> Relnotes: Yes
> MFC After: 1 week
> Reviewed by: scottl@ (prior version)
> Differential Revision: https://reviews.freebsd.org/D21493
>
> Modified:
>   head/sys/dev/nvme/nvme.c
>   head/sys/dev/nvme/nvme_ctrlr.c
>   head/sys/dev/nvme/nvme_pci.c
>   head/sys/dev/nvme/nvme_private.h
>
> Modified: head/sys/dev/nvme/nvme.c
> ==============================================================================
> --- head/sys/dev/nvme/nvme.c	Tue Sep  3 14:55:19 2019	(r351746)
> +++ head/sys/dev/nvme/nvme.c	Tue Sep  3 15:26:11 2019	(r351747)
> @@ -137,9 +137,10 @@ nvme_attach(device_t dev)
> 	}
>
> 	/*
> -	 * Reset controller twice to ensure we do a transition from cc.en==1
> -	 * to cc.en==0. This is because we don't really know what status
> -	 * the controller was left in when boot handed off to OS.
> +	 * Reset controller twice to ensure we do a transition from cc.en==1 to
> +	 * cc.en==0. This is because we don't really know what status the
> +	 * controller was left in when boot handed off to OS. Linux doesn't do
> +	 * this, however. If we adopt that policy, see also nvme_ctrlr_resume().
> 	 */
> 	status = nvme_ctrlr_hw_reset(ctrlr);
> 	if (status != 0) {
>
> Modified: head/sys/dev/nvme/nvme_ctrlr.c
> ==============================================================================
> --- head/sys/dev/nvme/nvme_ctrlr.c	Tue Sep  3 14:55:19 2019	(r351746)
> +++ head/sys/dev/nvme/nvme_ctrlr.c	Tue Sep  3 15:26:11 2019	(r351747)
> @@ -118,8 +118,8 @@ nvme_ctrlr_construct_io_qpairs(struct nvme_controller
>
> 	/*
> 	 * Our best estimate for the maximum number of I/Os that we should
> -	 * noramlly have in flight at one time. This should be viewed as a hint,
> -	 * not a hard limit and will need to be revisitted when the upper layers
> +	 * normally have in flight at one time. This should be viewed as a hint,
> +	 * not a hard limit and will need to be revisited when the upper layers
> 	 * of the storage system grows multi-queue support.
> 	 */
> 	ctrlr->max_hw_pend_io = num_trackers * ctrlr->num_io_queues * 3 / 4;
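To make that estimate concrete, here is the arithmetic with hypothetical
values (the real num_trackers and num_io_queues depend on what the controller
negotiates at attach time):

/*
 * Hypothetical example of the max_hw_pend_io hint above: the estimate
 * works out to 75% of the total tracker count across all I/O queues.
 */
uint32_t num_trackers = 128;	/* assumed per-queue tracker depth */
uint32_t num_io_queues = 4;	/* assumed number of I/O queues */
uint32_t max_hw_pend_io = num_trackers * num_io_queues * 3 / 4;	/* = 384 */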
> @@ -344,10 +344,10 @@ nvme_ctrlr_enable(struct nvme_controller *ctrlr)
> 	return (nvme_ctrlr_wait_for_ready(ctrlr, 1));
> }
>
> -int
> -nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
> +static void
> +nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
> {
> -	int i, err;
> +	int i;
>
> 	nvme_admin_qpair_disable(&ctrlr->adminq);
> 	/*
> @@ -359,7 +359,15 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
> 		for (i = 0; i < ctrlr->num_io_queues; i++)
> 			nvme_io_qpair_disable(&ctrlr->ioq[i]);
> 	}
> +}
>
> +int
> +nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
> +{
> +	int err;
> +
> +	nvme_ctrlr_disable_qpairs(ctrlr);
> +
> 	DELAY(100*1000);
>
> 	err = nvme_ctrlr_disable(ctrlr);
> @@ -481,7 +489,7 @@ nvme_ctrlr_create_qpairs(struct nvme_controller *ctrlr
> }
>
> static int
> -nvme_ctrlr_destroy_qpairs(struct nvme_controller *ctrlr)
> +nvme_ctrlr_delete_qpairs(struct nvme_controller *ctrlr)
> {
> 	struct nvme_completion_poll_status	status;
> 	struct nvme_qpair			*qpair;
> @@ -820,7 +828,7 @@ nvme_ctrlr_configure_int_coalescing(struct nvme_contro
> }
>
> static void
> -nvme_ctrlr_start(void *ctrlr_arg)
> +nvme_ctrlr_start(void *ctrlr_arg, bool resetting)
> {
> 	struct nvme_controller *ctrlr = ctrlr_arg;
> 	uint32_t old_num_io_queues;
> @@ -833,7 +841,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
> 	 * the number of I/O queues supported, so cannot reset
> 	 * the adminq again here.
> 	 */
> -	if (ctrlr->is_resetting)
> +	if (resetting)
> 		nvme_qpair_reset(&ctrlr->adminq);
>
> 	for (i = 0; i < ctrlr->num_io_queues; i++)
> @@ -854,7 +862,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
> 	 * explicit specify how many queues it will use. This value should
> 	 * never change between resets, so panic if somehow that does happen.
> 	 */
> -	if (ctrlr->is_resetting) {
> +	if (resetting) {
> 		old_num_io_queues = ctrlr->num_io_queues;
> 		if (nvme_ctrlr_set_num_qpairs(ctrlr) != 0) {
> 			nvme_ctrlr_fail(ctrlr);
> @@ -894,7 +902,7 @@ nvme_ctrlr_start_config_hook(void *arg)
>
> 	if (nvme_ctrlr_set_num_qpairs(ctrlr) == 0 &&
> 	    nvme_ctrlr_construct_io_qpairs(ctrlr) == 0)
> -		nvme_ctrlr_start(ctrlr);
> +		nvme_ctrlr_start(ctrlr, false);
> 	else
> 		nvme_ctrlr_fail(ctrlr);
>
> @@ -923,7 +931,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
> 	 */
> 	pause("nvmereset", hz / 10);
> 	if (status == 0)
> -		nvme_ctrlr_start(ctrlr);
> +		nvme_ctrlr_start(ctrlr, true);
> 	else
> 		nvme_ctrlr_fail(ctrlr);
>
> @@ -946,7 +954,7 @@ nvme_ctrlr_poll(struct nvme_controller *ctrlr)
> }
>
> /*
> - * Poll the single-vector intertrupt case: num_io_queues will be 1 and
> + * Poll the single-vector interrupt case: num_io_queues will be 1 and
>  * there's only a single vector. While we're polling, we mask further
>  * interrupts in the controller.
>  */
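The net effect of the nvme_ctrlr_start() change above is that "are we
resetting" becomes an explicit argument instead of a read of
ctrlr->is_resetting, which lets the resume path (later in this diff) reuse
the reset flavor of startup while the flag is held for a different reason.
Condensed from the hunks above and from nvme_ctrlr_resume() below, the call
sites after this patch are:

/* Condensed view of the nvme_ctrlr_start() call sites after this patch. */
nvme_ctrlr_start(ctrlr, false);	/* config hook: first start, adminq already fresh */
nvme_ctrlr_start(ctrlr, true);	/* reset task: reset adminq, re-check queue count */
nvme_ctrlr_start(ctrlr, true);	/* resume: behaves like a reset; see nvme_ctrlr_resume() */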
> @@ -1012,7 +1020,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
> 	if (is_user_buffer) {
> 		/*
> 		 * Ensure the user buffer is wired for the duration of
> -		 * this passthrough command.
> +		 * this pass-through command.
> 		 */
> 		PHOLD(curproc);
> 		buf = uma_zalloc(pbuf_zone, M_WAITOK);
> @@ -1031,7 +1039,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
> 	} else
> 		req = nvme_allocate_request_null(nvme_pt_done, pt);
>
> -	/* Assume userspace already converted to little-endian */
> +	/* Assume user space already converted to little-endian */
> 	req->cmd.opc = pt->cmd.opc;
> 	req->cmd.fuse = pt->cmd.fuse;
> 	req->cmd.rsvd2 = pt->cmd.rsvd2;
> @@ -1206,7 +1214,7 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, dev
>
> 	if (ctrlr->is_initialized) {
> 		if (!gone)
> -			nvme_ctrlr_destroy_qpairs(ctrlr);
> +			nvme_ctrlr_delete_qpairs(ctrlr);
> 		for (i = 0; i < ctrlr->num_io_queues; i++)
> 			nvme_io_qpair_destroy(&ctrlr->ioq[i]);
> 		free(ctrlr->ioq, M_NVME);
> @@ -1305,4 +1313,88 @@ nvme_ctrlr_get_data(struct nvme_controller *ctrlr)
> {
>
> 	return (&ctrlr->cdata);
> +}
> +
> +int
> +nvme_ctrlr_suspend(struct nvme_controller *ctrlr)
> +{
> +	int to = hz;
> +
> +	/*
> +	 * Can't touch failed controllers, so it's already suspended.
> +	 */
> +	if (ctrlr->is_failed)
> +		return (0);
> +
> +	/*
> +	 * We don't want the reset taskqueue running, since it does similar
> +	 * things, so prevent it from running after we start. Wait for any reset
> +	 * that may have been started to complete. The reset process we follow
> +	 * will ensure that any new I/O will queue and be given to the hardware
> +	 * after we resume (though there should be none).
> +	 */
> +	while (atomic_cmpset_32(&ctrlr->is_resetting, 0, 1) == 0 && to-- > 0)
> +		pause("nvmesusp", 1);
> +	if (to <= 0) {
> +		nvme_printf(ctrlr,
> +		    "Competing reset task didn't finish. Try again later.\n");
> +		return (EWOULDBLOCK);
> +	}
> +
> +	/*
> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
> +	 * delete the hardware I/O queues, and then shutdown. This properly
> +	 * flushes any metadata the drive may have stored so it can survive
> +	 * having its power removed and prevents the unsafe shutdown count from
> +	 * incrementing. Once we delete the qpairs, we have to disable them
> +	 * before shutting down. The delay is out of paranoia in
> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
> +	 * pending I/O that the delay copes with).
> +	 */
> +	nvme_ctrlr_delete_qpairs(ctrlr);
> +	nvme_ctrlr_disable_qpairs(ctrlr);
> +	DELAY(100*1000);
> +	nvme_ctrlr_shutdown(ctrlr);
> +
> +	return (0);
> +}
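For context on what the final nvme_ctrlr_shutdown() call does (step (4) of
the spec sequence), here is a minimal standalone sketch of the register
handshake. The read_reg()/write_reg() accessors and register ids are
hypothetical stand-ins for the driver's real helpers; only the bit layouts
come from the spec (CC.SHN is bits 15:14, CSTS.SHST is bits 3:2):

#include <stdint.h>

enum { NVME_REG_CC, NVME_REG_CSTS };		/* hypothetical register ids */
extern uint32_t read_reg(int reg);		/* hypothetical accessor */
extern void write_reg(int reg, uint32_t val);	/* hypothetical accessor */

#define CC_SHN_NORMAL	(1u << 14)	/* CC.SHN = 01b: normal shutdown */
#define CC_SHN_MASK	(3u << 14)
#define CSTS_SHST_MASK	(3u << 2)	/* CSTS.SHST field */
#define CSTS_SHST_DONE	(2u << 2)	/* 10b: shutdown processing complete */

static void
normal_shutdown_sketch(void)
{
	uint32_t cc;

	/* Request a normal shutdown. */
	cc = read_reg(NVME_REG_CC);
	cc = (cc & ~CC_SHN_MASK) | CC_SHN_NORMAL;
	write_reg(NVME_REG_CC, cc);

	/* Wait for the controller to finish flushing its state. A real
	 * implementation bounds this loop with a timeout. */
	while ((read_reg(NVME_REG_CSTS) & CSTS_SHST_MASK) != CSTS_SHST_DONE)
		;
}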
> +int
> +nvme_ctrlr_resume(struct nvme_controller *ctrlr)
> +{
> +
> +	/*
> +	 * Can't touch failed controllers, so nothing to do to resume.
> +	 */
> +	if (ctrlr->is_failed)
> +		return (0);
> +
> +	/*
> +	 * Have to reset the hardware twice, just like we do on attach. See
> +	 * nvme_attach() for why.
> +	 */
> +	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
> +		goto fail;
> +	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
> +		goto fail;
> +
> +	/*
> +	 * Now that we've reset the hardware, we can restart the controller. Any
> +	 * I/O that was pending is requeued. Any admin commands are aborted with
> +	 * an error. Once we've restarted, take the controller out of reset.
> +	 */
> +	nvme_ctrlr_start(ctrlr, true);
> +	atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
> +
> +	return (0);
> +fail:
> +	/*
> +	 * Since we can't bring the controller out of reset, announce and fail
> +	 * the controller. However, we have to return success for the resume
> +	 * itself, due to questionable APIs.
> +	 */
> +	nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
> +	nvme_ctrlr_fail(ctrlr);
> +	atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
> +	return (0);
> }
>
> Modified: head/sys/dev/nvme/nvme_pci.c
> ==============================================================================
> --- head/sys/dev/nvme/nvme_pci.c	Tue Sep  3 14:55:19 2019	(r351746)
> +++ head/sys/dev/nvme/nvme_pci.c	Tue Sep  3 15:26:11 2019	(r351747)
> @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
> static int    nvme_pci_probe(device_t);
> static int    nvme_pci_attach(device_t);
> static int    nvme_pci_detach(device_t);
> +static int    nvme_pci_suspend(device_t);
> +static int    nvme_pci_resume(device_t);
>
> static void nvme_ctrlr_setup_interrupts(struct nvme_controller *ctrlr);
>
> @@ -51,6 +53,8 @@ static device_method_t nvme_pci_methods[] = {
> 	DEVMETHOD(device_probe,     nvme_pci_probe),
> 	DEVMETHOD(device_attach,    nvme_pci_attach),
> 	DEVMETHOD(device_detach,    nvme_pci_detach),
> +	DEVMETHOD(device_suspend,   nvme_pci_suspend),
> +	DEVMETHOD(device_resume,    nvme_pci_resume),
> 	DEVMETHOD(device_shutdown,  nvme_shutdown),
> 	{ 0, 0 }
> };
> @@ -331,4 +335,22 @@ nvme_ctrlr_setup_interrupts(struct nvme_controller *ct
> 	}
>
> 	ctrlr->msix_enabled = 1;
> +}
> +
> +static int
> +nvme_pci_suspend(device_t dev)
> +{
> +	struct nvme_controller	*ctrlr;
> +
> +	ctrlr = DEVICE2SOFTC(dev);
> +	return (nvme_ctrlr_suspend(ctrlr));
> +}
> +
> +static int
> +nvme_pci_resume(device_t dev)
> +{
> +	struct nvme_controller	*ctrlr;
> +
> +	ctrlr = DEVICE2SOFTC(dev);
> +	return (nvme_ctrlr_resume(ctrlr));
> }
>
> Modified: head/sys/dev/nvme/nvme_private.h
> ==============================================================================
> --- head/sys/dev/nvme/nvme_private.h	Tue Sep  3 14:55:19 2019	(r351746)
> +++ head/sys/dev/nvme/nvme_private.h	Tue Sep  3 15:26:11 2019	(r351747)
> @@ -556,4 +556,7 @@ void	nvme_notify_ns(struct nvme_controller *ctrlr, int
> void	nvme_ctrlr_intx_handler(void *arg);
> void	nvme_ctrlr_poll(struct nvme_controller *ctrlr);
>
> +int	nvme_ctrlr_suspend(struct nvme_controller *ctrlr);
> +int	nvme_ctrlr_resume(struct nvme_controller *ctrlr);
> +
> #endif /* __NVME_PRIVATE_H__ */
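The PCI glue above just forwards to the controller-level routines
(DEVICE2SOFTC is the driver's cast wrapper around device_get_softc()), so a
hypothetical non-PCI bus front end could reuse them with the same pattern.
The nvme_foo_* names below are made up for illustration:

/*
 * Illustrative-only bus glue for a hypothetical non-PCI attachment,
 * mirroring the nvme_pci_suspend()/nvme_pci_resume() methods above.
 */
static int
nvme_foo_suspend(device_t dev)
{
	struct nvme_controller *ctrlr = device_get_softc(dev);

	return (nvme_ctrlr_suspend(ctrlr));
}

static int
nvme_foo_resume(device_t dev)
{
	struct nvme_controller *ctrlr = device_get_softc(dev);

	return (nvme_ctrlr_resume(ctrlr));
}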