Date:        Tue, 3 Sep 2019 19:20:23 -0600
From:        Warner Losh <imp@bsdimp.com>
To:          Ravi Pokala <rpokala@freebsd.org>
Cc:          Warner Losh <imp@freebsd.org>, src-committers <src-committers@freebsd.org>,
             svn-src-all <svn-src-all@freebsd.org>, svn-src-head <svn-src-head@freebsd.org>
Subject:     Re: svn commit: r351747 - head/sys/dev/nvme
Message-ID:  <CANCZdfo7TiPZP02vai0r5fMAAn%2BQwGRwns5Db805LXv68RZkFA@mail.gmail.com>
In-Reply-To: <CANCZdfopPpaoKE=K%2B20juKp_DQ19aTOK_MVzhM5SJXyBbbAO7g@mail.gmail.com>
References:  <201909031526.x83FQBbE002770@repo.freebsd.org>
             <F2340AD7-11C8-4650-8945-D8B2EC9A8FBB@panasas.com>
             <CANCZdfopPpaoKE=K%2B20juKp_DQ19aTOK_MVzhM5SJXyBbbAO7g@mail.gmail.com>
On Tue, Sep 3, 2019 at 7:19 PM Warner Losh <imp@bsdimp.com> wrote:
>
> On Tue, Sep 3, 2019 at 2:38 PM Ravi Pokala <rpokala@freebsd.org> wrote:
>
>> Hi Warner,
>>
>> +	/*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> +	 */
>> +	nvme_ctrlr_delete_qpairs(ctrlr);
>> +	nvme_ctrlr_disable_qpairs(ctrlr);
>> +	DELAY(100*1000);
>> +	nvme_ctrlr_shutdown(ctrlr);
>>
>> This seems backwards to me. From that section:
>>
>> > The host should perform the following actions in sequence for a normal
>> > shutdown:
>> > 1. Stop submitting any new I/O commands to the controller and allow any
>> >    outstanding commands to complete;
>> > 2. If the controller implements I/O queues, then the host should delete
>> >    all I/O Submission Queues, using the Delete I/O Submission Queue
>> >    command. A result of the successful completion of the Delete I/O
>> >    Submission Queue command is that any remaining commands outstanding
>> >    are aborted;
>> > 3. If the controller implements I/O queues, then the host should delete
>> >    all I/O Completion Queues, using the Delete I/O Completion Queue
>> >    command; and
>> > 4. The host should set the Shutdown Notification (CC.SHN) field to 01b
>> >    to indicate a normal shutdown operation. The controller indicates
>> >    when shutdown processing is completed by updating the Shutdown Status
>> >    (CSTS.SHST) field to 10b.
>>
>> You are calling nvme_ctrlr_delete_qpairs() -- which implements steps (2)
>> and (3) -- before you are calling nvme_ctrlr_disable_qpairs() -- which
>> implements step (1).
>>
>> Or am I missing something?
>
> The system stops all I/O and makes sure all I/O is complete before
> starting to suspend. Done in the reverse order, I don't have the queues

Doh! disable_qpairs() makes it impossible to send the delete_qpairs() as
well, so I have to do that first. Not entirely sure why, but since the OS
enforces the I/O ban, not the driver, I'm good. Otherwise, I'd have to
drain all my queues, etc., before starting this and come up with some
interlock to prevent new I/O from being submitted...

Warner

> Warner
>
>> When we suspend, we need to properly shutdown the NVMe controller. The
>> controller may go into D3 state (or may have the power removed), and
>> to properly flush the metadata to non-volatile RAM, we must complete a
>> normal shutdown. This consists of deleting the I/O queues and setting
>> the shutdown bit. We have to do some extra stuff to make sure we
>> reset the software state of the queues as well.
>>
>> On resume, we have to reset the card twice, for reasons described in
>> the attach function. Once we've done that, we can restart the card. If
>> any of this fails, we'll fail the NVMe card, just like we do when a
>> reset fails.
>>
>> Set is_resetting for the duration of the suspend / resume. This keeps
>> the reset taskqueue from running a concurrent reset, and also is
>> needed to prevent any hw completions from queueing more I/O to the
>> card. Pass the resetting flag to nvme_ctrlr_start. It doesn't need to
>> get that from the global state of the ctrlr. Wait for any pending
>> reset to finish. All queued I/O will get sent to the hardware as part
>> of nvme_ctrlr_start(), though the upper layers shouldn't send any
>> down. Disabling the qpairs is the other failsafe to ensure all I/O is
>> queued.
>>
>> Rename nvme_ctrlr_destroy_qpairs to nvme_ctrlr_delete_qpairs to avoid
>> confusion with all the other destroy functions. It just removes the
>> queues in hardware, while the other _destroy_ functions tear down
>> driver data structures.
>>
>> Split parts of the hardware reset function up so that I can do part of
>> the reset in suspend. Split out the software disabling of the qpairs
>> into nvme_ctrlr_disable_qpairs.
>>
>> Finally, fix a couple of spelling errors in comments related to this.
>>
>> Relnotes:	Yes
>> MFC After:	1 week
>> Reviewed by:	scottl@ (prior version)
>> Differential Revision:	https://reviews.freebsd.org/D21493
>>
>> Modified:
>>   head/sys/dev/nvme/nvme.c
>>   head/sys/dev/nvme/nvme_ctrlr.c
>>   head/sys/dev/nvme/nvme_pci.c
>>   head/sys/dev/nvme/nvme_private.h
>>
>> Modified: head/sys/dev/nvme/nvme.c
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme.c	Tue Sep  3 14:55:19 2019	(r351746)
>> +++ head/sys/dev/nvme/nvme.c	Tue Sep  3 15:26:11 2019	(r351747)
>> @@ -137,9 +137,10 @@ nvme_attach(device_t dev)
>>  	}
>>
>>  	/*
>> -	 * Reset controller twice to ensure we do a transition from cc.en==1
>> -	 * to cc.en==0. This is because we don't really know what status
>> -	 * the controller was left in when boot handed off to OS.
>> +	 * Reset controller twice to ensure we do a transition from cc.en==1 to
>> +	 * cc.en==0. This is because we don't really know what status the
>> +	 * controller was left in when boot handed off to OS. Linux doesn't do
>> +	 * this, however. If we adopt that policy, see also nvme_ctrlr_resume().
>>  	 */
>>  	status = nvme_ctrlr_hw_reset(ctrlr);
>>  	if (status != 0) {
>>
>> Modified: head/sys/dev/nvme/nvme_ctrlr.c
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_ctrlr.c	Tue Sep  3 14:55:19 2019	(r351746)
>> +++ head/sys/dev/nvme/nvme_ctrlr.c	Tue Sep  3 15:26:11 2019	(r351747)
>> @@ -118,8 +118,8 @@ nvme_ctrlr_construct_io_qpairs(struct nvme_controller
>>
>>  	/*
>>  	 * Our best estimate for the maximum number of I/Os that we should
>> -	 * noramlly have in flight at one time. This should be viewed as a hint,
>> -	 * not a hard limit and will need to be revisitted when the upper layers
>> +	 * normally have in flight at one time. This should be viewed as a hint,
>> +	 * not a hard limit and will need to be revisited when the upper layers
>>  	 * of the storage system grows multi-queue support.
>>  	 */
>>  	ctrlr->max_hw_pend_io = num_trackers * ctrlr->num_io_queues * 3 / 4;
>> @@ -344,10 +344,10 @@ nvme_ctrlr_enable(struct nvme_controller *ctrlr)
>>  	return (nvme_ctrlr_wait_for_ready(ctrlr, 1));
>>  }
>>
>> -int
>> -nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +static void
>> +nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
>>  {
>> -	int i, err;
>> +	int i;
>>
>>  	nvme_admin_qpair_disable(&ctrlr->adminq);
>>  	/*
>> @@ -359,7 +359,15 @@ nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>>  		for (i = 0; i < ctrlr->num_io_queues; i++)
>>  			nvme_io_qpair_disable(&ctrlr->ioq[i]);
>>  	}
>> +}
>>
>> +int
>> +nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
>> +{
>> +	int err;
>> +
>> +	nvme_ctrlr_disable_qpairs(ctrlr);
>> +
>>  	DELAY(100*1000);
>>
>>  	err = nvme_ctrlr_disable(ctrlr);
>> @@ -481,7 +489,7 @@ nvme_ctrlr_create_qpairs(struct nvme_controller *ctrlr
>>  }
>>
>>  static int
>> -nvme_ctrlr_destroy_qpairs(struct nvme_controller *ctrlr)
>> +nvme_ctrlr_delete_qpairs(struct nvme_controller *ctrlr)
>>  {
>>  	struct nvme_completion_poll_status	status;
>>  	struct nvme_qpair			*qpair;
>> @@ -820,7 +828,7 @@ nvme_ctrlr_configure_int_coalescing(struct nvme_contro
>>  }
>>
>>  static void
>> -nvme_ctrlr_start(void *ctrlr_arg)
>> +nvme_ctrlr_start(void *ctrlr_arg, bool resetting)
>>  {
>>  	struct nvme_controller *ctrlr = ctrlr_arg;
>>  	uint32_t old_num_io_queues;
>> @@ -833,7 +841,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>>  	 * the number of I/O queues supported, so cannot reset
>>  	 * the adminq again here.
>>  	 */
>> -	if (ctrlr->is_resetting)
>> +	if (resetting)
>>  		nvme_qpair_reset(&ctrlr->adminq);
>>
>>  	for (i = 0; i < ctrlr->num_io_queues; i++)
>> @@ -854,7 +862,7 @@ nvme_ctrlr_start(void *ctrlr_arg)
>>  	 * explicit specify how many queues it will use. This value should
>>  	 * never change between resets, so panic if somehow that does happen.
>>  	 */
>> -	if (ctrlr->is_resetting) {
>> +	if (resetting) {
>>  		old_num_io_queues = ctrlr->num_io_queues;
>>  		if (nvme_ctrlr_set_num_qpairs(ctrlr) != 0) {
>>  			nvme_ctrlr_fail(ctrlr);
>> @@ -894,7 +902,7 @@ nvme_ctrlr_start_config_hook(void *arg)
>>
>>  	if (nvme_ctrlr_set_num_qpairs(ctrlr) == 0 &&
>>  	    nvme_ctrlr_construct_io_qpairs(ctrlr) == 0)
>> -		nvme_ctrlr_start(ctrlr);
>> +		nvme_ctrlr_start(ctrlr, false);
>>  	else
>>  		nvme_ctrlr_fail(ctrlr);
>>
>> @@ -923,7 +931,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
>>  	 */
>>  	pause("nvmereset", hz / 10);
>>  	if (status == 0)
>> -		nvme_ctrlr_start(ctrlr);
>> +		nvme_ctrlr_start(ctrlr, true);
>>  	else
>>  		nvme_ctrlr_fail(ctrlr);
>>
>> @@ -946,7 +954,7 @@ nvme_ctrlr_poll(struct nvme_controller *ctrlr)
>>  }
>>
>>  /*
>> - * Poll the single-vector intertrupt case: num_io_queues will be 1 and
>> + * Poll the single-vector interrupt case: num_io_queues will be 1 and
>>   * there's only a single vector. While we're polling, we mask further
>>   * interrupts in the controller.
>>   */
>> @@ -1012,7 +1020,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>>  	if (is_user_buffer) {
>>  		/*
>>  		 * Ensure the user buffer is wired for the duration of
>> -		 * this passthrough command.
>> +		 * this pass-through command.
>>  		 */
>>  		PHOLD(curproc);
>>  		buf = uma_zalloc(pbuf_zone, M_WAITOK);
>> @@ -1031,7 +1039,7 @@ nvme_ctrlr_passthrough_cmd(struct nvme_controller *ctr
>>  	} else
>>  		req = nvme_allocate_request_null(nvme_pt_done, pt);
>>
>> -	/* Assume userspace already converted to little-endian */
>> +	/* Assume user space already converted to little-endian */
>>  	req->cmd.opc = pt->cmd.opc;
>>  	req->cmd.fuse = pt->cmd.fuse;
>>  	req->cmd.rsvd2 = pt->cmd.rsvd2;
>> @@ -1206,7 +1214,7 @@ nvme_ctrlr_destruct(struct nvme_controller *ctrlr, dev
>>
>>  	if (ctrlr->is_initialized) {
>>  		if (!gone)
>> -			nvme_ctrlr_destroy_qpairs(ctrlr);
>> +			nvme_ctrlr_delete_qpairs(ctrlr);
>>  		for (i = 0; i < ctrlr->num_io_queues; i++)
>>  			nvme_io_qpair_destroy(&ctrlr->ioq[i]);
>>  		free(ctrlr->ioq, M_NVME);
>> @@ -1305,4 +1313,88 @@ nvme_ctrlr_get_data(struct nvme_controller *ctrlr)
>>  {
>>
>>  	return (&ctrlr->cdata);
>> +}
>> +
>> +int
>> +nvme_ctrlr_suspend(struct nvme_controller *ctrlr)
>> +{
>> +	int to = hz;
>> +
>> +	/*
>> +	 * Can't touch failed controllers, so it's already suspended.
>> +	 */
>> +	if (ctrlr->is_failed)
>> +		return (0);
>> +
>> +	/*
>> +	 * We don't want the reset taskqueue running, since it does similar
>> +	 * things, so prevent it from running after we start. Wait for any reset
>> +	 * that may have been started to complete. The reset process we follow
>> +	 * will ensure that any new I/O will queue and be given to the hardware
>> +	 * after we resume (though there should be none).
>> +	 */
>> +	while (atomic_cmpset_32(&ctrlr->is_resetting, 0, 1) == 0 && to-- > 0)
>> +		pause("nvmesusp", 1);
>> +	if (to <= 0) {
>> +		nvme_printf(ctrlr,
>> +		    "Competing reset task didn't finish. Try again later.\n");
>> +		return (EWOULDBLOCK);
>> +	}
>> +
>> +	/*
>> +	 * Per Section 7.6.2 of NVMe spec 1.4, to properly suspend, we need to
>> +	 * delete the hardware I/O queues, and then shutdown. This properly
>> +	 * flushes any metadata the drive may have stored so it can survive
>> +	 * having its power removed and prevents the unsafe shutdown count from
>> +	 * incrementing. Once we delete the qpairs, we have to disable them
>> +	 * before shutting down. The delay is out of paranoia in
>> +	 * nvme_ctrlr_hw_reset, and is repeated here (though we should have no
>> +	 * pending I/O that the delay copes with).
>> +	 */
>> +	nvme_ctrlr_delete_qpairs(ctrlr);
>> +	nvme_ctrlr_disable_qpairs(ctrlr);
>> +	DELAY(100*1000);
>> +	nvme_ctrlr_shutdown(ctrlr);
>> +
>> +	return (0);
>> +}
>> +
>> +int
>> +nvme_ctrlr_resume(struct nvme_controller *ctrlr)
>> +{
>> +
>> +	/*
>> +	 * Can't touch failed controllers, so nothing to do to resume.
>> +	 */
>> +	if (ctrlr->is_failed)
>> +		return (0);
>> +
>> +	/*
>> +	 * Have to reset the hardware twice, just like we do on attach. See
>> +	 * nvme_attach() for why.
>> +	 */
>> +	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> +		goto fail;
>> +	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
>> +		goto fail;
>> +
>> +	/*
>> +	 * Now that we've reset the hardware, we can restart the controller. Any
>> +	 * I/O that was pending is requeued. Any admin commands are aborted with
>> +	 * an error. Once we've restarted, take the controller out of reset.
>> +	 */
>> +	nvme_ctrlr_start(ctrlr, true);
>> +	atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> +
>> +	return (0);
>> +fail:
>> +	/*
>> +	 * Since we can't bring the controller out of reset, announce and fail
>> +	 * the controller. However, we have to return success for the resume
>> +	 * itself, due to questionable APIs.
>> +	 */
>> +	nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
>> +	nvme_ctrlr_fail(ctrlr);
>> +	atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
>> +	return (0);
>>  }
>>
>> Modified: head/sys/dev/nvme/nvme_pci.c
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_pci.c	Tue Sep  3 14:55:19 2019	(r351746)
>> +++ head/sys/dev/nvme/nvme_pci.c	Tue Sep  3 15:26:11 2019	(r351747)
>> @@ -43,6 +43,8 @@ __FBSDID("$FreeBSD$");
>>  static int    nvme_pci_probe(device_t);
>>  static int    nvme_pci_attach(device_t);
>>  static int    nvme_pci_detach(device_t);
>> +static int    nvme_pci_suspend(device_t);
>> +static int    nvme_pci_resume(device_t);
>>
>>  static void nvme_ctrlr_setup_interrupts(struct nvme_controller *ctrlr);
>>
>> @@ -51,6 +53,8 @@ static device_method_t nvme_pci_methods[] = {
>>  	DEVMETHOD(device_probe,     nvme_pci_probe),
>>  	DEVMETHOD(device_attach,    nvme_pci_attach),
>>  	DEVMETHOD(device_detach,    nvme_pci_detach),
>> +	DEVMETHOD(device_suspend,   nvme_pci_suspend),
>> +	DEVMETHOD(device_resume,    nvme_pci_resume),
>>  	DEVMETHOD(device_shutdown,  nvme_shutdown),
>>  	{ 0, 0 }
>>  };
>> @@ -331,4 +335,22 @@ nvme_ctrlr_setup_interrupts(struct nvme_controller *ct
>>  	}
>>
>>  	ctrlr->msix_enabled = 1;
>> +}
>> +
>> +static int
>> +nvme_pci_suspend(device_t dev)
>> +{
>> +	struct nvme_controller	*ctrlr;
>> +
>> +	ctrlr = DEVICE2SOFTC(dev);
>> +	return (nvme_ctrlr_suspend(ctrlr));
>> +}
>> +
>> +static int
>> +nvme_pci_resume(device_t dev)
>> +{
>> +	struct nvme_controller	*ctrlr;
>> +
>> +	ctrlr = DEVICE2SOFTC(dev);
>> +	return (nvme_ctrlr_resume(ctrlr));
>>  }
>>
>> Modified: head/sys/dev/nvme/nvme_private.h
>> ==============================================================================
>> --- head/sys/dev/nvme/nvme_private.h	Tue Sep  3 14:55:19 2019	(r351746)
>> +++ head/sys/dev/nvme/nvme_private.h	Tue Sep  3 15:26:11 2019	(r351747)
>> @@ -556,4 +556,7 @@ void	nvme_notify_ns(struct nvme_controller *ctrlr, int
>>  void	nvme_ctrlr_intx_handler(void *arg);
>>  void	nvme_ctrlr_poll(struct nvme_controller *ctrlr);
>>
>> +int	nvme_ctrlr_suspend(struct nvme_controller *ctrlr);
>> +int	nvme_ctrlr_resume(struct nvme_controller *ctrlr);
>> +
>>  #endif /* __NVME_PRIVATE_H__ */
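
For readers following the spec discussion above, here is a minimal, self-contained
sketch of steps 2-4 of the normal-shutdown sequence from NVMe 1.4 section 7.6.2,
including the CC.SHN / CSTS.SHST handshake that the committed code reaches via
nvme_ctrlr_shutdown(). The struct, helpers, and bit masks below are illustrative
stand-ins, not the driver's actual code, and step 1 (quiescing I/O) is assumed to
have already happened -- in the committed change the OS, not the driver, enforces
that.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the controller registers; a real driver reads
 * and writes these as memory-mapped BAR registers.
 */
struct ctrlr_regs_sketch {
	volatile uint32_t cc;		/* Controller Configuration */
	volatile uint32_t csts;		/* Controller Status */
};

#define	CC_SHN_NORMAL	(0x1u << 14)	/* CC.SHN = 01b: normal shutdown */
#define	CSTS_SHST_MASK	(0x3u << 2)
#define	CSTS_SHST_DONE	(0x2u << 2)	/* CSTS.SHST = 10b: shutdown complete */

/* Placeholders for the Delete I/O Submission/Completion Queue admin commands. */
static void delete_io_submission_queues(struct ctrlr_regs_sketch *c) { (void)c; }
static void delete_io_completion_queues(struct ctrlr_regs_sketch *c) { (void)c; }

/*
 * Steps 2-4 of the normal shutdown sequence.  Step 1 (stop submitting I/O and
 * let outstanding commands complete) is assumed already done by the caller.
 */
static int
normal_shutdown_sketch(struct ctrlr_regs_sketch *c)
{
	delete_io_submission_queues(c);		/* step 2 */
	delete_io_completion_queues(c);		/* step 3 */

	c->cc |= CC_SHN_NORMAL;			/* step 4: request normal shutdown */
	for (int i = 0; i < 5000; i++) {	/* bounded poll */
		if ((c->csts & CSTS_SHST_MASK) == CSTS_SHST_DONE)
			return (0);		/* controller finished flushing */
		/* a real driver would sleep/DELAY() here and honor the RTD3E hint */
	}
	return (-1);				/* timed out */
}

int
main(void)
{
	/* Simulate a controller that already reports shutdown complete. */
	struct ctrlr_regs_sketch c = { .cc = 0, .csts = CSTS_SHST_DONE };

	printf("shutdown sketch: %d\n", normal_shutdown_sketch(&c));
	return (0);
}

Built with a plain "cc -std=c11 -Wall", this runs to completion and prints 0; its
only purpose is to show the ordering and the register handshake in one place.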
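
The suspend path in the diff also leans on the is_resetting flag as an interlock
against the reset taskqueue. The snippet below is a small userland analogue of that
gate using C11 atomics; the names and retry bound are illustrative, and the kernel
code uses atomic_cmpset_32() plus pause() rather than <stdatomic.h>.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Analogue of ctrlr->is_resetting: 0 = idle, 1 = reset/suspend in progress. */
static atomic_uint is_resetting;

/*
 * Try to claim the gate before suspending.  If a reset task already holds it,
 * retry for a bounded number of attempts (the driver retries for roughly one
 * second of ticks) and give up, which nvme_ctrlr_suspend() maps to EWOULDBLOCK.
 */
static bool
claim_reset_gate(int tries)
{
	unsigned int expected;

	while (tries-- > 0) {
		expected = 0;
		if (atomic_compare_exchange_strong(&is_resetting, &expected, 1))
			return (true);	/* we own the gate; the reset task is excluded */
		/* the driver sleeps one tick here: pause("nvmesusp", 1) */
	}
	return (false);
}

int
main(void)
{
	if (claim_reset_gate(100))
		printf("gate claimed; safe to delete/disable qpairs and shut down\n");

	/* Release on the way out, as resume does with atomic_cmpset_32(..., 1, 0). */
	atomic_store(&is_resetting, 0);
	return (0);
}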
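
Finally, both the nvme_attach() comment and nvme_ctrlr_resume() rely on resetting
the controller twice to guarantee a cc.en 1 -> 0 transition. The fragment below
sketches the CC.EN / CSTS.RDY handshake that a single reset performs, again with
hypothetical register stand-ins rather than the driver's own code; against the
dummy struct in main() the enable half simply times out, whereas real hardware
flips CSTS.RDY.

#include <stdint.h>
#include <stdio.h>

struct regs_sketch {
	volatile uint32_t cc;		/* Controller Configuration */
	volatile uint32_t csts;		/* Controller Status */
};

#define	CC_EN		0x1u		/* CC.EN: enable bit */
#define	CSTS_RDY	0x1u		/* CSTS.RDY: follows CC.EN */

/* Poll (bounded) until CSTS.RDY equals the desired value. */
static int
wait_ready_sketch(struct regs_sketch *c, uint32_t want)
{
	for (int i = 0; i < 5000; i++)
		if ((c->csts & CSTS_RDY) == want)
			return (0);
	return (-1);			/* timed out */
}

/*
 * One reset: clear CC.EN and wait for RDY to drop, then set CC.EN and wait
 * for RDY to rise.  Doing this twice on attach/resume guarantees the
 * controller observes a 1 -> 0 transition even if the boot environment left
 * it enabled in an unknown state.
 */
static int
hw_reset_sketch(struct regs_sketch *c)
{
	c->cc &= ~CC_EN;
	if (wait_ready_sketch(c, 0) != 0)
		return (-1);
	c->cc |= CC_EN;
	return (wait_ready_sketch(c, CSTS_RDY));
}

int
main(void)
{
	struct regs_sketch c = { .cc = CC_EN, .csts = 0 };

	printf("reset sketch: %d\n", hw_reset_sketch(&c));	/* -1 here: no hardware */
	return (0);
}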