From nobody Sun Sep 24 13:21:41 2023
X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Rtmqn3g57z4tyP4;
	Sun, 24 Sep 2023 13:21:41 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4Rtmqn38h0z4YDS;
	Sun, 24 Sep 2023 13:21:41 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim;
	t=1695561701;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=RCTB6GFbTgqSFy6oBeseJ89Dn+N7ifDCrCEpp6GUAa4=;
	b=BM9WMwZdPzmdG0maLpUW8CBM3NnnCMoiHr1bLzpNqh6py2PmGpPkpZ7zziwqnhuphqf/g6
	tFBcr6Jl7gd2dgBZzecoMqOWh4JWJf/Pgub3QtXH9c+9nn1+EXTc+Ah+jwq+I6M9zJPRaj
	jdEbcZpOyq6Ek4iErWPSRB5h4E+ZoRPgNZqTZFUU4n4aH+/Fk94/q9FORuDcx2iTs1pFLu
	esZTcQOBtEuJuHlAYVgFHqne7kZrq/ADTJEnQQFYwvOjS7nRtD+wG19Td1DdSjTEg/QvHR
	AfYQV6eNwZ5qV/4nH/At1Nh9nX2iWrNgyrA8/s6+TeUJl8UM/e5dOhOMTHSm1A==
ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1695561701; a=rsa-sha256; cv=none;
	b=VXwzO1dzTD9ePJRAS/sxCQHnklksp6FbnmgAmCbgtBfBnjdk3Hw+Gh0KfXo6Oqx4s4SDor
	BQx05c6CuVru91RV5OdQKmnBqCCWh0j9ot7dqEOFCmJ4zwsq/6I88WMjz++rVM1eDf7pY+
	om1MjYFZW8H/9bLXDVJX/fhqi8lci9GZLfmhsGGcbRfnBX6S4bMXMjF9F7SH8GdF4iZlo4
	YIwLqM+ClKJGIzJL0PrTuZXdktwOUUpvILnw/jA78DiF+jRGrzjqOVWQ00wKYlX8zdtNnh
	1ktCCGM2j483JbfO8gqzUPt7f9vz+F1+mUuliQc7ibzzSpNuHqzgUezifCFXBQ==
ARC-Authentication-Results: i=1;
	mx1.freebsd.org;
	none
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org;
	s=dkim; t=1695561701;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=RCTB6GFbTgqSFy6oBeseJ89Dn+N7ifDCrCEpp6GUAa4=;
	b=H2wDD0L6YS4j8EFDOuSEIPC4ZIP6KjrECX9wFCpitQag3bQNSKHqNn09uHYoYZIviN0B7b
	b17gistgr/fSvuKLx7MyuswnMHnFPZZySS/Z1CTUL4DL/fQrNxHZPhnYXQvfdKk/DUsfe2
	mjpEnU9Sp2rfb3iEly3LZQxjlO1PssvEAo68XILeY3ONgO96Bbc6aqIn3gffpLk07itEH9
	HiyJBSH0mNzUpNwXjZfHYBYj9T2PMeN2+PpwMtt/T5ith75FhXWhKLrK1buYKrC0VhYVEE
	4vWQzpxbYpoGHFGQIQhjnKaxzCuz6UAZ3bCTSLlJOWXy+VJWb6y7Z60n4zXy/g==
Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4Rtmqn2DKvznSV;
	Sun, 24 Sep 2023 13:21:41 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
Received: from gitrepo.freebsd.org ([127.0.1.44])
	by gitrepo.freebsd.org (8.17.1/8.17.1) with ESMTP id 38ODLfIJ090509;
	Sun, 24 Sep 2023 13:21:41 GMT
	(envelope-from git@gitrepo.freebsd.org)
Received: (from git@localhost)
	by gitrepo.freebsd.org (8.17.1/8.17.1/Submit) id 38ODLf43090506;
	Sun, 24 Sep 2023 13:21:41 GMT
	(envelope-from git)
Date: Sun, 24 Sep 2023 13:21:41 GMT
Message-Id: <202309241321.38ODLf43090506@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
        dev-commits-src-main@FreeBSD.org
From: Warner Losh <imp@FreeBSD.org>
Subject: git: da8324a9258f - main - nvme: Fix locking protocol
  violation to fix suspend / resume
List-Id: Commit messages for all branches of the src repository <dev-commits-src-all.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
List-Help: <mailto:dev-commits-src-all+help@freebsd.org>
List-Post: <mailto:dev-commits-src-all@freebsd.org>
List-Subscribe: <mailto:dev-commits-src-all+subscribe@freebsd.org>
List-Unsubscribe: <mailto:dev-commits-src-all+unsubscribe@freebsd.org>
Sender: owner-dev-commits-src-all@freebsd.org
X-BeenThere: dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: imp
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: da8324a9258f1791cd10423103c1746646e33104
Auto-Submitted: auto-generated

The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=da8324a9258f1791cd10423103c1746646e33104

commit da8324a9258f1791cd10423103c1746646e33104
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2023-09-24 12:57:07 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2023-09-24 13:17:18 +0000

    nvme: Fix locking protocol violation to fix suspend / resume
    
    Currently, when we suspend, we need to tear down all the qpairs. We call
    nvme_admin_qpair_abort_aers with the admin qpair lock held, but the
    tracker it will call for the pending AER also locks it (recursively)
    hitting an assert. This routine is called without the qpair lock held
    when we destroy the device entirely in a number of places. Add an assert
    to this effect and drop the qpair lock before calling it.
    nvme_admin_qpair_abort_aers then locks the qpair lock to traverse the
    list, dropping it around calls to nvme_qpair_complete_tracker, and
    restarting the list scan after picking it back up.
    
    Note: If interrupts are still running, there's a tiny window for these
    AERs: If one fires just an instant after we manually complete it, then
    we'll be fine: we set the state of the queue to 'waiting' and we ignore
    interrupts while 'waiting'. We know we'll destroy all the queue state
    with these pending interrupts before looking at them again and we know
    all the TRs will have been completed or rescheduled. So either way we're
    covered.
    
    Also, tidy up the failure case as well: failing a queue is a superset of
    disabling it, so no need to call disable first. This solves solves some
    locking issues with recursion since we don't need to recurse.. Set the
    qpair state of failed queues to RECOVERY_FAILED and stop scheduling the
    watchdog. Assert we're not failed when we're enabling a qpair, since
    failure currently is one-way. Make failure a little less verbose.
    
    Next, kill the pre/post reset stuff. It's completely bogus since we
    disable the qparis, we don't need to also hold the lock through the
    reset: disabling will cause the ISR to return early. This keeps us from
    recursing on the recovery lock when resuming. We only need the recovery
    lock to avoid a specific race between the timer and the ISR.
    
    Finally, kill NVME_RESET_2X. It'S been a major release since we put it
    in and nobody has used it as far as I can tell. And it was a motivator
    for the pre/post uglification.
    
    These are all interrelated, so need to be done at the same time.
    
    Sponsored by:           Netflix
    Reviewed by:            jhb
    Tested by:              jhb (made sure suspend / resume worked)
    MFC After:              3 days
    Differential Revision:  https://reviews.freebsd.org/D41866
---
 sys/dev/nvme/nvme_ctrlr.c   | 49 +++++--------------------------------------
 sys/dev/nvme/nvme_private.h |  1 +
 sys/dev/nvme/nvme_qpair.c   | 51 ++++++++++++++++++++++++++++++++-------------
 3 files changed, 42 insertions(+), 59 deletions(-)

diff --git a/sys/dev/nvme/nvme_ctrlr.c b/sys/dev/nvme/nvme_ctrlr.c
index c4a750901743..30a5ee81b2a4 100644
--- a/sys/dev/nvme/nvme_ctrlr.c
+++ b/sys/dev/nvme/nvme_ctrlr.c
@@ -217,12 +217,15 @@ nvme_ctrlr_fail(struct nvme_controller *ctrlr)
 {
 	int i;
 
+	/*
+	 * No need to disable queues before failing them. Failing is a superet
+	 * of disabling (though pedantically we'd abort the AERs silently with
+	 * a different error, though when we fail, that hardly matters).
+	 */
 	ctrlr->is_failed = true;
-	nvme_admin_qpair_disable(&ctrlr->adminq);
 	nvme_qpair_fail(&ctrlr->adminq);
 	if (ctrlr->ioq != NULL) {
 		for (i = 0; i < ctrlr->num_io_queues; i++) {
-			nvme_io_qpair_disable(&ctrlr->ioq[i]);
 			nvme_qpair_fail(&ctrlr->ioq[i]);
 		}
 	}
@@ -416,34 +419,6 @@ nvme_ctrlr_disable_qpairs(struct nvme_controller *ctrlr)
 	}
 }
 
-static void
-nvme_pre_reset(struct nvme_controller *ctrlr)
-{
-	/*
-	 * Make sure that all the ISRs are done before proceeding with the reset
-	 * (and also keep any stray interrupts that happen during this process
-	 * from racing this process). For startup, this is a nop, since the
-	 * hardware is in a good state. But for recovery, where we randomly
-	 * reset the hardware, this ensure that we're not racing the ISRs.
-	 */
-	mtx_lock(&ctrlr->adminq.recovery);
-	for (int i = 0; i < ctrlr->num_io_queues; i++) {
-		mtx_lock(&ctrlr->ioq[i].recovery);
-	}
-}
-
-static void
-nvme_post_reset(struct nvme_controller *ctrlr)
-{
-	/*
-	 * Reset complete, unblock ISRs
-	 */
-	mtx_unlock(&ctrlr->adminq.recovery);
-	for (int i = 0; i < ctrlr->num_io_queues; i++) {
-		mtx_unlock(&ctrlr->ioq[i].recovery);
-	}
-}
-
 static int
 nvme_ctrlr_hw_reset(struct nvme_controller *ctrlr)
 {
@@ -1236,9 +1211,7 @@ nvme_ctrlr_reset_task(void *arg, int pending)
 	int			status;
 
 	nvme_ctrlr_devctl_log(ctrlr, "RESET", "resetting controller");
-	nvme_pre_reset(ctrlr);
 	status = nvme_ctrlr_hw_reset(ctrlr);
-	nvme_post_reset(ctrlr);
 	if (status == 0)
 		nvme_ctrlr_start(ctrlr, true);
 	else
@@ -1725,19 +1698,8 @@ nvme_ctrlr_resume(struct nvme_controller *ctrlr)
 	if (ctrlr->is_failed)
 		return (0);
 
-	nvme_pre_reset(ctrlr);
 	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
 		goto fail;
-#ifdef NVME_2X_RESET
-	/*
-	 * Prior to FreeBSD 13.1, FreeBSD's nvme driver reset the hardware twice
-	 * to get it into a known good state. However, the hardware's state is
-	 * good and we don't need to do this for proper functioning.
-	 */
-	if (nvme_ctrlr_hw_reset(ctrlr) != 0)
-		goto fail;
-#endif
-	nvme_post_reset(ctrlr);
 
 	/*
 	 * Now that we've reset the hardware, we can restart the controller. Any
@@ -1754,7 +1716,6 @@ fail:
 	 * the controller. However, we have to return success for the resume
 	 * itself, due to questionable APIs.
 	 */
-	nvme_post_reset(ctrlr);
 	nvme_printf(ctrlr, "Failed to reset on resume, failing.\n");
 	nvme_ctrlr_fail(ctrlr);
 	(void)atomic_cmpset_32(&ctrlr->is_resetting, 1, 0);
diff --git a/sys/dev/nvme/nvme_private.h b/sys/dev/nvme/nvme_private.h
index a6239f30f3bf..496bd8229e0a 100644
--- a/sys/dev/nvme/nvme_private.h
+++ b/sys/dev/nvme/nvme_private.h
@@ -150,6 +150,7 @@ struct nvme_tracker {
 enum nvme_recovery {
 	RECOVERY_NONE = 0,		/* Normal operations */
 	RECOVERY_WAITING,		/* waiting for the reset to complete */
+	RECOVERY_FAILED,		/* We have failed, no more I/O */
 };
 struct nvme_qpair {
 	struct nvme_controller	*ctrlr;
diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index 4e37aa0e1020..9806096de81d 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -945,22 +945,38 @@ nvme_admin_qpair_abort_aers(struct nvme_qpair *qpair)
 {
 	struct nvme_tracker	*tr;
 
+	/*
+	 * nvme_complete_tracker must be called without the qpair lock held. It
+	 * takes the lock to adjust outstanding_tr list, so make sure we don't
+	 * have it yet (since this is a general purpose routine). We take the
+	 * lock to make the list traverse safe, but have to drop the lock to
+	 * complete any AER. We restart the list scan when we do this to make
+	 * this safe. There's interlock with the ISR so we know this tracker
+	 * won't be completed twice.
+	 */
+	mtx_assert(&qpair->lock, MA_NOTOWNED);
+
+	mtx_lock(&qpair->lock);
 	tr = TAILQ_FIRST(&qpair->outstanding_tr);
 	while (tr != NULL) {
 		if (tr->req->cmd.opc == NVME_OPC_ASYNC_EVENT_REQUEST) {
+			mtx_unlock(&qpair->lock);
 			nvme_qpair_manual_complete_tracker(tr,
 			    NVME_SCT_GENERIC, NVME_SC_ABORTED_SQ_DELETION, 0,
 			    ERROR_PRINT_NONE);
+			mtx_lock(&qpair->lock);
 			tr = TAILQ_FIRST(&qpair->outstanding_tr);
 		} else {
 			tr = TAILQ_NEXT(tr, tailq);
 		}
 	}
+	mtx_unlock(&qpair->lock);
 }
 
 void
 nvme_admin_qpair_destroy(struct nvme_qpair *qpair)
 {
+	mtx_assert(&qpair->lock, MA_NOTOWNED);
 
 	nvme_admin_qpair_abort_aers(qpair);
 	nvme_qpair_destroy(qpair);
@@ -1011,17 +1027,6 @@ nvme_qpair_timeout(void *arg)
 
 	mtx_assert(&qpair->recovery, MA_OWNED);
 
-	/*
-	 * If the controller has failed, give up. We're never going to change
-	 * state from a failed controller: no further transactions are possible.
-	 * We go ahead and let the timeout expire in many cases for simplicity.
-	 */
-	if (qpair->ctrlr->is_failed) {
-		nvme_printf(ctrlr, "Controller failed, giving up\n");
-		qpair->timer_armed = false;
-		return;
-	}
-
 	switch (qpair->recovery_state) {
 	case RECOVERY_NONE:
 		/*
@@ -1108,6 +1113,11 @@ nvme_qpair_timeout(void *arg)
 		nvme_printf(ctrlr, "Waiting for reset to complete\n");
 		idle = false;		/* We want to keep polling */
 		break;
+	case RECOVERY_FAILED:
+		KASSERT(qpair->ctrlr->is_failed,
+		    ("Recovery state failed w/o failed controller\n"));
+		idle = true;			/* nothing to monitor */
+		break;
 	}
 
 	/*
@@ -1297,6 +1307,8 @@ nvme_qpair_enable(struct nvme_qpair *qpair)
 		mtx_assert(&qpair->recovery, MA_OWNED);
 	if (mtx_initialized(&qpair->lock))
 		mtx_assert(&qpair->lock, MA_OWNED);
+	KASSERT(qpair->recovery_state != RECOVERY_FAILED,
+	    ("Enabling a failed qpair\n"));
 
 	qpair->recovery_state = RECOVERY_NONE;
 }
@@ -1421,12 +1433,13 @@ void
 nvme_admin_qpair_disable(struct nvme_qpair *qpair)
 {
 	mtx_lock(&qpair->recovery);
-	mtx_lock(&qpair->lock);
 
+	mtx_lock(&qpair->lock);
 	nvme_qpair_disable(qpair);
+	mtx_unlock(&qpair->lock);
+
 	nvme_admin_qpair_abort_aers(qpair);
 
-	mtx_unlock(&qpair->lock);
 	mtx_unlock(&qpair->recovery);
 }
 
@@ -1451,18 +1464,27 @@ nvme_qpair_fail(struct nvme_qpair *qpair)
 	if (!mtx_initialized(&qpair->lock))
 		return;
 
+	mtx_lock(&qpair->recovery);
+	qpair->recovery_state = RECOVERY_FAILED;
+	mtx_unlock(&qpair->recovery);
+
 	mtx_lock(&qpair->lock);
 
+	if (!STAILQ_EMPTY(&qpair->queued_req)) {
+		nvme_printf(qpair->ctrlr, "failing queued i/o\n");
+	}
 	while (!STAILQ_EMPTY(&qpair->queued_req)) {
 		req = STAILQ_FIRST(&qpair->queued_req);
 		STAILQ_REMOVE_HEAD(&qpair->queued_req, stailq);
-		nvme_printf(qpair->ctrlr, "failing queued i/o\n");
 		mtx_unlock(&qpair->lock);
 		nvme_qpair_manual_complete_request(qpair, req, NVME_SCT_GENERIC,
 		    NVME_SC_ABORTED_BY_REQUEST);
 		mtx_lock(&qpair->lock);
 	}
 
+	if (!TAILQ_EMPTY(&qpair->outstanding_tr)) {
+		nvme_printf(qpair->ctrlr, "failing outstanding i/o\n");
+	}
 	/* Manually abort each outstanding I/O. */
 	while (!TAILQ_EMPTY(&qpair->outstanding_tr)) {
 		tr = TAILQ_FIRST(&qpair->outstanding_tr);
@@ -1470,7 +1492,6 @@ nvme_qpair_fail(struct nvme_qpair *qpair)
 		 * Do not remove the tracker.  The abort_tracker path will
 		 *  do that for us.
 		 */
-		nvme_printf(qpair->ctrlr, "failing outstanding i/o\n");
 		mtx_unlock(&qpair->lock);
 		nvme_qpair_manual_complete_tracker(tr, NVME_SCT_GENERIC,
 		    NVME_SC_ABORTED_BY_REQUEST, DO_NOT_RETRY, ERROR_PRINT_ALL);