From nobody Thu Sep 28 22:28:08 2023
X-Original-To: dev-commits-src-all@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4RxSmT01LXz4vhQv;
	Thu, 28 Sep 2023 22:28:09 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org [IPv6:2610:1c1:1:606c::19:3])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "mxrelay.nyi.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4RxSmS6Rhdz3QH3;
	Thu, 28 Sep 2023 22:28:08 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim;
	t=1695940088;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=UxnU67fCnfQwkMQ3npcbGylxUoFsgTdAwNQpt9MhDKk=;
	b=jwsOJHrl83aCjei/Ce2uJqXI9cJ4bl7wjAYBuCERa3p6FdTAUHPd2VnoitEso1LqJdNBNd
	dpWtVO66hXbfg9j9UYTM6zV0bflae6qFiYRJffWOdUe/tMOzN6rst6njkswFcRM8ud+84i
	nvc6XIPSXIzbZ+J+WI07/xC/dUOObgXXCTZBKvHiaHxQFBofkKHjXoz7QOdx70fDiT5YEu
	tiKtFOI4acTIBo2GU5oc4DAyMx5T35Eh1DqSvJXEuZPBT2yEjcvkkhnjj5l5mtGao0DotY
	QgnhcbPXvAJfGG4heTQbgVSDaSE6dzK1LVxR6sFdwXgvrhftr8EAbC+WopP/Yg==
ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1695940088; a=rsa-sha256; cv=none;
	b=To0GuCuHF2wf7ES93H89XbZrP1a3/UiimhliqWjnprdcP93QMYDwQHAYE4x8CrDTof69Rf
	1yicyPRJpCDkxQfE59LIyoPqEYTu5rjnE9afU6skadU74TpqSWBTZTLFHhlPP3D0pxT3o4
	diMUsgv8DqKKmZxiB4joqDS2l1euLrh9OCPmpL/BgssLdBwIo6mQnfCHv2kRjHCTlk/eAN
	YcLZB5ajBbm5kE4ipzjalzbnRyTqyiATQCl7gfkPKTdk9xOnAWawGSlEU3L+XiEqpwvAX6
	GB2MLZN1BJLnaKEI2+jzyJuXYa0p1uRc5nzE795yW7ugPXzcHv/nj+Ra2zmBqQ==
ARC-Authentication-Results: i=1;
	mx1.freebsd.org;
	none
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org;
	s=dkim; t=1695940088;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding;
	bh=UxnU67fCnfQwkMQ3npcbGylxUoFsgTdAwNQpt9MhDKk=;
	b=lBC5INCxuS1VWGzp7EqNuayrJkPuWs9JwX3kQ9+1VxO+zBoIq1cU9TdQPflMr8iZshmIXm
	Gx7pSKtODtTPy3UmPxbDahlwQV7fNQbykFr9dBIg0/fX8u/WMcMKhU4ezRpkBrkDO+OS4k
	4BFszoIb9SEUBsfz2Hn7dcPcZ/+50dEp8AEWFEqqUOdVZ3ZdOTdOgx8we0uc9Bb21JN00S
	0RRHPUNdYB76TOEuOnooli7sLOkdjhKCtop4OqtuipLMm+1xQ7Esod4V5wA0BiUoo9R7f5
	G3jDPrXUFWs9pjvNwDV9U6d38mnn8QK/v5M8+ozFFo4qc2iB2+EC/8vnH0NL0A==
Received: from gitrepo.freebsd.org (gitrepo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:5])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 4RxSmS5VqmzgF0;
	Thu, 28 Sep 2023 22:28:08 +0000 (UTC)
	(envelope-from git@FreeBSD.org)
Received: from gitrepo.freebsd.org ([127.0.1.44])
	by gitrepo.freebsd.org (8.17.1/8.17.1) with ESMTP id 38SMS8lQ073653;
	Thu, 28 Sep 2023 22:28:08 GMT
	(envelope-from git@gitrepo.freebsd.org)
Received: (from git@localhost)
	by gitrepo.freebsd.org (8.17.1/8.17.1/Submit) id 38SMS8ZZ073650;
	Thu, 28 Sep 2023 22:28:08 GMT
	(envelope-from git)
Date: Thu, 28 Sep 2023 22:28:08 GMT
Message-Id: <202309282228.38SMS8ZZ073650@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
        dev-commits-src-branches@FreeBSD.org
From: Warner Losh <imp@FreeBSD.org>
Subject: git: 1a25315863e5 - releng/14.0 - nvme: Greatly improve
  error recovery
List-Id: Commit messages for all branches of the src repository <dev-commits-src-all.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
List-Help: <mailto:dev-commits-src-all+help@freebsd.org>
List-Post: <mailto:dev-commits-src-all@freebsd.org>
List-Subscribe: <mailto:dev-commits-src-all+subscribe@freebsd.org>
List-Unsubscribe: <mailto:dev-commits-src-all+unsubscribe@freebsd.org>
Sender: owner-dev-commits-src-all@freebsd.org
X-BeenThere: dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: imp
X-Git-Repository: src
X-Git-Refname: refs/heads/releng/14.0
X-Git-Reftype: branch
X-Git-Commit: 1a25315863e5faffd447cd59fb9e54f5d2b5f6ae
Auto-Submitted: auto-generated

The branch releng/14.0 has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=1a25315863e5faffd447cd59fb9e54f5d2b5f6ae

commit 1a25315863e5faffd447cd59fb9e54f5d2b5f6ae
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2023-09-28 20:45:50 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2023-09-28 22:26:04 +0000

    nvme: Greatly improve error recovery
    
    Next phase of error recovery: Eliminate the REOVERY_START phase, since
    we don't need to wait to start recovery. Eliminate the RECOVERY_RESET
    phase since it is transient, we now transition from RECOVERY_NORMAL into
    RECOVERY_WAITING.
    
    In normal mode, read the status of the controller. If it is in failed
    state, or appears to be hot-plugged, jump directly to reset which will
    sort out the proper things to do. This will cause all pending I/O to
    complete with an abort status before the reset.
    
    When in the NORMAL state, call the interrupt handler. This will complete
    all pending transactions when interrupts are broken or temporarily
    misbehaving. We then check all the pending completions for timeouts. If
    we have abort enabled, then we'll send an abort. Otherwise we'll assume
    the controller is wedged and needs a reset. By calling the interrupt
    handler here, we'll avoid an issue with the current code where we
    transitioned to RECOVERY_START which prevented any completions from
    happening. Now completions happen. In addition and follow-on I/O that is
    scheduled in the completion routines will be submitted, rather than
    queued, because the recovery state is correct. This also fixes a problem
    where I/O would timeout, but never complete, leading to hung I/O.
    
    Resetting remains the same as before, just when we chose to reset has
    changed.
    
    A nice side effect of these changes is that we now do I/O when
    interrupts to the card are totally broken. Followon commits will improve
    the error reporting and logging when this happens. Performance will be
    aweful, but will at least be minimally functional.
    
    There is a small race when we're checking the completions if interrupts
    are working, but this is handled in a future commit.
    
    Sponsored by:           Netflix
    MFC After:              2 weeks
    Differential Revision:  https://reviews.freebsd.org/D36922
    
    (cherry picked from commit d4959bfcd110ea471222c7dd87775ba1f4e3d1d9)
    (cherry picked from commit 5d627e0669c5f047580b6f71d9f042afc68307d0)
    
    Approved-by: re (cperciva)
---
 sys/dev/nvme/nvme_private.h |   2 -
 sys/dev/nvme/nvme_qpair.c   | 114 +++++++++++++++++++++++++-------------------
 2 files changed, 66 insertions(+), 50 deletions(-)

diff --git a/sys/dev/nvme/nvme_private.h b/sys/dev/nvme/nvme_private.h
index dc69012cfd71..e4b319b9d8b7 100644
--- a/sys/dev/nvme/nvme_private.h
+++ b/sys/dev/nvme/nvme_private.h
@@ -149,8 +149,6 @@ struct nvme_tracker {
 
 enum nvme_recovery {
 	RECOVERY_NONE = 0,		/* Normal operations */
-	RECOVERY_START,			/* Deadline has passed, start recovering */
-	RECOVERY_RESET,			/* This pass, initiate reset of controller */
 	RECOVERY_WAITING,		/* waiting for the reset to complete */
 };
 struct nvme_qpair {
diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index 0ad0b7cbe17f..6d34c7ddba2d 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -976,88 +976,106 @@ nvme_qpair_timeout(void *arg)
 	struct nvme_tracker	*tr;
 	sbintime_t		now;
 	bool			idle;
-	bool			expired;
+	bool			needs_reset;
 	uint32_t		csts;
 	uint8_t			cfs;
 
+
 	mtx_lock(&qpair->lock);
 	idle = TAILQ_EMPTY(&qpair->outstanding_tr);
 
-again:
 	switch (qpair->recovery_state) {
 	case RECOVERY_NONE:
+		/*
+		 * Read csts to get value of cfs - controller fatal status.  If
+		 * we are in the hot-plug or controller failed status proceed
+		 * directly to reset. We also bail early if the status reads all
+		 * 1's or the control fatal status bit is now 1. The latter is
+		 * always true when the former is true, but not vice versa.  The
+		 * intent of the code is that if the card is gone (all 1's) or
+		 * we've failed, then try to do a reset (which someitmes
+		 * unwedges a card reading all 1's that's not gone away, but
+		 * usually doesn't).
+		 */
+		csts = nvme_mmio_read_4(ctrlr, csts);
+		cfs = (csts >> NVME_CSTS_REG_CFS_SHIFT) & NVME_CSTS_REG_CFS_MASK;
+		if (csts == NVME_GONE || cfs == 1)
+			goto do_reset;
+
+		/*
+		 * Next, check to see if we have any completions. If we do,
+		 * we've likely missed an interrupt, but the card is otherwise
+		 * fine. This will also catch all the commands that are about
+		 * to timeout (but there's still a tiny race). Since the timeout
+		 * is long relative to the race between here and the check below,
+		 * this is still a win.
+		 */
+		mtx_unlock(&qpair->lock);
+		nvme_qpair_process_completions(qpair);
+		mtx_lock(&qpair->lock);
+		if (qpair->recovery_state != RECOVERY_NONE) {
+			/*
+			 * Somebody else adjusted recovery state while unlocked,
+			 * we should bail. Unlock the qpair and return without
+			 * doing anything else.
+			 */
+			mtx_unlock(&qpair->lock);
+			return;
+		}
+
 		/*
 		 * Check to see if we need to timeout any commands. If we do, then
 		 * we also enter a recovery phase.
 		 */
 		now = getsbinuptime();
-		expired = false;
+		needs_reset = false;
 		TAILQ_FOREACH(tr, &qpair->outstanding_tr, tailq) {
 			if (tr->deadline == SBT_MAX)
 				continue;
-			idle = false;
 			if (now > tr->deadline) {
-				expired = true;
-				nvme_ctrlr_cmd_abort(ctrlr, tr->cid, qpair->id,
-				    nvme_abort_complete, tr);
+				if (tr->req->cb_fn != nvme_abort_complete &&
+				    ctrlr->enable_aborts) {
+					/*
+					 * This isn't an abort command, ask
+					 * for a hardware abort.
+					 */
+					nvme_ctrlr_cmd_abort(ctrlr, tr->cid,
+					    qpair->id, nvme_abort_complete, tr);
+				} else {
+					/*
+					 * Otherwise we have a live command in
+					 * the card (either one we couldn't
+					 * abort, or aborts weren't enabled).
+					 * The only safe way to proceed is to do
+					 * a reset.
+					 */
+					needs_reset = true;
+				}
+			} else {
+				idle = false;
 			}
 		}
-		if (!expired)
+		if (!needs_reset)
 			break;
 
 		/*
-		 * We're now passed our earliest deadline. We need to do
-		 * expensive things to cope, but next time. Flag that
-		 * and close the door to any further processing.
-		 */
-		qpair->recovery_state = RECOVERY_START;
-		nvme_printf(ctrlr, "RECOVERY_START %jd vs %jd\n",
-		    (uintmax_t)now, (uintmax_t)tr->deadline);
-		/* FALLTHROUGH */
-	case RECOVERY_START:
-		/*
-		 * Read csts to get value of cfs - controller fatal status.
-		 * If no fatal status, try to call the completion routine, and
-		 * if completes transactions, report a missed interrupt and
-		 * return (this may need to be rate limited). Otherwise, if
-		 * aborts are enabled and the controller is not reporting
-		 * fatal status, abort the command. Otherwise, just reset the
-		 * controller and hope for the best.
-		 */
-		csts = nvme_mmio_read_4(ctrlr, csts);
-		cfs = (csts >> NVME_CSTS_REG_CFS_SHIFT) & NVME_CSTS_REG_CFS_MASK;
-		if (cfs) {
-			nvme_printf(ctrlr, "Controller in fatal status, resetting\n");
-			qpair->recovery_state = RECOVERY_RESET;
-			goto again;
-		}
-		mtx_unlock(&qpair->lock);
-		if (nvme_qpair_process_completions(qpair)) {
-			nvme_printf(ctrlr, "Completions present in output without an interrupt\n");
-			qpair->recovery_state = RECOVERY_NONE;
-		} else {
-			nvme_printf(ctrlr, "timeout with nothing complete, resetting\n");
-			qpair->recovery_state = RECOVERY_RESET;
-			mtx_lock(&qpair->lock);
-			goto again;
-		}
-		mtx_lock(&qpair->lock);
-		break;
-	case RECOVERY_RESET:
-		/*
+		 * We've had a command timeout that we weren't able to abort
+		 *
 		 * If we get here due to a possible surprise hot-unplug event,
 		 * then we let nvme_ctrlr_reset confirm and fail the
 		 * controller.
 		 */
+	do_reset:
 		nvme_printf(ctrlr, "Resetting controller due to a timeout%s.\n",
 		    (csts == 0xffffffff) ? " and possible hot unplug" :
 		    (cfs ? " and fatal error status" : ""));
 		nvme_printf(ctrlr, "RECOVERY_WAITING\n");
 		qpair->recovery_state = RECOVERY_WAITING;
 		nvme_ctrlr_reset(ctrlr);
+		idle = false;			/* We want to keep polling */
 		break;
 	case RECOVERY_WAITING:
-		nvme_printf(ctrlr, "waiting\n");
+		nvme_printf(ctrlr, "waiting for reset to complete\n");
 		break;
 	}