Date:      Sun, 1 May 2022 17:10:56 GMT
From:      Warner Losh <imp@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: 6c8ab086fed3 - main - ada: Retry commands with retries left on CAM_SEL_TIMEOUT
Message-ID:  <202205011710.241HAuaL042980@gitrepo.freebsd.org>

The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=6c8ab086fed37a6b44fa84377e48c499f223ae80

commit 6c8ab086fed37a6b44fa84377e48c499f223ae80
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2022-05-01 16:39:04 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2022-05-01 17:08:56 +0000

    ada: Retry commands with retries left on CAM_SEL_TIMEOUT
    
    The AHCI and ATA SIMs will return CAM_SEL_TIMEOUT when an underlying
    device has stopped responding. This is usually seen after a timed-out
    command and can be a transient event. Rather than fail the
    peripheral immediately after seeing this, queue a retry. For transient
    events, this allows drives to continue to provide data, though with some
    added latency, just like we do when we have some other kind of retriable
    error. If the error isn't transient (the drive is truly gone), then
    we'll discover that eventually and fail the transaction and invalidate
    the drive like we do today.
    
    This helps us avoid a panic at the end of camperiphfree when
    CAM_PERIPH_NEW_DEV_FOUND is set. However, the deferred callback should
    be queued to xpt_async_td instead of being made inline there. That
    issue will be fixed in a separate patch. PR 263703.
    
    This also helps us avoid another bug where we can drop all references to
    the device (causing us to go through camperiphfree and destroy the path)
    while we have an I/O pending in the ata_da state machine (usually in
    state ADA_STATE_RAHEAD with an ATA_SETFEATURES ATA_SF_ENAB_RCACHE
    command). It's not clear why the reference that we take out to do the
    reprobe isn't effective at blocking this. Retrying on this condition,
    though, avoids the bug, or at least makes it rarer. I don't have a good
    reproduction test case; I just see this panic a few times a month at
    work on systems that have transient disk errors on AHCI-connected SATA
    SSDs. PR 263704. It's too soon to know how much this helps us avoid
    this bug.
    
    Sponsored by:           Netflix
    Differential Revision:  https://reviews.freebsd.org/D34977
---
 sys/cam/ata/ata_da.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sys/cam/ata/ata_da.c b/sys/cam/ata/ata_da.c
index b82671315138..b76058c8f19d 100644
--- a/sys/cam/ata/ata_da.c
+++ b/sys/cam/ata/ata_da.c
@@ -2872,7 +2872,7 @@ adadone(struct cam_periph *periph, union ccb *done_ccb)
 		cam_periph_lock(periph);
 		bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
 		if ((done_ccb->ccb_h.status & CAM_STATUS_MASK) != CAM_REQ_CMP) {
-			error = adaerror(done_ccb, 0, 0);
+			error = adaerror(done_ccb, CAM_RETRY_SELTO, 0);
 			if (error == ERESTART) {
 				/* A retry was scheduled, so just return. */
 				cam_periph_unlock(periph);
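
The retry policy described in the commit message can be modeled in a few
lines of userland C. This is only an illustrative sketch, not the kernel
code: every sketch_* and SKETCH_* name below is hypothetical, and the real
CAM error-recovery path is more involved. It assumes that a selection
timeout is retried only when the caller passes a retry flag and the request
still has retries left, and that it otherwise fails with ENXIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* ERESTART is kernel-internal on some systems; define it for userland. */
#ifndef ERESTART
#define ERESTART (-1)
#endif

/* Hypothetical stand-in for the CAM_RETRY_SELTO flag. */
#define SKETCH_RETRY_SELTO 0x01u

/* Hypothetical completion statuses (assumption, not the real CAM enum). */
enum sketch_status { SKETCH_REQ_CMP, SKETCH_SEL_TIMEOUT };

/* Hypothetical, heavily reduced request descriptor. */
struct sketch_ccb {
	enum sketch_status status;
	int retry_count;	/* retries still available */
};

/*
 * Map a completion status to an errno value. Returns 0 on success,
 * ERESTART when a retry was scheduled, and ENXIO when the device is
 * treated as gone.
 */
int
sketch_error(struct sketch_ccb *ccb, uint32_t flags)
{
	switch (ccb->status) {
	case SKETCH_REQ_CMP:
		return (0);
	case SKETCH_SEL_TIMEOUT:
		if ((flags & SKETCH_RETRY_SELTO) != 0 &&
		    ccb->retry_count > 0) {
			ccb->retry_count--;
			return (ERESTART);
		}
		return (ENXIO);
	}
	return (EIO);
}

/*
 * Drive one request against a simulated device that answers with a
 * selection timeout for the first fail_count attempts and succeeds
 * afterwards. Stores the number of attempts made in *attempts and
 * returns the final errno value.
 */
int
sketch_submit(int fail_count, int retries, uint32_t flags, int *attempts)
{
	struct sketch_ccb ccb = { SKETCH_REQ_CMP, retries };
	int error;

	assert(attempts != NULL);
	*attempts = 0;
	for (;;) {
		(*attempts)++;
		ccb.status = (*attempts <= fail_count) ?
		    SKETCH_SEL_TIMEOUT : SKETCH_REQ_CMP;
		error = sketch_error(&ccb, flags);
		if (error != ERESTART)
			return (error);
	}
}
```

With flags set to 0 (the behavior before this change, in this model) the
first selection timeout fails the request. With SKETCH_RETRY_SELTO a
transient timeout costs extra attempts but the request completes, and a
device that never recovers still fails once the retries run out.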