Date:      Sat, 8 Feb 2025 21:43:11 GMT
From:      Warner Losh <imp@FreeBSD.org>
To:        src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject:   git: f8de2be7d920 - main - cam/da: Call cam_periph_invalidate on ENXIO in dadone
Message-ID:  <202502082143.518LhBoU097572@gitrepo.freebsd.org>

The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=f8de2be7d920d4e8d9a60804819282dc89f4881a

commit f8de2be7d920d4e8d9a60804819282dc89f4881a
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2025-02-08 21:31:14 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2025-02-08 21:31:14 +0000

    cam/da: Call cam_periph_invalidate on ENXIO in dadone
    
    Use cam_periph_invalidate() instead of just setting the PACK_INVALID
    flag in the da softc. It's a more appropriate and bigger hammer for this
    case. PACK_INVALID is set as part of that, so remove the now-redundant
    setting. This also has the side effect of short-circuiting errors for
    other I/O still in the drive which is just about to fail (sometimes with
    different error codes than what triggered this ENXIO).
    
    The prior practice of just setting the PACK_INVALID flag, however, was
    too ephemeral to be effective. Since daopen would clear PACK_INVALID
    after a successful open, we'd have to rediscover the error (which takes
    tens of seconds) for every different geom tasting the drive. These two
    factors led to a watchdog timeout before we could get through all the
    devices if we had multiple failed drives with this syndrome. By
    invalidating the periph, we fail fast enough for the reboot to get far
    enough to start petting the watchdog. If we disable the watchdog, the
    tasting eventually completes, but takes over an hour, which is too
    long. With this change, it takes an extra minute per failed drive,
    which is tolerable.
    
    When the PACK_INVALID flag is already set, just flush remaining I/Os
    with ENXIO. This bit will be set either when we've called
    cam_periph_invalidate() before (so we're just waiting for the I/Os to
    complete) or, more typically, when we've seen an ASC 0x3a, which is the
    catch-all for 'drive is otherwise OK, we're just missing the media to
    get data from'. In the latter case, we do not want to invalidate the
    periph since we allow recovery from this with a trip through daopen().
    
    While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for
    failing the entire drive, I've opted not to use that. That flag also
    causes all attached drivers, like pass, to detach, which is
    undesirable. By not adding that flag, but just invalidating the da
    periph driver, we prevent I/Os, but still allow collection of logs from
    the device.
    
    We can also simplify the logic w/o bloating the change, so do that too.
    
    Finally, this has been tested on all the removable/non-removable disks
    I could find, cd players, combo cd/da memory sticks, etc. I've removed
    the media while doing I/O on several of them. With these changes, we
    handle things correctly in all the cases I tested (except partially
    inserted media, which fails chaotically the same as before). The number
    of devices out there is, however, huge.
    
    mav@ raised concerns about what happens when we have asc/ascq 28/0. I
    see that on boot for one of my cards (that's not autoquirked) and, as
    predicted in the review, we retry that transaction and we get proper
    behavior. To be fair, though, I only ever saw it at startup, where it
    was a transient. I couldn't get some of my energy saving disks to ever throw
    that ASC/ASCQ, even after they spun down, so I've not tested that case.
    
    Sponsored by:           Netflix
    Discussed with:         mav@
    Differential Revision:  https://reviews.freebsd.org/D48689
---
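
The tasting problem described in the commit message can be illustrated
with a small user-space model. This is only a sketch, not kernel code:
every name below (model_drive, model_open, model_read, model_taste) is
hypothetical, and it assumes that an open of an invalidated periph fails
immediately while an open that succeeds clears the pack-invalid flag. It
counts how many times a failed drive has to go through the slow error
path when several geom classes taste it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one failed drive; not a kernel structure. */
struct model_drive {
	bool pack_invalid;	/* stands in for DA_FLAG_PACK_INVALID */
	bool periph_invalid;	/* stands in for an invalidated periph */
	int slow_probes;	/* reads that ran the full, slow error path */
};

/*
 * Models daopen(): assumed to fail at once on an invalid periph,
 * otherwise the open succeeds and the pack is marked valid again.
 */
bool
model_open(struct model_drive *d)
{
	if (d->periph_invalid)
		return (false);
	d->pack_invalid = false;
	return (true);
}

/* Models one read on the failed drive that completes with ENXIO. */
void
model_read(struct model_drive *d, bool invalidate_periph)
{
	d->slow_probes++;
	d->pack_invalid = true;
	if (invalidate_periph)
		d->periph_invalid = true;
}

/* Each taster opens the disk and, if the open works, issues one read. */
int
model_taste(int ntasters, bool invalidate_periph)
{
	struct model_drive d = { false, false, 0 };

	for (int i = 0; i < ntasters; i++)
		if (model_open(&d))
			model_read(&d, invalidate_periph);
	return (d.slow_probes);
}

/* Self-check: flag-only behavior pays per taster, invalidation pays once. */
void
model_taste_selfcheck(void)
{
	assert(model_taste(5, false) == 5);
	assert(model_taste(5, true) == 1);
}
```

In this model the flag-only behavior pays the slow error path once per
taster, while invalidating the periph pays it once per drive. The taster
count of 5 is an arbitrary illustration, not a measurement.
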
 sys/cam/scsi/scsi_da.c | 59 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c
index 44dc21d1bc2f..1fd6d4919c61 100644
--- a/sys/cam/scsi/scsi_da.c
+++ b/sys/cam/scsi/scsi_da.c
@@ -1805,7 +1805,10 @@ daopen(struct disk *dp)
 
 	/*
 	 * Only 'validate' the pack if the media size is non-zero and the
-	 * underlying peripheral isn't invalid (the only error != 0 path).
+	 * underlying peripheral isn't invalid (the only error != 0 path).  Once
+	 * the periph is marked invalid, we only get here on lost races with its
+	 * teardown, so keeping the pack invalid also keeps more I/O from
+	 * starting.
 	 */
 	if (error == 0 && softc->params.sectors != 0)
 		softc->flags &= ~DA_FLAG_PACK_INVALID;
@@ -4609,33 +4612,45 @@ dadone(struct cam_periph *periph, union ccb *done_ccb)
 		 */
 		bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
 		if (error != 0) {
-			int queued_error;
+			bool pack_invalid =
+			    (softc->flags & DA_FLAG_PACK_INVALID) != 0;
 
-			/*
-			 * return all queued I/O with EIO, so that
-			 * the client can retry these I/Os in the
-			 * proper order should it attempt to recover.
-			 */
-			queued_error = EIO;
-
-			if (error == ENXIO
-			 && (softc->flags & DA_FLAG_PACK_INVALID)== 0) {
+			if (error == ENXIO && !pack_invalid) {
 				/*
-				 * Catastrophic error.  Mark our pack as
-				 * invalid.
+				 * ENXIO flags ASC/ASCQ codes for either media
+				 * missing, or the drive being extremely
+				 * unhealthy.  Invalidate peripheral on this
+				 * catastrophic error when the pack is valid
+				 * since we set the pack invalid bit only for
+				 * the few ASC/ASCQ codes indicating missing
+				 * media.  The invalidation will flush any
+				 * queued I/O and short-circuit retries for
+				 * other I/O. We only invalidate the da device
+				 * so the passX device remains for recovery and
+				 * diagnostics.
 				 *
-				 * XXX See if this is really a media
-				 * XXX change first?
+				 * While we do also set the pack invalid bit
+				 * after invalidating the peripheral, the
+				 * pending I/O will have been flushed then with
+				 * no new I/O starting, so this 'edge' case
+				 * doesn't matter.
 				 */
 				xpt_print(periph->path, "Invalidating pack\n");
-				softc->flags |= DA_FLAG_PACK_INVALID;
-#ifdef CAM_IO_STATS
-				softc->invalidations++;
-#endif
-				queued_error = ENXIO;
+				cam_periph_invalidate(periph);
+			} else {
+				/*
+				 * Return all queued I/O with EIO, so that the
+				 * client can retry these I/Os in the proper
+				 * order should it attempt to recover. When the
+				 * pack is invalid, fail all I/O with ENXIO
+				 * since we can't assume when the media returns
+				 * it's the same media and we force a trip
+				 * through daclose / daopen and the client won't
+				 * retry.
+				 */
+				cam_iosched_flush(softc->cam_iosched, NULL,
+				    pack_invalid ? ENXIO : EIO);
 			}
-			cam_iosched_flush(softc->cam_iosched, NULL,
-			   queued_error);
 			if (bp != NULL) {
 				bp->bio_error = error;
 				bp->bio_resid = bp->bio_bcount;
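
The branch structure that the hunk above introduces in dadone() can be
summarized as a pure function. This is a simplified user-space sketch
for illustration only: the enum, the struct and da_error_decision() are
hypothetical names that do not exist in scsi_da.c, and the two actions
merely stand in for the calls to cam_periph_invalidate() and
cam_iosched_flush() shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two outcomes in the new code. */
enum da_err_action {
	DA_ACT_INVALIDATE_PERIPH,	/* models cam_periph_invalidate() */
	DA_ACT_FLUSH_QUEUE		/* models cam_iosched_flush() */
};

struct da_err_decision {
	enum da_err_action action;
	int flush_error;	/* only meaningful for DA_ACT_FLUSH_QUEUE */
};

/*
 * Models the choice made in dadone() when error != 0.
 * pack_invalid mirrors (softc->flags & DA_FLAG_PACK_INVALID) != 0.
 */
struct da_err_decision
da_error_decision(int error, bool pack_invalid)
{
	struct da_err_decision d = { DA_ACT_FLUSH_QUEUE, 0 };

	if (error == ENXIO && !pack_invalid) {
		/* First ENXIO on a valid pack: invalidate the periph. */
		d.action = DA_ACT_INVALIDATE_PERIPH;
	} else {
		/* Pack already invalid: ENXIO; otherwise EIO so clients can retry. */
		d.flush_error = pack_invalid ? ENXIO : EIO;
	}
	return (d);
}

/* Self-check covering the three cases described in the commit message. */
void
da_error_selfcheck(void)
{
	struct da_err_decision d;

	d = da_error_decision(ENXIO, false);
	assert(d.action == DA_ACT_INVALIDATE_PERIPH);

	d = da_error_decision(ENXIO, true);
	assert(d.action == DA_ACT_FLUSH_QUEUE && d.flush_error == ENXIO);

	d = da_error_decision(EIO, false);
	assert(d.action == DA_ACT_FLUSH_QUEUE && d.flush_error == EIO);
}
```

Calling da_error_selfcheck() from a test program exercises all three
paths: invalidate on the first ENXIO, flush with ENXIO once the pack is
already invalid, and flush with EIO for any other error.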


