Date: Sat, 8 Feb 2025 21:43:11 GMT
From: Warner Losh <imp@FreeBSD.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-main@FreeBSD.org
Subject: git: f8de2be7d920 - main - cam/da: Call cam_periph_invalidate on ENXIO in dadone
Message-ID: <202502082143.518LhBoU097572@gitrepo.freebsd.org>
The branch main has been updated by imp:

URL: https://cgit.FreeBSD.org/src/commit/?id=f8de2be7d920d4e8d9a60804819282dc89f4881a

commit f8de2be7d920d4e8d9a60804819282dc89f4881a
Author:     Warner Losh <imp@FreeBSD.org>
AuthorDate: 2025-02-08 21:31:14 +0000
Commit:     Warner Losh <imp@FreeBSD.org>
CommitDate: 2025-02-08 21:31:14 +0000

    cam/da: Call cam_periph_invalidate on ENXIO in dadone

    Use cam_periph_invalidate() instead of just setting the PACK_INVALID
    flag in the da softc. It's a more appropriate and bigger hammer for
    this case. PACK_INVALID is set as part of that, so remove the
    now-redundant setting. This also has the side effect of
    short-circuiting errors for other I/O still in the drive which is just
    about to fail (sometimes with different error codes than what
    triggered this ENXIO).

    The prior practice of just setting the PACK_INVALID flag, however, was
    too ephemeral to be effective. Since daopen would clear PACK_INVALID
    after a successful open, we'd have to rediscover the error (which
    takes tens of seconds) for every different geom tasting the drive.
    These two factors led to a watchdog timeout before we could get
    through all the devices if we had multiple failed drives with this
    syndrome. By invalidating the periph, we fail fast enough to boot far
    enough to start petting the watchdog. If we disable the watchdog, the
    tasting eventually completes, but takes over an hour, which is too
    long. As it is, it takes an extra minute per failed drive, which is
    tolerable.

    When the PACK_INVALID flag is already set, just flush remaining I/Os
    with ENXIO. This bit will be set either when we've called
    cam_periph_invalidate() before (so we're just waiting for the I/Os to
    complete) or, more typically, when we've seen an ASC 0x3a, which is
    the catch-all for 'drive is otherwise OK, we're just missing the media
    to get data from'. In the latter case, we do not want to invalidate
    the periph since we allow recovery from this with a trip through
    daopen().

    While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for
    failing the entire drive, I've opted not to use that. That flag would
    also cause all attached drivers, like pass, to detach, which is
    undesirable. By not adding that flag, but just invalidating the da
    periph driver, we prevent I/Os, but still allow collection of logs
    from the device.

    We can also simplify the logic w/o bloating the change, so do that
    too.

    Finally, this has been tested on all the removable/non-removable
    disks I could find, cd players, combo cd/da memory sticks, etc. I've
    removed the media while doing I/O on several of them. With these
    changes, we handle things correctly in all the cases I tested (except
    partially inserted media, which fails chaotically the same as before).
    The number of devices out there is, however, huge.

    mav@ raised concerns about what happens when we have asc/ascq 28/0. I
    see that on boot for one of my cards (that's not autoquirked) and, as
    predicted in the review, we retry that transaction and we get proper
    behavior. To be fair, though, I only ever saw it at startup, where it
    was a transient. I couldn't get some of my energy-saving disks to
    ever throw that ASC/ASCQ, even after they spun down, so I've not
    tested that case.

    Sponsored by:           Netflix
    Discussed with:         mav@
    Differential Revision:  https://reviews.freebsd.org/D48689
---
 sys/cam/scsi/scsi_da.c | 59 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 22 deletions(-)

diff --git a/sys/cam/scsi/scsi_da.c b/sys/cam/scsi/scsi_da.c
index 44dc21d1bc2f..1fd6d4919c61 100644
--- a/sys/cam/scsi/scsi_da.c
+++ b/sys/cam/scsi/scsi_da.c
@@ -1805,7 +1805,10 @@ daopen(struct disk *dp)
 
 	/*
 	 * Only 'validate' the pack if the media size is non-zero and the
-	 * underlying peripheral isn't invalid (the only error != 0 path).
+	 * underlying peripheral isn't invalid (the only error != 0 path). Once
+	 * the periph is marked invalid, we only get here on lost races with its
+	 * teardown, so keeping the pack invalid also keeps more I/O from
+	 * starting.
 	 */
 	if (error == 0 && softc->params.sectors != 0)
 		softc->flags &= ~DA_FLAG_PACK_INVALID;
@@ -4609,33 +4612,45 @@ dadone(struct cam_periph *periph, union ccb *done_ccb)
 			 */
 			bp = (struct bio *)done_ccb->ccb_h.ccb_bp;
 			if (error != 0) {
-				int queued_error;
+				bool pack_invalid =
+				    (softc->flags & DA_FLAG_PACK_INVALID) != 0;
 
-				/*
-				 * return all queued I/O with EIO, so that
-				 * the client can retry these I/Os in the
-				 * proper order should it attempt to recover.
-				 */
-				queued_error = EIO;
-
-				if (error == ENXIO
-				 && (softc->flags & DA_FLAG_PACK_INVALID)== 0) {
+				if (error == ENXIO && !pack_invalid) {
 					/*
-					 * Catastrophic error. Mark our pack as
-					 * invalid.
+					 * ENXIO flags ASC/ASCQ codes for either media
+					 * missing, or the drive being extremely
+					 * unhealthy. Invalidate peripheral on this
+					 * catestrophic error when the pack is valid
+					 * since we set the pack invalid bit only for
+					 * the few ASC/ASCQ codes indicating missing
+					 * media. The invalidation will flush any
+					 * queued I/O and short-circuit retries for
+					 * other I/O. We only invalidate the da device
+					 * so the passX device remains for recovery and
+					 * diagnostics.
 					 *
-					 * XXX See if this is really a media
-					 * XXX change first?
+					 * While we do also set the pack invalid bit
+					 * after invalidating the peripheral, the
+					 * pending I/O will have been flushed then with
+					 * no new I/O starting, so this 'edge' case
+					 * doesn't matter.
 					 */
 					xpt_print(periph->path, "Invalidating pack\n");
-					softc->flags |= DA_FLAG_PACK_INVALID;
-#ifdef CAM_IO_STATS
-					softc->invalidations++;
-#endif
-					queued_error = ENXIO;
+					cam_periph_invalidate(periph);
+				} else {
+					/*
+					 * Return all queued I/O with EIO, so that the
+					 * client can retry these I/Os in the proper
+					 * order should it attempt to recover. When the
+					 * pack is invalid, fail all I/O with ENXIO
+					 * since we can't assume when the media returns
+					 * it's the same media and we force a trip
+					 * through daclose / daopen and the client won't
+					 * retry.
+					 */
+					cam_iosched_flush(softc->cam_iosched, NULL,
+					    pack_invalid ? ENXIO : EIO);
 				}
-				cam_iosched_flush(softc->cam_iosched, NULL,
-				    queued_error);
 				if (bp != NULL) {
 					bp->bio_error = error;
 					bp->bio_resid = bp->bio_bcount;
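
The error-handling decision described in the commit message can be sketched as a small user-space model. This is only an illustration, not kernel code: `struct model_da`, `enum model_action`, `model_dadone_error()`, `model_daopen()`, and `model_demo()` are hypothetical names invented here, and the model assumes (per the commit message and the daopen comment) that invalidating the periph also sets the pack-invalid flag and makes later opens fail.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the da softc/periph state; not the kernel types. */
struct model_da {
	bool pack_invalid;	/* models DA_FLAG_PACK_INVALID */
	bool periph_invalid;	/* models a periph that has been invalidated */
};

/* What the model decides to do with an I/O completion. */
enum model_action {
	MODEL_NONE,		/* no error, nothing to do */
	MODEL_INVALIDATE_PERIPH,	/* ENXIO on a valid pack: big hammer */
	MODEL_FLUSH_ENXIO,	/* pack already invalid: fail queued I/O */
	MODEL_FLUSH_EIO		/* other errors: client may retry in order */
};

/* Models the post-change decision in dadone's error path. */
static enum model_action
model_dadone_error(struct model_da *da, int error)
{
	if (error == 0)
		return (MODEL_NONE);
	if (error == ENXIO && !da->pack_invalid) {
		/* Assumption: invalidation sets the pack-invalid flag too. */
		da->periph_invalid = true;
		da->pack_invalid = true;
		return (MODEL_INVALIDATE_PERIPH);
	}
	return (da->pack_invalid ? MODEL_FLUSH_ENXIO : MODEL_FLUSH_EIO);
}

/* Models daopen: the flag is cleared only by a successful open with media. */
static int
model_daopen(struct model_da *da, unsigned long sectors)
{
	int error = da->periph_invalid ? ENXIO : 0;

	if (error == 0 && sectors != 0)
		da->pack_invalid = false;
	return (error);
}

/* Walks through the two cases the commit message distinguishes. */
static void
model_demo(void)
{
	/* Failed drive: ENXIO on a valid pack invalidates the periph. */
	struct model_da dead = { false, false };
	assert(model_dadone_error(&dead, ENXIO) == MODEL_INVALIDATE_PERIPH);
	assert(model_dadone_error(&dead, EIO) == MODEL_FLUSH_ENXIO);
	assert(model_daopen(&dead, 1000) == ENXIO);
	assert(dead.pack_invalid);	/* stays invalid, so tasting fails fast */

	/* Missing media: flag set elsewhere, periph still valid. */
	struct model_da nomedia = { true, false };
	assert(model_dadone_error(&nomedia, ENXIO) == MODEL_FLUSH_ENXIO);
	assert(model_daopen(&nomedia, 1000) == 0);
	assert(!nomedia.pack_invalid);	/* recovered via a trip through open */
}
```

Calling `model_demo()` from a test harness exercises both paths: the failed-drive case stays invalid across opens, while the missing-media case recovers once an open succeeds with a non-zero media size.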