Date: Sat, 10 Feb 2001 22:02:13 -0800 (PST)
From: Matthew Jacob <mjacob@feral.com>
To: "Justin T. Gibbs" <gibbs@scsiguy.com>
Cc: audit@freebsd.org, "Kenneth D. Merry" <ken@kdm.org>, Gerard Roudier <groudier@club-internet.fr>
Subject: Re: a couple of minor but important changes to SCSI error handling
Message-ID: <Pine.BSF.4.21.0102102158450.68317-100000@beppo.feral.com>
In-Reply-To: <200102110523.f1B5NbO10383@aslan.scsiguy.com>
On Sat, 10 Feb 2001, Justin T. Gibbs wrote:

> >
> >First is scsi_all.c:
>
> This looks fine.  I also verified that the new error recovery code that
> Ken is reviewing right now also gets this right.

Good!

>
> >Second is scsi_da.c:
>
> ...
>
> >10 retries with a .5 second delay between each is still only 5 seconds. 10
> >retries might be more appropriate to a SAN environment with at least a couple
> >of seconds of different initiators spasming the loop.
>
> Depending on the error, I don't know that we would necessarily delay or not
> here.  If an initiator is spamming the loop, what does the peripheral driver
> see?  A command timeout?  Something reported as a "selection timeout"?  If
> you can be more specific, perhaps we can make the da error handler smarter
> so that certain types of errors get additional retries (similar perhaps to
> how we do a series of TURs for some errors in cam_periph_error()).

Well, the default action for selection timeout is to delay .5 seconds.
That's what this affects.

There's a bit of uncertainty when a device leaves the loop (or the fabric)
as to whether it has really left for good or just temporarily. I'd like to
give a device we've seen before a bit more grace before we give up on it.

When I did the Solaris SCSA stuff, I did 30 retries, but I didn't give it
enough grace time. If it's a device with mounted filesystems, you should
give somebody a chance to see the messages spewing out, and enough time to
go back and plug in the cable that they unplugged.

So, really, 5 seconds isn't enough... this may be more in the new error
recovery zone.

Note that this affects the read/write code only, not the probe or sync
cache or read capacity or 'other' code.

-matt


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-audit" in the body of the message
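
The retry arithmetic discussed above (retry count times per-retry delay
gives the grace period before the driver gives up) can be sketched roughly
as follows. This is only an illustration: struct retry_policy,
grace_time_ms() and retries_for_grace() are hypothetical names, not part of
CAM or scsi_da.c, and the only figures taken from the thread are the .5
second delay and the 10 and 30 retry counts.

```c
#include <stdio.h>

/*
 * Hypothetical retry policy, for illustration only.
 * These names do not exist in CAM or scsi_da.c.
 */
struct retry_policy {
	unsigned int max_retries;	/* retries allowed before giving up */
	unsigned int delay_ms;		/* pause before each retry, in ms */
};

/* Total time spent delaying before the last retry is exhausted. */
static unsigned int
grace_time_ms(const struct retry_policy *p)
{
	return (p->max_retries * p->delay_ms);
}

/*
 * Smallest retry count that gives at least want_ms of grace when each
 * retry is preceded by delay_ms.  Returns 0 if delay_ms is 0, since no
 * number of undelayed retries adds grace time.
 */
static unsigned int
retries_for_grace(unsigned int want_ms, unsigned int delay_ms)
{
	if (delay_ms == 0)
		return (0);
	return ((want_ms + delay_ms - 1) / delay_ms);	/* round up */
}

/* Print the grace period a policy yields, in seconds. */
static void
print_policy(const struct retry_policy *p)
{
	unsigned int ms = grace_time_ms(p);

	printf("%u retries x %u ms = %u.%03u s\n",
	    p->max_retries, p->delay_ms, ms / 1000, ms % 1000);
}
```

With a policy of { 10, 500 }, grace_time_ms() gives 5000 ms, the 5 seconds
called too short above; { 30, 500 } gives 15000 ms. How long an operator
needs to notice the messages and replug a cable is a policy decision; for
an assumed target of 60 seconds, retries_for_grace(60000, 500) gives 120.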