Date: Sat, 10 Feb 2001 22:02:13 -0800 (PST)
From: Matthew Jacob
Reply-To: mjacob@feral.com
To: "Justin T. Gibbs"
Cc: audit@freebsd.org, "Kenneth D. Merry", Gerard Roudier
Subject: Re: a couple of minor but important changes to SCSI error handling
In-Reply-To: <200102110523.f1B5NbO10383@aslan.scsiguy.com>

On Sat, 10 Feb 2001, Justin T. Gibbs wrote:

> >
> >First is scsi_all.c:
>
> This looks fine.  I also verified that the new error recovery code that
> Ken is reviewing right now also gets this right.

Good!

>
> >Second is scsi_da.c:
> > ...
>
> >10 retries with a .5 second delay between each is still only 5 seconds. 10
> >retries might be more appropriate to a SAN environment with at least a couple
> >of seconds of different initiators spasming the loop.
>
> Depending on the error, I don't know that we would necessarily delay or not
> here.  If an initiator is spamming the loop, what does the peripheral driver
> see?  A command timeout?  Something reported as a "selection timeout"?  If
> you can be more specific, perhaps we can make the da error handler smarter
> so that certain types of errors get additional retries (similar perhaps to
> how we do a series of TURs for some errors in cam_periph_error()).

Well, the default action for selection timeout is to delay .5 seconds. That's what this affects. There's a bit of uncertainty when a device leaves the loop (or the fabric) as to whether it has really left for good or just temporarily.
I'd like to give a device we've seen before a bit more grace before we give up on it. When I did the Solaris SCSA stuff, I did 30 retries, but I didn't give it enough grace time: if it's a device with mounted filesystems, you should give somebody a chance to see the messages spewing out and enough time to go back and plug in the cable they unplugged.

So, really, 5 seconds isn't enough. This may be more in the new error recovery zone.

Note that this affects the read/write code only, not the probe, sync cache, read capacity, or 'other' code.

-matt
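
The retry arithmetic discussed above (10 retries with a .5 second delay giving only 5 seconds of grace) can be sketched as follows. This is a minimal illustration, not the scsi_da.c code: the function names are hypothetical, and the 60-second target is an assumed figure for "long enough to plug a cable back in", not a value from the thread.

```python
import math


def grace_window_seconds(retries: int, delay_seconds: float) -> float:
    """Total waiting time if each retry is preceded by a fixed delay."""
    if retries < 0 or delay_seconds < 0:
        raise ValueError("retries and delay must be non-negative")
    return retries * delay_seconds


def retries_needed(target_seconds: float, delay_seconds: float) -> int:
    """Smallest retry count whose delays add up to at least target_seconds."""
    if delay_seconds <= 0:
        raise ValueError("delay must be positive")
    return math.ceil(target_seconds / delay_seconds)


if __name__ == "__main__":
    # Figures from the thread: 10 retries, .5 second delay on selection timeout.
    print(grace_window_seconds(10, 0.5))
    # Assumed target: one minute for an operator to notice and replug a cable.
    print(retries_needed(60, 0.5))
```

With a fixed delay the grace window grows only linearly with the retry count, which is why a longer window needs either many more retries or a longer delay per retry.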