From owner-freebsd-audit  Sat Feb 10 22: 2:35 2001
Delivered-To: freebsd-audit@freebsd.org
Received: from feral.com (feral.com [192.67.166.1])
	by hub.freebsd.org (Postfix) with ESMTP id 5647C37B401
	for <audit@freebsd.org>; Sat, 10 Feb 2001 22:02:17 -0800 (PST)
Received: from beppo (beppo [192.67.166.79])
	by feral.com (8.9.3/8.9.3) with ESMTP id WAA29941;
	Sat, 10 Feb 2001 22:02:17 -0800
Date: Sat, 10 Feb 2001 22:02:13 -0800 (PST)
From: Matthew Jacob <mjacob@feral.com>
Reply-To: mjacob@feral.com
To: "Justin T. Gibbs" <gibbs@scsiguy.com>
Cc: audit@freebsd.org, "Kenneth D. Merry" <ken@kdm.org>,
	Gerard Roudier <groudier@club-internet.fr>
Subject: Re: a couple of minor but important changes to SCSI error handling
In-Reply-To: <200102110523.f1B5NbO10383@aslan.scsiguy.com>
Message-ID: <Pine.BSF.4.21.0102102158450.68317-100000@beppo.feral.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-audit@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


On Sat, 10 Feb 2001, Justin T. Gibbs wrote:

> >
> >First is scsi_all.c:
> 
> This looks fine.  I also verified that the new error recovery code that
> Ken is reviewing right now also gets this right.

Good!

> 
> >Second is scsi_da.c:
> 
> ...
> 
> >10 retries with a .5 second delay between each is still only 5 seconds. 10
> >retries might be more appropriate to a SAN environment with at least a couple
> >of seconds of different initiators spasming the loop.
> 
> Depending on the error, I don't know that we would necessarily delay or not
> here.  If an initiator is spamming the loop, what does the peripheral driver
> see?  A command timeout?  Something reported as a "selection timeout"?  If
> you can be more specific, perhaps we can make the da error handler smarter
> so that certain types of errors get additional retries (similar perhaps to
> how we do a series of TURs for some errors in cam_periph_error()).

Well, the default action for a selection timeout is to delay .5 seconds.
That's the case this change affects.

There's a bit of uncertainty when a device leaves the loop (or the fabric) as
to whether it's really left for good or just temporarily. I'd like to give a
device we'd seen before a bit more grace before we give up on it. When I did
the Solaris SCSA stuff, I did 30 retries, but I didn't give it enough grace
time. If it's a device with mounted filesystems, you should give somebody a
chance to see the messages spewing out and enough time to go back and
plug in the cable they unplugged. So, really, 5 seconds isn't
enough... this may be more in the new error recovery zone.

Note that this affects the read/write code only, not the probe, sync cache,
read capacity, or 'other' code.

-matt

