From owner-freebsd-hackers  Tue Jul 18 18:20:14 1995
Return-Path: hackers-owner
Received: (from majordom@localhost)
          by freefall.cdrom.com (8.6.11/8.6.6) id SAA26653
          for hackers-outgoing; Tue, 18 Jul 1995 18:20:14 -0700
Received: from hda.com (hda.com [199.232.40.182])
          by freefall.cdrom.com (8.6.11/8.6.6) with ESMTP id SAA26644
          ; Tue, 18 Jul 1995 18:20:11 -0700
Received: (dufault@localhost) by hda.com (8.6.11/8.3) id VAA00365; Tue, 18 Jul 1995 21:21:10 -0400
From: Peter Dufault <dufault@hda.com>
Message-Id: <199507190121.VAA00365@hda.com>
Subject: Re: SCSI drivers
To: root@freefall.cdrom.com (& freefall.cdrom.com)
Date: Tue, 18 Jul 1995 21:21:09 -0400 (EDT)
Cc: julian@ref.tfs.com, stratlif@grail.cba.csuohio.edu,
        freebsd-hackers@FreeBSD.org
In-Reply-To: <199507182014.NAA11391@freefall.cdrom.com> from "& freefall.cdrom.com" at Jul 18, 95 01:14:49 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Content-Length: 2448      
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

& freefall.cdrom.com writes:
...

(I guess that is Justin's new name; I know he went to a wedding
but I didn't think it was his)

I started to prepare a response to this, and seeing that there is
some interest I'll fire away.  In particular, anybody thinking about
SCSI code should think about how we can layer it properly so that
policy is dictated from above, and where that layering breaks down.

Here it is:
> 
> o Error Recovery
> 
>   This driver implements extensive error recovery procedures.  When the
>   higher level parts of the SCSI subsystem request that a command be reset,
>   a bus device reset is first sent to the target device.  If two bus device
>   resets have been attempted and no command to the device has completed
>   successfully, then a host adapter hard reset and SCSI bus reset is
>   performed.  SCSI bus resets caused by other devices and detected by the
>   host adapter are also handled by issuing a hard reset to the host adapter
>   and full reinitialization.  This strategy should improve overall system
>   robustness by preventing individual errant devices from causing the
>   system as a whole to lockup or crash, and thereby allowing a clean
>   shutdown and restart after the offending component is removed.

I have one overall comment, which is that policy should be handled
at the common level and not at the lower ones.  If the policy on device
hangups is "try two bus device resets, then reset the board, and then
reset the bus", then it should be driven by calls down from the common
code and not decided upon in a single low-level driver.
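
As a rough sketch of that layering, and nothing more: the common code
below owns the escalation policy and the low-level driver only exports
reset mechanisms.  Every name here (adapter_ops, target_state,
common_recover, the stub driver) is hypothetical and is not an existing
FreeBSD interface, and the thresholds are only one possible reading of
the policy quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Recovery steps, ordered from least to most disruptive. */
enum recovery_step {
    STEP_DEVICE_RESET,    /* bus device reset to a single target */
    STEP_ADAPTER_RESET,   /* hard reset of the host adapter ("the board") */
    STEP_BUS_RESET,       /* reset of the whole SCSI bus */
    STEP_GIVE_UP          /* nothing left to try */
};

/* Mechanism only: what a low-level driver would export (hypothetical). */
struct adapter_ops {
    void (*device_reset)(void *softc, int target);
    void (*adapter_reset)(void *softc);
    void (*bus_reset)(void *softc);
};

/* Per-target bookkeeping owned by the common layer. */
struct target_state {
    int failed_recoveries;  /* recovery attempts since the last good command */
};

/* Policy: lives in common code, in one place, for every adapter. */
static enum recovery_step
choose_step(int failed_recoveries)
{
    if (failed_recoveries < 2)
        return STEP_DEVICE_RESET;
    if (failed_recoveries == 2)
        return STEP_ADAPTER_RESET;
    if (failed_recoveries == 3)
        return STEP_BUS_RESET;
    return STEP_GIVE_UP;
}

/* Called by common code when a command to `target` has hung. */
static enum recovery_step
common_recover(struct target_state *ts, const struct adapter_ops *ops,
    void *softc, int target)
{
    enum recovery_step step;

    assert(ts != NULL && ops != NULL);
    step = choose_step(ts->failed_recoveries);
    switch (step) {
    case STEP_DEVICE_RESET:
        ops->device_reset(softc, target);
        break;
    case STEP_ADAPTER_RESET:
        ops->adapter_reset(softc);
        break;
    case STEP_BUS_RESET:
        ops->bus_reset(softc);
        break;
    case STEP_GIVE_UP:
        return step;        /* no mechanism left to call */
    }
    ts->failed_recoveries++;
    return step;
}

/* Called by common code when any command to the target completes. */
static void
common_command_ok(struct target_state *ts)
{
    ts->failed_recoveries = 0;
}

/* Stub driver for illustration: it only counts which mechanism was used. */
struct stub_softc {
    int device_resets, adapter_resets, bus_resets;
};

static void
stub_device_reset(void *sc, int target)
{
    (void)target;
    ((struct stub_softc *)sc)->device_resets++;
}

static void
stub_adapter_reset(void *sc)
{
    ((struct stub_softc *)sc)->adapter_resets++;
}

static void
stub_bus_reset(void *sc)
{
    ((struct stub_softc *)sc)->bus_resets++;
}

static const struct adapter_ops stub_ops = {
    stub_device_reset, stub_adapter_reset, stub_bus_reset
};
```

With this split, changing the policy means editing choose_step() once
rather than touching every host adapter driver.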

Another observation is that other transactions will be outstanding
on the bus when you reset it.  You'll have to resubmit these
transactions after resetting the SCSI bus.  Some transactions will not
make sense to resubmit, such as an aborted tape write.
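
One way to picture the resubmission problem, as a sketch under assumed
names only: each outstanding transfer carries a flag saying whether it
is safe to replay, and after a bus reset the common code requeues the
replayable ones and fails the rest back to the caller.  The xfer
structure, the XF_NO_RETRY flag, and requeue_after_bus_reset() are
hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum xfer_status {
    XS_PENDING,     /* was on the bus when the reset happened */
    XS_REQUEUED,    /* will be submitted again */
    XS_FAILED,      /* returned to the caller with an error */
    XS_DONE         /* completed before the reset; left untouched */
};

/* Set by the device-type driver for commands that must not be replayed,
 * for example a write to a sequential-access (tape) device. */
#define XF_NO_RETRY 0x01u

struct xfer {
    unsigned flags;
    enum xfer_status status;
};

/*
 * Walk the transfers that were outstanding at the time of the bus reset.
 * Returns the number of transfers that were requeued.
 */
static size_t
requeue_after_bus_reset(struct xfer *xfers, size_t n)
{
    size_t i, requeued = 0;

    assert(xfers != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (xfers[i].status != XS_PENDING)
            continue;               /* not affected by the reset */
        if (xfers[i].flags & XF_NO_RETRY) {
            xfers[i].status = XS_FAILED;
        } else {
            xfers[i].status = XS_REQUEUED;
            requeued++;
        }
    }
    return requeued;
}
```

The decision about which commands get XF_NO_RETRY belongs to the layer
that knows the device type, not to the host adapter driver.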

Of course this could be a proof of concept that will then be implemented
in a more uniform fashion, and the author may have addressed these
issues.

I sent out a summary of an error strategy a little while ago, and the
consensus was that it was 2.2 material because of the changes involved.
It included suspending the activity on the SCSI bus to let as much I/O
as possible drain before resetting the bus and trying to pick things up
again.
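
The suspend-and-drain idea can be sketched roughly as follows.  This is
an illustration under assumed names (bus_queue, bus_freeze, bus_thaw
and friends are hypothetical), not the strategy from that summary: new
work is held back while the bus is frozen, commands already in flight
are allowed to complete, and the reset is issued only once the bus has
drained.  A real implementation would also have to bound the wait with
a timeout, since a hung device may never complete its command.

```c
#include <assert.h>

struct bus_queue {
    int frozen;     /* nonzero while recovery is in progress */
    int in_flight;  /* commands handed to the adapter, not yet completed */
    int held;       /* commands accepted but held back while frozen */
};

/* Submit one command.  Returns 1 if it was started, 0 if it was held. */
static int
bus_submit(struct bus_queue *q)
{
    if (q->frozen) {
        q->held++;
        return 0;
    }
    q->in_flight++;
    return 1;
}

/* One in-flight command has completed. */
static void
bus_complete(struct bus_queue *q)
{
    assert(q->in_flight > 0);
    q->in_flight--;
}

/* Stop starting new commands so that outstanding I/O can drain. */
static void
bus_freeze(struct bus_queue *q)
{
    q->frozen = 1;
}

/* True once the bus is frozen and nothing is left in flight. */
static int
bus_drained(const struct bus_queue *q)
{
    return q->frozen && q->in_flight == 0;
}

/* After the reset: start everything that was held.  Returns that count. */
static int
bus_thaw(struct bus_queue *q)
{
    int released = q->held;

    q->in_flight += released;
    q->held = 0;
    q->frozen = 0;
    return released;
}
```

The intended order of calls is bus_freeze(), wait until bus_drained()
or a timeout, reset the bus, then bus_thaw().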

-- 
Peter Dufault               Real Time Machine Control and Simulation
HD Associates, Inc.         Voice: 508 433 6936
dufault@hda.com             Fax:   508 433 5267