From owner-freebsd-hackers Tue Jul 18 18:20:14 1995
Return-Path: hackers-owner
Received: (from majordom@localhost) by freefall.cdrom.com (8.6.11/8.6.6) id SAA26653 for hackers-outgoing; Tue, 18 Jul 1995 18:20:14 -0700
Received: from hda.com (hda.com [199.232.40.182]) by freefall.cdrom.com (8.6.11/8.6.6) with ESMTP id SAA26644 ; Tue, 18 Jul 1995 18:20:11 -0700
Received: (dufault@localhost) by hda.com (8.6.11/8.3) id VAA00365; Tue, 18 Jul 1995 21:21:10 -0400
From: Peter Dufault
Message-Id: <199507190121.VAA00365@hda.com>
Subject: Re: SCSI drivers
To: root@freefall.cdrom.com (& freefall.cdrom.com)
Date: Tue, 18 Jul 1995 21:21:09 -0400 (EDT)
Cc: julian@ref.tfs.com, stratlif@grail.cba.csuohio.edu, freebsd-hackers@FreeBSD.org
In-Reply-To: <199507182014.NAA11391@freefall.cdrom.com> from "& freefall.cdrom.com" at Jul 18, 95 01:14:49 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Content-Length: 2448
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

& freefall.cdrom.com writes:
...

(I guess that is Justin's new name; I know he went to a wedding but
I didn't think it was his)

I started to prepare a response to this, and seeing that there is
some interest I'll fire away. In particular, anybody thinking about
SCSI code should think about how we can properly layer it so that the
policy is dictated from above, and where that breaks down.

Here it is:

>
> o Error Recovery
>
> This driver implements extensive error recovery procedures. When the
> higher level parts of the SCSI subsystem request that a command be reset,
> a bus device reset is first sent to the target device. If two bus device
> resets have been attempted and no command to the device has completed
> successfully, then a host adapter hard reset and SCSI bus reset is
> performed. SCSI bus resets caused by other devices and detected by the
> host adapter are also handled by issuing a hard reset to the host adapter
> and full reinitialization. This strategy should improve overall system
> robustness by preventing individual errant devices from causing the
> system as a whole to lockup or crash, and thereby allowing a clean
> shutdown and restart after the offending component is removed.

I have one overall comment, which is that policy should be handled at
the common level and not the lower ones. If the policy on device
hangups is "after two bus device resets reset the device, then reset
the board, and then reset the bus" then it should be driven by calls
down from the common code and not decided upon in a single low level
driver.

Another observation is that you'll have outstanding work in progress
on the bus. You'll have to resubmit these transactions after
resetting the SCSI bus. Some transactions will not make sense to
resubmit, such as an aborted tape write.

Of course this could be a proof of concept that will then be
implemented in a more uniform fashion, and the author may have
addressed these issues.

I sent out a summary of an error strategy a little while ago, and the
consensus was that it was 2.2 material because of the changes
involved. It included suspending the activity on the SCSI bus to let
as much I/O as possible drain before resetting the bus and trying to
pick things up again.

-- 
Peter Dufault               Real Time Machine Control and Simulation
HD Associates, Inc.         Voice: 508 433 6936
dufault@hda.com             Fax:   508 433 5267
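
The layering argued for in the message can be sketched roughly as
below. This is only an illustration in Python, not FreeBSD code: the
names FakeAdapter, Transaction and recover are hypothetical, the
"device recovers after action X" model is an assumption made so the
example runs, and the escalation order follows the quoted driver notes
(two bus device resets, then a host adapter hard reset, then a SCSI
bus reset). The point is that the adapter only exposes mechanisms,
while the common code owns the escalation order and decides what to
resubmit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    """An outstanding command; retryable is False for e.g. a tape write."""
    name: str
    retryable: bool = True


class FakeAdapter:
    """Hypothetical low-level driver: offers reset mechanisms, no policy."""

    def __init__(self, recovers_after: str):
        # Assumption for the sketch: exactly one action un-hangs the device.
        self.recovers_after = recovers_after
        self.log: List[str] = []

    def _do(self, action: str) -> bool:
        self.log.append(action)
        return action == self.recovers_after

    def bus_device_reset(self) -> bool:
        return self._do("bus_device_reset")

    def adapter_hard_reset(self) -> bool:
        return self._do("adapter_hard_reset")

    def scsi_bus_reset(self) -> bool:
        return self._do("scsi_bus_reset")


def recover(adapter, outstanding):
    """Common-layer policy: escalate step by step, then sort the backlog.

    Returns (resubmit, failed). If no step recovers the device,
    everything outstanding is reported as failed.
    """
    # The policy lives here as data, not inside any one driver.
    steps = [
        (adapter.bus_device_reset, 2),
        (adapter.adapter_hard_reset, 1),
        (adapter.scsi_bus_reset, 1),
    ]
    recovered = False
    for action, attempts in steps:
        # any() stops at the first attempt that reports success.
        if any(action() for _ in range(attempts)):
            recovered = True
            break
    if not recovered:
        return [], list(outstanding)
    resubmit = [t for t in outstanding if t.retryable]
    failed = [t for t in outstanding if not t.retryable]
    return resubmit, failed
```

Changing the policy (say, draining I/O before the bus reset) would
then mean editing the steps list in the common code and leaving every
low level driver alone.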