From owner-freebsd-scsi  Tue Aug 19 09:57:12 1997
Message-Id: <199708191654.KAA24228@pluto.plutotech.com>
X-Mailer: exmh version 2.0zeta 7/24/97
To: Greg Lehey <grog@lemis.com>
cc: FreeBSD SCSI Mailing List <freebsd-scsi@FreeBSD.ORG>
Subject: Re: Bus resets. Grrrr. 
In-reply-to: Your message of "Tue, 19 Aug 1997 15:30:23 +0930."
             <19970819153023.02433@lemis.com> 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 19 Aug 1997 10:53:54 -0600
From: "Justin T. Gibbs" <gibbs@plutotech.com>
Sender: owner-freebsd-scsi@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

>> What version of the kernel are you using
>
>Recent versions of -current.  The ones I reported it against were some
>time last week.  I've just rebuilt with a version supped this morning.

And it is still reproducible?

>> The message simply indicates the state of the SCSI bus at the time the
>> timeout occurred...
>
>Aha (oops, ahc).  So the state isn't usually very relevant to the
>problem?

It depends on the problem.  If the sequencer code has a protocol bug, it
is usually indicated by a hang or timeout in a particular bus state.

>> So, what does a "timeout while idle" tell us?  Well, it means that either
>> the timeout that the type driver (in this case the "st" driver) 
>
>In fact, this was the sd driver, specifically sd0.  It always seems to
>be sd0, although I have 3 disks connected to the bus, which tends to
>confirm the theory that there is something wrong with the physical
>bus.  It could also, of course, indicate that the disk is dying.

Hmmm. How many devices are active at the time that the timeout occurs?
Since you are not using tagged queuing, you would need 5 devices active
at a time to overflow the QOUTFIFO (the bug that I fixed recently) on an
aic7860 based controller.
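
As a rough illustration of that arithmetic (a toy model, not driver code):
without tagged queuing each device has at most one command outstanding, so
the worst-case number of completions waiting in the FIFO is bounded by the
number of active devices.  The depth of 4 below is only an assumption
inferred from the "5 devices" figure above, and the names QOUTFIFO_DEPTH
and qoutfifo_can_overflow are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed depth, inferred from "5 devices to overflow"; not read from the driver. */
#define QOUTFIFO_DEPTH 4

/*
 * Toy model: in the worst case every outstanding command completes before
 * the host drains the FIFO.  Overflow is possible only if that count
 * exceeds the FIFO depth.
 */
static bool
qoutfifo_can_overflow(int active_devices, int cmds_per_device)
{
	assert(active_devices >= 0 && cmds_per_device >= 0);
	return active_devices * cmds_per_device > QOUTFIFO_DEPTH;
}
```

Under these assumptions, qoutfifo_can_overflow(4, 1) is false and
qoutfifo_can_overflow(5, 1) is true; with tagged queuing (more than one
command per device) far fewer devices would be needed.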

>> specified was too short, or the aic7xxx driver lost the command
>> somewhere either en route to or from the device.  The latter problem
>> did occur under heavy load prior to my latest "spin lock" change to
>> the driver.
>
>When was that?  Would it also have the effect that the abort message
>wouldn't be taken?

The abort probably was taken, but the tape drive took a long time to
release the bus, which was why the bus was reset.  I put my fix in
the kernel on 8/13 in rev 1.121 of aic7xxx.c.

>> The first problem seems really common in the st driver especially
>> when older media or a rewind operation is involved.  You can try
>> bumping up the timeouts in sys/scsi/st.c to see if this solves your
>> problem.
>
>As I said, this wasn't a tape device timeout.  In any case, this
>always seems to happen when the tape is writing, which makes it look
>more like the heavy load scenario.

Could it be that you don't have disconnections enabled for your tape drive?
You should check both SCSI-Select for the 2940 and any relevant jumpers
on the tape drive itself.  If disconnections are disabled, a tape write that
required multiple retries could easily tie up the SCSI bus for the 10 seconds
needed to make a disk command time out.
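
To make the reasoning concrete, here is a minimal sketch of why the
disconnect setting matters.  It is a simplified model, not anything from
the driver: it assumes a tape that disconnects costs the disk no bus wait
at all, and that a tape that does not disconnect holds the bus for the
whole operation.  The 10 second figure is the one mentioned above; the
function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Disk command timeout of 10 seconds, the figure quoted above. */
#define DISK_TIMEOUT_MS 10000

/*
 * Toy model of how long a disk command waits for the bus while the tape
 * is busy.  With disconnection the tape frees the bus during the
 * operation (modelled as zero wait); without it the bus is held for the
 * entire operation.
 */
static int
bus_wait_ms(bool tape_disconnects, int tape_op_ms)
{
	assert(tape_op_ms >= 0);
	return tape_disconnects ? 0 : tape_op_ms;
}

/* True if the waiting disk command would hit its timeout. */
static bool
disk_cmd_times_out(bool tape_disconnects, int tape_op_ms)
{
	return bus_wait_ms(tape_disconnects, tape_op_ms) >= DISK_TIMEOUT_MS;
}
```

In this model a 12 second tape write with retries times out the disk
command only when disconnection is disabled, which is the pattern to look
for when checking SCSI-Select and the drive jumpers.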

>> What it means is that the tape drive accepted the connection from
>> the controller, most likely accepted the ABORT message, but took
>> longer than the driver allowed for it to process the abort request,
>> free the bus, and thus signal that the abort was successful.  So,
>> we take out the hammer and reset the bus.  The timeout in the
>> aic7xxx driver for abort requests may be too short.
>
>Would this still be the case for disks?
>

The analysis is the same regardless of the device type.
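
The escalation described in the quoted text (command times out, an ABORT
is sent, and the bus is reset only if the abort does not finish in time)
can be sketched roughly as follows.  This is a simplified model with
hypothetical names and caller-supplied timeouts; it does not reproduce the
aic7xxx driver's actual code or its real timeout values.

```c
#include <assert.h>

enum recovery { RECOVERY_NONE, RECOVERY_ABORT_OK, RECOVERY_BUS_RESET };

/*
 * Decide what recovery step results, given how long the command and the
 * subsequent abort each took and the timeouts allowed for them.
 */
static enum recovery
recover(int cmd_ms, int cmd_timeout_ms, int abort_ms, int abort_timeout_ms)
{
	assert(cmd_timeout_ms > 0 && abort_timeout_ms > 0);
	if (cmd_ms < cmd_timeout_ms)
		return RECOVERY_NONE;		/* command finished in time */
	if (abort_ms < abort_timeout_ms)
		return RECOVERY_ABORT_OK;	/* target freed the bus in time */
	return RECOVERY_BUS_RESET;		/* abort itself timed out */
}
```

Nothing in the decision depends on whether the target is a disk or a
tape; only the time the device takes to free the bus matters.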

>> As you can tell by the sense code that is returned, your tape drive
>> draws no distinction between "Power on", "reset" (i.e. bus reset),
>> or "bus device reset" and is probably returning "Target Busy" because
>> it is going through self test.  Any information regarding tape position
>> is almost certainly lost as is probably the case for the compression/density
>> settings.  The "st" driver should be able to restore the drive to
>> the previous condition though since it knows all of the information
>> to do so.  This is a bug.
>
>Are you sure?  I thought so at first too, but then it occurred to me
>that after the reset, the tape drive would probably not have enough
>information to continue.  For example, it would probably have cleared
>its buffer memory (this is a DDS-2 drive).

The device should flush any cached information for previous commands
completed with "good status" to the media in the process of handling
the reset.  Any device that doesn't do this is broken.

>In this connection, it's interesting to report how I tried to recover
>from the problem.  I'm writing several files to a non-rewinding
>device, and lately they've been dying in the same file.  I check the
>return status from tar, and if it's non-0, do a bsf 1, an fsf 1, and
>restart the tar.  The first bsf 1 always fails, apparently because the
>drive doesn't know where it is.  The second bsf 1 succeeds.

The first one probably fails because the device isn't ready.  What error
is reported on the console?

>Greg

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================