From owner-freebsd-scsi  Mon Aug 18 23:01:07 1997
Return-Path: <owner-freebsd-scsi>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.5/8.8.5) id XAA13315
          for freebsd-scsi-outgoing; Mon, 18 Aug 1997 23:01:07 -0700 (PDT)
Received: from nico.telstra.net (nico.telstra.net [139.130.204.16])
          by hub.freebsd.org (8.8.5/8.8.5) with SMTP id XAA13306
          for <freebsd-scsi@FreeBSD.ORG>; Mon, 18 Aug 1997 23:01:01 -0700 (PDT)
Received: from freebie.lemis.com (gregl1.lnk.telstra.net [139.130.136.133]) by nico.telstra.net (8.6.10/8.6.10) with ESMTP id QAA05099; Tue, 19 Aug 1997 16:00:25 +1000
Received: (grog@localhost) by freebie.lemis.com (8.8.7/8.6.12) 
       id PAA00370; Tue, 19 Aug 1997 15:30:24 +0930 (CST)
Message-ID: <19970819153023.02433@lemis.com>
Date: Tue, 19 Aug 1997 15:30:23 +0930
From: Greg Lehey <grog@lemis.com>
To: "Justin T. Gibbs" <gibbs@plutotech.com>
Cc: FreeBSD SCSI Mailing List <freebsd-scsi@FreeBSD.ORG>
Subject: Re: Bus resets. Grrrr.
References: <199708170129.KAA03776@freebie.lemis.com> <199708190359.VAA10011@pluto.plutotech.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.81e
In-Reply-To: <199708190359.VAA10011@pluto.plutotech.com>; from Justin T. Gibbs on Mon, Aug 18, 1997 at 09:59:02PM -0600
Organisation: LEMIS, PO Box 460, Echunga SA 5153, Australia
Phone: +61-8-8388-8250
Fax: +61-8-8388-8250
Mobile: +61-41-739-7062
WWW-Home-Page: http://www.lemis.com/~grog
Sender: owner-freebsd-scsi@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

On Mon, Aug 18, 1997 at 09:59:02PM -0600, Justin T. Gibbs wrote:
>> This is the third time in a row that I haven't been able to complete a backup
>> because of "recoverable" SCSI errors.  Here's a pretty typical scenario:
>
> What version of the kernel are you using

Recent versions of -current.  The ones I reported it against were some
time last week.  I've just rebuilt with a version supped this morning.

> with what AHC options enabled in your kernel config file?

The version I was using when I got these messages had the following
relevant options:

controller      ahc0
controller      aha0    at isa? port "IO_AHA0" bio irq ? drq 5 vector ahaintr
controller      aha1    at isa? port 0x334 bio irq ? drq 7 vector ahaintr
controller      scbus0
device          sd0
device          st0
device          cd0     #Only need one of these, the code dynamically grows
options         SCSIDEBUG
options         SCSI_REPORT_GEOMETRY

The machine does, in fact, have two SCSI controllers (the 2940A and a
1542B).  At the moment I'm not having any problems on the 1542B, with
the exception of lack of bounce buffers for the CD-ROM changer
connected to it.

>> Aug 17 10:27:19 freebie /kernel: sd0: SCB 0x4 - timed out while idle, LASTPHAS
>> E == 0x1, SCSISIGI == 0x0
>>
>> What does this mean?  What can time out when nothing's happening?  Or is
>> this a timeout accepting a new command when it shouldn't have to?  Is this
>> a device or a driver logic error?
>
> The message simply indicates the state of the SCSI bus at the time the
> timeout occured.  In this case, the sequencer was sitting in it's idle
> loop waiting for new work to be queued by the kernel driver or for a
> device to reconnect and continue a previously established connection.
> To understand why a timeout can occur in this state or any other, it's
> probably best to explain the timeout model that is used in FreeBSD.
> The "type" drivers (e.g. sd, cd, st, worm, etc) specifies a timeout value
> for each command it issues.  When the adaptec driver queues a command
> to the controller, it sets up a "callout" that will occur timeout ms
> in the future for the just queued command.  In this particular case,
> that "callout" fired while the sequencer wasn't working on any
> commands.

Aha (oops, ahc).  So the state isn't usually very relevant to the
problem?

> So, what does a "timeout while idle" tell us?  Well, it means that either
> the timeout that the type driver (in this case the "st" driver) 

In fact, this was the sd driver, specifically sd0.  It always seems to
be sd0, although I have 3 disks connected to the bus, which tends to
confirm the theory that there is something wrong with the physical
bus.  It could also, of course, indicate that the disk is dying.

> specified was too short, or the aic7xxx driver lost the command
> somewhere either in route to or from the device.  The latter problem
> did occur under heavy load prior to my latest "spin lock" change to
> the driver.

When was that?  Would it also have the effect thtat the abort message
wouldn't be taken?

> The first problem seems really common in the st driver especially
> when older media or a rewind operation is involved.  You can try
> bumping up the timeouts in sys/scsi/st.c to see if this solves your
> problem.

As I said, this wasn't a tape device timeout.  In any case, this
always seems to happen when the tape is writing, which makes it look
more like the heavy load scenario.

>> Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9 SCSISEQ = 0x12 SSTAT0 = 0x5 SST
>> AT1 = 0xa
>> Aug 17 10:27:31 freebie /kernel: sd0: Queueing an Abort SCB
>> Aug 17 10:27:31 freebie /kernel: sd0: Abort Message Sent
>> Aug 17 10:27:31 freebie /kernel: sd0: SCB 0x4 - timed out in message out phase
>> , SCSISIGI == 0xa4
>> Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9a SCSISEQ = 0x12 SSTAT0 = 0x5 SS
>> TAT1 = 0x2
>>
>> If I understand this correctly, this means that the abort SCB wasn't received
>> either, so the driver does a bus reset:
>
> What it means is that the tape drive accepted the connection from
> the controller, most likely accepted the ABORT message, but took
> longer than the driver allowed for it to process the abort request,
> free the bus, and thus signal that the abort was successful.  So,
> we take out the hammer and reset the bus.  The timeout in the
> aic7xxx driver for abort requests may be too short.

Would this still be the case for disks?

>> Aug 17 10:27:31 freebie /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs abor
>> ted
>> Aug 17 10:27:32 freebie /kernel: Clearing bus reset
>> Aug 17 10:27:32 freebie /kernel: Clearing 'in-reset' flag
>> Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>>
>> ... which works.
>>
>> Aug 17 10:27:32 freebie /kernel: sd0: SCB 0x4 - timed out in command phase, SC
>> SISIGI == 0x84
>>
>> So why do we get another timeout?  Or is this overlapping?
>
> This is due to the way that timeout process occurs in the current driver.
> If I have 20 operations outstanding, each with a timeout queued, those
> timeouts are still in effect while I am doing recovery processing.  So,
> even if I just unwedged the bus, a timeout for a pending command can
> fire "logically too soon".

OK.

>> Aug 17 10:27:32 freebie /kernel: SEQADDR = 0x42 SCSISEQ = 0x12 SSTAT0 = 0x7 SS
>> TAT1 = 0x2
>> Aug 17 10:27:32 freebie /kernel: sd0: abort message in message buffer
>> Aug 17 10:27:32 freebie /kernel: sd1: SCB 0x3 timedout while recovery in progr
>> ess
>> Aug 17 10:27:32 freebie /kernel: sd0: SCB 1 - Abort Completed.
>> Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout
>> Aug 17 10:27:32 freebie /kernel: sd1: UNIT ATTENTION asc:29,0
>> Aug 17 10:27:32 freebie /kernel: sd1:  Power on, reset, or bus device reset oc
>> curred
>> Aug 17 10:27:32 freebie /kernel: , retries:3
>>
>> So sd3 complains, but carries on with no harm done,
>
> Yup.  The "sd" type driver seems to handle a bus reset gracefully.
>
>> Aug 17 10:27:32 freebie /kernel: st0: UNIT ATTENTION asc:29,0
>> Aug 17 10:27:32 freebie /kernel: st0:  Power on, reset, or bus
>>                                       device reset occurred
>> Aug 17 10:27:32 freebie /kernel: st0: Target Busy
>>
>> but the tape dies.  Is there a good reason for this?  I would have
>> thought that it would make sense for a power on or reset, but not
>> for a bus reset.  Does a tape unit lose its position or data when
>> it receives a bus reset?
>
> As you can tell by the sense code that is returned, your tape drive
> draws no distinction between "Power on", "reset" (i.e. bus reset),
> or "bus device reset" and is probably returning "Target Busy" because
> it is going through self test.  Any information regarding tape position
> is almost certainly lost as is probably the case for the compression/density
> settings.  The "st" driver should be able to restore the drive to
> the previous condition though since it knows all of the information
> to do so.  This is a bug.

Are you sure?  I thought so at first too, but then it occurred to me
that after the reset, the tape drive would probably not have enough
information to continue.  For example, it would probably have cleared
its buffer memory (this is a DDS-2 drive).

In this connection, it's interesting to  report how I tried to recover
from  the  problem.   I'm writing   several  files to  a non-rewinding
device, and lately they've been  dying in the same  file.  I check the
return status from tar, and if it's  non-0, do a  bsf 1, an fsf 1, and
restart the tar.  The first bsf 1 always fails, apparently because the
drive doesn't know where it is.  The second bsf 1 succeeds.

Greg