From owner-freebsd-scsi Mon Aug 18 23:01:07 1997 Return-Path: Received: (from root@localhost) by hub.freebsd.org (8.8.5/8.8.5) id XAA13315 for freebsd-scsi-outgoing; Mon, 18 Aug 1997 23:01:07 -0700 (PDT) Received: from nico.telstra.net (nico.telstra.net [139.130.204.16]) by hub.freebsd.org (8.8.5/8.8.5) with SMTP id XAA13306 for ; Mon, 18 Aug 1997 23:01:01 -0700 (PDT) Received: from freebie.lemis.com (gregl1.lnk.telstra.net [139.130.136.133]) by nico.telstra.net (8.6.10/8.6.10) with ESMTP id QAA05099; Tue, 19 Aug 1997 16:00:25 +1000 Received: (grog@localhost) by freebie.lemis.com (8.8.7/8.6.12) id PAA00370; Tue, 19 Aug 1997 15:30:24 +0930 (CST) Message-ID: <19970819153023.02433@lemis.com> Date: Tue, 19 Aug 1997 15:30:23 +0930 From: Greg Lehey To: "Justin T. Gibbs" Cc: FreeBSD SCSI Mailing List Subject: Re: Bus resets. Grrrr. References: <199708170129.KAA03776@freebie.lemis.com> <199708190359.VAA10011@pluto.plutotech.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.81e In-Reply-To: <199708190359.VAA10011@pluto.plutotech.com>; from Justin T. Gibbs on Mon, Aug 18, 1997 at 09:59:02PM -0600 Organisation: LEMIS, PO Box 460, Echunga SA 5153, Australia Phone: +61-8-8388-8250 Fax: +61-8-8388-8250 Mobile: +61-41-739-7062 WWW-Home-Page: http://www.lemis.com/~grog Sender: owner-freebsd-scsi@FreeBSD.ORG X-Loop: FreeBSD.org Precedence: bulk On Mon, Aug 18, 1997 at 09:59:02PM -0600, Justin T. Gibbs wrote: >> This is the third time in a row that I haven't been able to complete a backup >> because of "recoverable" SCSI errors. Here's a pretty typical scenario: > > What version of the kernel are you using Recent versions of -current. The ones I reported it against were some time last week. I've just rebuilt with a version supped this morning. > with what AHC options enabled in your kernel config file? The version I was using when I got these messages had the following relevant options: controller ahc0 controller aha0 at isa? port "IO_AHA0" bio irq ? drq 5 vector ahaintr controller aha1 at isa? port 0x334 bio irq ? drq 7 vector ahaintr controller scbus0 device sd0 device st0 device cd0 #Only need one of these, the code dynamically grows options SCSIDEBUG options SCSI_REPORT_GEOMETRY The machine does, in fact, have two SCSI controllers (the 2940A and a 1542B). At the moment I'm not having any problems on the 1542B, with the exception of lack of bounce buffers for the CD-ROM changer connected to it. >> Aug 17 10:27:19 freebie /kernel: sd0: SCB 0x4 - timed out while idle, LASTPHAS >> E == 0x1, SCSISIGI == 0x0 >> >> What does this mean? What can time out when nothing's happening? Or is >> this a timeout accepting a new command when it shouldn't have to? Is this >> a device or a driver logic error? > > The message simply indicates the state of the SCSI bus at the time the > timeout occured. In this case, the sequencer was sitting in it's idle > loop waiting for new work to be queued by the kernel driver or for a > device to reconnect and continue a previously established connection. > To understand why a timeout can occur in this state or any other, it's > probably best to explain the timeout model that is used in FreeBSD. > The "type" drivers (e.g. sd, cd, st, worm, etc) specifies a timeout value > for each command it issues. When the adaptec driver queues a command > to the controller, it sets up a "callout" that will occur timeout ms > in the future for the just queued command. In this particular case, > that "callout" fired while the sequencer wasn't working on any > commands. Aha (oops, ahc). So the state isn't usually very relevant to the problem? > So, what does a "timeout while idle" tell us? Well, it means that either > the timeout that the type driver (in this case the "st" driver) In fact, this was the sd driver, specifically sd0. It always seems to be sd0, although I have 3 disks connected to the bus, which tends to confirm the theory that there is something wrong with the physical bus. It could also, of course, indicate that the disk is dying. > specified was too short, or the aic7xxx driver lost the command > somewhere either in route to or from the device. The latter problem > did occur under heavy load prior to my latest "spin lock" change to > the driver. When was that? Would it also have the effect thtat the abort message wouldn't be taken? > The first problem seems really common in the st driver especially > when older media or a rewind operation is involved. You can try > bumping up the timeouts in sys/scsi/st.c to see if this solves your > problem. As I said, this wasn't a tape device timeout. In any case, this always seems to happen when the tape is writing, which makes it look more like the heavy load scenario. >> Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9 SCSISEQ = 0x12 SSTAT0 = 0x5 SST >> AT1 = 0xa >> Aug 17 10:27:31 freebie /kernel: sd0: Queueing an Abort SCB >> Aug 17 10:27:31 freebie /kernel: sd0: Abort Message Sent >> Aug 17 10:27:31 freebie /kernel: sd0: SCB 0x4 - timed out in message out phase >> , SCSISIGI == 0xa4 >> Aug 17 10:27:31 freebie /kernel: SEQADDR = 0x9a SCSISEQ = 0x12 SSTAT0 = 0x5 SS >> TAT1 = 0x2 >> >> If I understand this correctly, this means that the abort SCB wasn't received >> either, so the driver does a bus reset: > > What it means is that the tape drive accepted the connection from > the controller, most likely accepted the ABORT message, but took > longer than the driver allowed for it to process the abort request, > free the bus, and thus signal that the abort was successful. So, > we take out the hammer and reset the bus. The timeout in the > aic7xxx driver for abort requests may be too short. Would this still be the case for disks? >> Aug 17 10:27:31 freebie /kernel: ahc0: Issued Channel A Bus Reset. 3 SCBs abor >> ted >> Aug 17 10:27:32 freebie /kernel: Clearing bus reset >> Aug 17 10:27:32 freebie /kernel: Clearing 'in-reset' flag >> Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout >> >> ... which works. >> >> Aug 17 10:27:32 freebie /kernel: sd0: SCB 0x4 - timed out in command phase, SC >> SISIGI == 0x84 >> >> So why do we get another timeout? Or is this overlapping? > > This is due to the way that timeout process occurs in the current driver. > If I have 20 operations outstanding, each with a timeout queued, those > timeouts are still in effect while I am doing recovery processing. So, > even if I just unwedged the bus, a timeout for a pending command can > fire "logically too soon". OK. >> Aug 17 10:27:32 freebie /kernel: SEQADDR = 0x42 SCSISEQ = 0x12 SSTAT0 = 0x7 SS >> TAT1 = 0x2 >> Aug 17 10:27:32 freebie /kernel: sd0: abort message in message buffer >> Aug 17 10:27:32 freebie /kernel: sd1: SCB 0x3 timedout while recovery in progr >> ess >> Aug 17 10:27:32 freebie /kernel: sd0: SCB 1 - Abort Completed. >> Aug 17 10:27:32 freebie /kernel: sd0: no longer in timeout >> Aug 17 10:27:32 freebie /kernel: sd1: UNIT ATTENTION asc:29,0 >> Aug 17 10:27:32 freebie /kernel: sd1: Power on, reset, or bus device reset oc >> curred >> Aug 17 10:27:32 freebie /kernel: , retries:3 >> >> So sd3 complains, but carries on with no harm done, > > Yup. The "sd" type driver seems to handle a bus reset gracefully. > >> Aug 17 10:27:32 freebie /kernel: st0: UNIT ATTENTION asc:29,0 >> Aug 17 10:27:32 freebie /kernel: st0: Power on, reset, or bus >> device reset occurred >> Aug 17 10:27:32 freebie /kernel: st0: Target Busy >> >> but the tape dies. Is there a good reason for this? I would have >> thought that it would make sense for a power on or reset, but not >> for a bus reset. Does a tape unit lose its position or data when >> it receives a bus reset? > > As you can tell by the sense code that is returned, your tape drive > draws no distinction between "Power on", "reset" (i.e. bus reset), > or "bus device reset" and is probably returning "Target Busy" because > it is going through self test. Any information regarding tape position > is almost certainly lost as is probably the case for the compression/density > settings. The "st" driver should be able to restore the drive to > the previous condition though since it knows all of the information > to do so. This is a bug. Are you sure? I thought so at first too, but then it occurred to me that after the reset, the tape drive would probably not have enough information to continue. For example, it would probably have cleared its buffer memory (this is a DDS-2 drive). In this connection, it's interesting to report how I tried to recover from the problem. I'm writing several files to a non-rewinding device, and lately they've been dying in the same file. I check the return status from tar, and if it's non-0, do a bsf 1, an fsf 1, and restart the tar. The first bsf 1 always fails, apparently because the drive doesn't know where it is. The second bsf 1 succeeds. Greg