From owner-freebsd-scsi Tue Aug 19 09:57:12 1997
Return-Path:
Received: (from root@localhost)
        by hub.freebsd.org (8.8.5/8.8.5) id JAA18834
        for freebsd-scsi-outgoing; Tue, 19 Aug 1997 09:57:12 -0700 (PDT)
Received: from pluto.plutotech.com (root@mail.plutotech.com [206.168.67.137])
        by hub.freebsd.org (8.8.5/8.8.5) with ESMTP id JAA18823
        for ; Tue, 19 Aug 1997 09:57:07 -0700 (PDT)
Received: from narnia.plutotech.com (narnia.plutotech.com [206.168.67.130])
        by pluto.plutotech.com (8.8.5/8.8.5) with ESMTP id KAA24228;
        Tue, 19 Aug 1997 10:54:29 -0600 (MDT)
Message-Id: <199708191654.KAA24228@pluto.plutotech.com>
X-Mailer: exmh version 2.0zeta 7/24/97
To: Greg Lehey
cc: FreeBSD SCSI Mailing List
Subject: Re: Bus resets. Grrrr.
In-reply-to: Your message of "Tue, 19 Aug 1997 15:30:23 +0930."
        <19970819153023.02433@lemis.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Date: Tue, 19 Aug 1997 10:53:54 -0600
From: "Justin T. Gibbs"
Sender: owner-freebsd-scsi@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

>> What version of the kernel are you using
>
>Recent versions of -current.  The ones I reported it against were some
>time last week.  I've just rebuilt with a version supped this morning.

And it is still reproducible?

>> The message simply indicates the state of the SCSI bus at the time the
>> timeout occurred...
>
>Aha (oops, ahc).  So the state isn't usually very relevant to the
>problem?

It depends on the problem.  If the sequencer code has a protocol bug,
it is usually indicated by a hang or timeout in a particular bus state.

>> So, what does a "timeout while idle" tell us?  Well, it means that either
>> the timeout that the type driver (in this case the "st" driver)
>
>In fact, this was the sd driver, specifically sd0.  It always seems to
>be sd0, although I have 3 disks connected to the bus, which tends to
>confirm the theory that there is something wrong with the physical
>bus.  It could also, of course, indicate that the disk is dying.

Hmmm.
How many devices are active at the time that the timeout occurs?  Since
you are not using tagged queuing, you would need 5 devices active at a
time to overflow the QOUTFIFO (the bug that I fixed recently) on an
aic7860-based controller.

>> specified was too short, or the aic7xxx driver lost the command
>> somewhere either in route to or from the device.  The latter problem
>> did occur under heavy load prior to my latest "spin lock" change to
>> the driver.
>
>When was that?  Would it also have the effect that the abort message
>wouldn't be taken?

The abort probably was taken, but the tape drive took a long time to
release the bus, which was why the bus was reset.  I put my fix in the
kernel on 8/13, in rev 1.121 of aic7xxx.c.

>> The first problem seems really common in the st driver especially
>> when older media or a rewind operation is involved.  You can try
>> bumping up the timeouts in sys/scsi/st.c to see if this solves your
>> problem.
>
>As I said, this wasn't a tape device timeout.  In any case, this
>always seems to happen when the tape is writing, which makes it look
>more like the heavy load scenario.

Could it be that you don't have disconnections enabled for your tape
drive?  You should check both SCSI-Select for the 2940 and any relevant
jumpers on the tape drive itself.  If disconnections are disabled, a
tape write that required multiple retries could easily tie up the SCSI
bus for the 10s needed to make a disk command time out.

>> What it means is that the tape drive accepted the connection from
>> the controller, most likely accepted the ABORT message, but took
>> longer than the driver allowed for it to process the abort request,
>> free the bus, and thus signal that the abort was successful.  So,
>> we take out the hammer and reset the bus.  The timeout in the
>> aic7xxx driver for abort requests may be too short.
>
>Would this still be the case for disks?
>

The analysis is the same regardless of the device type.
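
To make the escalation discussed above concrete, here is a minimal
Python sketch of the order of events: a command times out, an ABORT is
sent, and the bus is reset only if the abort itself is not completed in
time.  This is not the aic7xxx driver code.  The function name, the
enum, and the 2-second abort timeout used in the example are
hypothetical; only the 10s disk command timeout comes from this thread.

```python
from enum import Enum


class Recovery(Enum):
    """Strongest recovery step taken for one command (illustrative only)."""
    NONE = "no action"
    ABORT = "abort completed"
    BUS_RESET = "bus reset"


def escalate(cmd_elapsed, cmd_timeout, abort_elapsed, abort_timeout):
    """Model the timeout -> ABORT -> bus reset sequence.

    All arguments are seconds.  abort_elapsed is how long the target
    took to process the ABORT and free the bus; it is only consulted
    once the original command has already timed out.
    """
    if cmd_elapsed <= cmd_timeout:
        return Recovery.NONE       # command finished in time
    if abort_elapsed <= abort_timeout:
        return Recovery.ABORT      # target freed the bus soon enough
    return Recovery.BUS_RESET      # "take out the hammer"


# Hypothetical numbers: 10s command timeout, 2s abort timeout.
print(escalate(12.0, 10.0, 5.0, 2.0).value)
```

In this model a slow target is indistinguishable from one that never
accepted the ABORT, which is why lengthening the abort timeout is one
of the suggested things to try.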

>> As you can tell by the sense code that is returned, your tape drive
>> draws no distinction between "Power on", "reset" (i.e. bus reset),
>> or "bus device reset" and is probably returning "Target Busy" because
>> it is going through self test.  Any information regarding tape position
>> is almost certainly lost as is probably the case for the compression/density
>> settings.  The "st" driver should be able to restore the drive to
>> the previous condition though since it knows all of the information
>> to do so.  This is a bug.
>
>Are you sure?  I thought so at first too, but then it occurred to me
>that after the reset, the tape drive would probably not have enough
>information to continue.  For example, it would probably have cleared
>its buffer memory (this is a DDS-2 drive).

The device should flush any cached information for previous commands
completed with "good status" to the media in the process of handling
the reset.  Any device that doesn't do this is broken.

>In this connection, it's interesting to report how I tried to recover
>from the problem.  I'm writing several files to a non-rewinding
>device, and lately they've been dying in the same file.  I check the
>return status from tar, and if it's non-0, do a bsf 1, an fsf 1, and
>restart the tar.  The first bsf 1 always fails, apparently because the
>drive doesn't know where it is.  The second bsf 1 succeeds.

The first one probably fails because the device isn't ready.  What
error is reported on the console?

>Greg

--
Justin T. Gibbs
===========================================
  FreeBSD: Turning PCs into workstations
===========================================
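
The recovery sequence Greg describes (a bsf 1 that fails once and then
succeeds) is the usual pattern of a device that is briefly not ready
after a reset.  A minimal Python sketch of retrying such a step follows.
The helper names, the retry count, and the delay are hypothetical and
not taken from the thread; mt_step assumes the standard
`mt -f device command count` invocation.

```python
import subprocess
import time


def retry_until_ok(step, attempts=3, delay=1.0):
    """Call step() until it returns 0 or attempts run out.

    Returns the 1-based attempt number that succeeded, or None if
    every attempt returned a non-zero status.
    """
    for attempt in range(1, attempts + 1):
        if step() == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay)  # give the drive time to become ready
    return None


def mt_step(device, command, count=1):
    """Build a step that runs mt(1) and returns its exit status."""
    return lambda: subprocess.call(["mt", "-f", device, command, str(count)])
```

With a non-rewinding tape device (the name /dev/nrst0 here is only an
assumption), `retry_until_ok(mt_step("/dev/nrst0", "bsf"))` would repeat
the bsf 1 until it succeeds, after which the fsf 1 and the tar restart
can follow.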