From owner-freebsd-hackers  Sun May 21 05:28:30 1995
Return-Path: hackers-owner
Received: (from majordom@localhost)
          by freefall.cdrom.com (8.6.10/8.6.6) id FAA19242
          for hackers-outgoing; Sun, 21 May 1995 05:28:30 -0700
Received: from hda.com (hda.com [199.232.40.182])
          by freefall.cdrom.com (8.6.10/8.6.6) with ESMTP id FAA19236
          ; Sun, 21 May 1995 05:28:25 -0700
Received: (dufault@localhost) by hda.com (8.6.9/8.3) id IAA17587; Sun, 21 May 1995 08:28:50 -0400
From: Peter Dufault <dufault@hda.com>
Message-Id: <199505211228.IAA17587@hda.com>
Subject: Re: kern/430: bug in tape drivers
To: bugs@ns1.win.net (Mark Hittinger)
Date: Sun, 21 May 1995 08:28:49 -0400 (EDT)
Cc: hackers@FreeBSD.org, julian@FreeBSD.org
In-Reply-To: <199505200134.VAA07349@ns1.win.net> from "Mark Hittinger" at May 19, 95 09:34:07 pm
X-Mailer: ELM [version 2.4 PL24]
Content-Type: text
Content-Length: 2488      
Sender: hackers-owner@FreeBSD.org
Precedence: bulk

Mark Hittinger writes:
> 
> >Number:         430
> >Category:       kern
> >Synopsis:       SCSI Tape dont work
> >Originator:     Charles Henrich (MSU)
> >Release:        FreeBSD 2.1.0-Development i386
> >
> >	ALR Dual Pentium, BT747 SCSI-2, Connor DDS-2 Dat, 3 Seagate Hawk 2gig
>                           ^^^^^
> >	drives.
> >
> >Description:
> >
> >	90% of the time you access the dat drive via dump, FreeBSD goes off
> >	and scrambles the other disks in the system.  This sucks, and has
> >	happened to me several times.
> >

I think that the tape drive is tying up the SCSI bus (and
maybe therefore the host adapter?) for some reason.

> I have seen the same problem since 2.0R.  I have a WangDAT3400DX.  When a
> process closes the tape drive I get "bt0a: try to abort".  I believe this
> is due to the lengthy rewind, although recently I noted that there was a
> problem with scsi commands that contained no data.   In any event I
> still see the problem in -current.  I will try a 2940 controller this
> weekend and see if the problem exists there.

As I mentioned, zero length commands aren't an issue.

> After a few "bt0a try to abort" I get a "bt0a abort timed out".  It is
> at this point that horrible things happen.  The driver corrupts the ccb
> chain and bit sprays your disks.  If the rewind finishes before the
> "bt0a abort timed out" then no badness happens to your disks.

You get more than one "bt0: Try to abort" message?  That is
probably the SCSI system aborting the ongoing disk transfers that aren't
completing due to the problem with the tape drive, since you will
only get one "Try to abort" message per aborted transaction.

I'm not sure what your workaround does:  you end up stretching out
the "Try to abort" time until the drive finishes and "unlocks"
the host adapter.  So you've tried to abort a few transfers.  Did they
abort?  I don't know.  Do you wind up getting a disk retry per
abort message after this?

Anyway, if the "abort timed out" happens we toss that active CCB back
onto the freelist and the next SCSI transaction will get that same
CCB.  This is probably a mistake.  Letting such CCBs leak off into the
bit bucket could eventually hang the system, but tossing them back
so that they wind up being reused while the adapter may still be
working on them may be what is trashing the disk.
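
To make the reuse hazard concrete, here is a minimal sketch in C.  It
is not the bt driver code: the names (ccb_pool, ccb_get, ccb_free,
ccb_abort_timed_out, CCB_QUARANTINED), the pool size, and the LIFO free
list are all assumptions made for illustration.  It only shows that a
CCB put back on the free list after a timed-out abort is the very next
one handed out, while a quarantined CCB is never handed out again.

```c
#include <assert.h>
#include <stddef.h>

#define NCCB 4                      /* hypothetical pool size */

enum ccb_state { CCB_FREE, CCB_ACTIVE, CCB_QUARANTINED };

struct ccb {
    enum ccb_state state;
    struct ccb *next;               /* free-list link */
};

struct ccb_pool {
    struct ccb ccbs[NCCB];
    struct ccb *free_head;          /* LIFO free list (assumed) */
    int quarantined;                /* CCBs deliberately leaked */
};

/* Put every CCB on the free list, ccbs[0] at the head. */
void pool_init(struct ccb_pool *p)
{
    p->free_head = NULL;
    p->quarantined = 0;
    for (int i = NCCB - 1; i >= 0; i--) {
        p->ccbs[i].state = CCB_FREE;
        p->ccbs[i].next = p->free_head;
        p->free_head = &p->ccbs[i];
    }
}

/* Take a CCB for a new transaction; NULL when the pool is exhausted. */
struct ccb *ccb_get(struct ccb_pool *p)
{
    struct ccb *c = p->free_head;
    if (c == NULL)
        return NULL;
    p->free_head = c->next;
    c->next = NULL;
    c->state = CCB_ACTIVE;
    return c;
}

/* Return a CCB to the head of the free list. */
void ccb_free(struct ccb_pool *p, struct ccb *c)
{
    c->state = CCB_FREE;
    c->next = p->free_head;
    p->free_head = c;
}

/*
 * Called when the abort itself timed out, so the adapter may still
 * own this CCB.  quarantine == 0 puts it back on the free list for
 * immediate reuse; quarantine != 0 leaks it on purpose so that no
 * later transaction can be handed a CCB the adapter may still touch.
 */
void ccb_abort_timed_out(struct ccb_pool *p, struct ccb *c, int quarantine)
{
    assert(c != NULL && c->state == CCB_ACTIVE);
    if (quarantine) {
        c->state = CCB_QUARANTINED;
        p->quarantined++;
    } else {
        ccb_free(p, c);
    }
}
```

With quarantine off, the next ccb_get() returns the CCB whose abort
just timed out.  With it on, the pool shrinks by one per timed-out
abort, which is the "potentially hanging the system" cost.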

Peter
-- 
Peter Dufault               Real Time Machine Control and Simulation
HD Associates, Inc.         Voice: 508 433 6936
dufault@hda.com             Fax:   508 433 5267