From owner-freebsd-scsi  Mon Nov 10 02:19:05 1997
Return-Path: <owner-freebsd-scsi>
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id CAA20192
          for freebsd-scsi-outgoing; Mon, 10 Nov 1997 02:19:05 -0800 (PST)
          (envelope-from owner-freebsd-scsi)
Received: from bubble.didi.com (sjx-ca71-09.ix.netcom.com [207.92.177.73])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id CAA20184;
          Mon, 10 Nov 1997 02:18:59 -0800 (PST)
          (envelope-from asami@sunset.cs.berkeley.edu)
Received: (from asami@localhost)
	by bubble.didi.com (8.8.7/8.8.7) id CAA13278;
	Mon, 10 Nov 1997 02:18:56 -0800 (PST)
	(envelope-from asami)
Date: Mon, 10 Nov 1997 02:18:56 -0800 (PST)
Message-Id: <199711101018.CAA13278@bubble.didi.com>
To: gibbs@freebsd.org
CC: scsi@freebsd.org, stable@freebsd.org
Reply-to: scsi@freebsd.org
Subject: timed out while idle
From: asami@cs.berkeley.edu (Satoshi Asami)
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

(Reply-to: set to -scsi)

Justin (and whoever else can help),

I've done some real stress tests on our NFS server and found that the
crashes I've been reporting on -stable and the IBM disks going to
"sleep" are related.  It always starts like this, under heavy load
(usually when there are a lot of NFS clients issuing random requests):

===
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Ordered Tag queued
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x7 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
sd6(ahc1:13:0): Queueing an Abort SCB
sd6(ahc1:13:0): Abort Message Sent
sd6(ahc1:13:0): SCB 3 - Abort Tag Completed.
sd6(ahc1:13:0): no longer in timeout
Ordered Tag sent
ahc1: target 13 synchronous at 10.0MHz, offset = 0x8
sd6(ahc1:13:0): UNIT ATTENTION asc:29,0
sd6(ahc1:13:0):  Power on, reset, or bus device reset occurred
===

At this point the machine either crashes or keeps running.  If it
crashes, the crashdump is of very little help.  The stack trace differs
from crash to crash, and the only clue it offers is that the machine
died doing something with NFS.

If it keeps running, this disk goes into the "NOT READY" state I've
reported before.  (Thanks Peter, but I haven't had the time to try
your hook. :<)  Sometimes it will come back if I do a "scsi -r -f
/dev/rsd6c", sometimes it will say "device not configured".  When I
reboot the machine, it usually comes back, but sometimes it will die in
fsck saying the disk is not ready (usually the same disk).

I thought about using Peter's hook or writing a program to monitor
syslog and issue a reprobe, but if the machine crashes before the disk
goes into the "NOT READY" state, that's not going to help much.  The
disks identify themselves as:

===
ahc1: target 8 using 16Bit transfers
ahc1: target 8 synchronous at 10.0MHz, offset = 0x8
ahc1: target 8 Tagged Queuing Device
(ahc1:8:0): "IBM OEM DCHS09Y 2424" type 0 fixed SCSI 2
sd1(ahc1:8:0): Direct-Access 8689MB (17796077 512 byte sectors)
===
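
For reference, the syslog monitor mentioned above could look roughly
like the following.  This is only a minimal sketch in Python, not
something that has been run on the server.  It assumes the timeout
message appears in syslog in the same form as the kernel output quoted
earlier, and that disk sdN maps to /dev/rsdNc as in the "scsi -r -f
/dev/rsd6c" command.  The function names are made up for illustration.

```python
import re
import subprocess

# Matches the kernel message quoted earlier, e.g.
# "sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, ..."
# Assumption: syslog keeps this text unchanged after its own prefix.
TIMEOUT_RE = re.compile(
    r"\b(sd\d+)\(ahc\d+:\d+:\d+\): SCB \S+ - timed out while idle"
)


def timed_out_disks(lines):
    """Return disk names (e.g. 'sd6') from timeout messages, first-seen order."""
    seen = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def reprobe_command(disk):
    """Build the reprobe command for one disk, following 'scsi -r -f /dev/rsd6c'."""
    return ["scsi", "-r", "-f", "/dev/r%sc" % disk]


def reprobe(disk, run=subprocess.call):
    """Run the reprobe and return its exit status; 'run' can be swapped for testing."""
    return run(reprobe_command(disk))
```

Feeding it the syslog lines and calling reprobe() on each name returned
by timed_out_disks() would automate the manual step.  It obviously does
nothing for the case where the machine crashes first.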

Do you have any idea what's going on?  Does this sound like a firmware
bug?  Do you think you can find the problem if I give you access to
the machine?

Satoshi