From owner-freebsd-scsi Mon Nov 10 02:19:05 1997
Return-Path:
Received: (from root@localhost)
          by hub.freebsd.org (8.8.7/8.8.7) id CAA20192
          for freebsd-scsi-outgoing; Mon, 10 Nov 1997 02:19:05 -0800 (PST)
          (envelope-from owner-freebsd-scsi)
Received: from bubble.didi.com (sjx-ca71-09.ix.netcom.com [207.92.177.73])
          by hub.freebsd.org (8.8.7/8.8.7) with ESMTP id CAA20184;
          Mon, 10 Nov 1997 02:18:59 -0800 (PST)
          (envelope-from asami@sunset.cs.berkeley.edu)
Received: (from asami@localhost)
          by bubble.didi.com (8.8.7/8.8.7) id CAA13278;
          Mon, 10 Nov 1997 02:18:56 -0800 (PST)
          (envelope-from asami)
Date: Mon, 10 Nov 1997 02:18:56 -0800 (PST)
Message-Id: <199711101018.CAA13278@bubble.didi.com>
To: gibbs@freebsd.org
CC: scsi@freebsd.org, stable@freebsd.org
Reply-to: scsi@freebsd.org
Subject: timed out while idle
From: asami@cs.berkeley.edu (Satoshi Asami)
Sender: owner-freebsd-scsi@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

(Reply-to: set to -scsi)

Justin (and whoever else can help),

I've done some real stress tests on our NFS server and found that the
crashes I've been reporting on -stable and the IBM disks going to
"sleep" are related.

It always starts like this, under heavy load (usually when there are a
lot of NFS clients issuing random requests):

===
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Ordered Tag queued
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x7 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
sd6(ahc1:13:0): Queueing an Abort SCB
sd6(ahc1:13:0): Abort Message Sent
sd6(ahc1:13:0): SCB 3 - Abort Tag Completed.
sd6(ahc1:13:0): no longer in timeout
Ordered Tag sent
ahc1: target 13 synchronous at 10.0MHz, offset = 0x8
sd6(ahc1:13:0): UNIT ATTENTION asc:29,0
sd6(ahc1:13:0): Power on, reset, or bus device reset occurred
===

The machine either crashes at this point, or keeps running.
If it crashes, the crash dump is of very little help. The stack trace
looks random, and the only clue it offers is that the machine died
doing something with NFS.

If it keeps running, this disk goes into the "NOT READY" state I've
reported before. (Thanks Peter, but I haven't had the time to try
your hook. :<) Sometimes it will come back if I do a
"scsi -r -f /dev/rsd6c"; sometimes it will say "device not configured".
When I reboot the machine, it usually comes back, but sometimes it will
die in fsck saying the disk is not ready (usually the same disk).

I thought about using Peter's hook or writing a program to monitor
syslog and issue a reprobe, but if the machine is crashing before the
disk goes into the "NOT READY" state, that's not going to help much.

The disks identify themselves as:

===
ahc1: target 8 using 16Bit transfers
ahc1: target 8 synchronous at 10.0MHz, offset = 0x8
ahc1: target 8 Tagged Queuing Device
(ahc1:8:0): "IBM OEM DCHS09Y 2424" type 0 fixed SCSI 2
sd1(ahc1:8:0): Direct-Access 8689MB (17796077 512 byte sectors)
===

Do you have any idea what's going on? Does this sound like a firmware
bug? Do you think you can find the problem if I give you access to
the machine?

Satoshi
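
A minimal sketch of the syslog-monitor idea mentioned above, for
illustration only. It recognizes the "timed out while idle" line exactly
as quoted in the log and builds (without running) the reprobe command
from the message. The function names are hypothetical, and the device
path pattern /dev/r<dev>c is an assumption generalized from the single
/dev/rsd6c example.

```python
import re

# Matches the timeout line quoted in the log above, e.g.
#   sd6(ahc1:13:0): SCB 0x3 - timed out while idle, ...
# search() is used so that a syslog prefix (timestamp, host) is tolerated.
TIMEOUT_RE = re.compile(
    r"(?P<dev>sd\d+)\((?P<ctrl>ahc\d+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCB (?P<scb>0x[0-9a-fA-F]+) - timed out while idle"
)


def parse_timeout(line):
    """Return the fields of a 'timed out while idle' line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),
        "ctrl": m.group("ctrl"),
        "target": int(m.group("target")),
        "lun": int(m.group("lun")),
        "scb": int(m.group("scb"), 16),
    }


def reprobe_command(dev):
    """Build (but do not run) the reprobe command for a disk such as 'sd6'.

    Assumption: the raw whole-disk node is /dev/r<dev>c, generalized from
    the single /dev/rsd6c example in the message.
    """
    return ["scsi", "-r", "-f", "/dev/r" + dev + "c"]


def commands_for(lines):
    """Yield one reprobe command per distinct disk that reported a timeout."""
    seen = set()
    for line in lines:
        hit = parse_timeout(line)
        if hit and hit["dev"] not in seen:
            seen.add(hit["dev"])
            yield reprobe_command(hit["dev"])
```

Feeding the log excerpt above through commands_for() would yield a
single command for sd6, since both timeout lines name the same disk.
As noted in the message, this only helps when the machine survives
long enough for the reprobe to be issued.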