From owner-freebsd-scsi Mon Nov 10 02:19:05 1997
Date: Mon, 10 Nov 1997 02:18:56 -0800 (PST)
Message-Id: <199711101018.CAA13278@bubble.didi.com>
To: gibbs@freebsd.org
CC: scsi@freebsd.org, stable@freebsd.org
Reply-to: scsi@freebsd.org
Subject: timed out while idle
From: asami@cs.berkeley.edu (Satoshi Asami)

(Reply-to: set to -scsi)

Justin (and whoever else can help),

I've done some real stress tests on our NFS server and found that the
crashes I've been reporting on -stable and the IBM disks going to
"sleep" are related.

It always starts like this, under heavy load (usually when there are a
lot of NFS clients issuing random requests):

===
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Ordered Tag queued
sd6(ahc1:13:0): SCB 0x3 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
SEQADDR = 0x7 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
sd6(ahc1:13:0): Queueing an Abort SCB
sd6(ahc1:13:0): Abort Message Sent
sd6(ahc1:13:0): SCB 3 - Abort Tag Completed.
sd6(ahc1:13:0): no longer in timeout
Ordered Tag sent
ahc1: target 13 synchronous at 10.0MHz, offset = 0x8
sd6(ahc1:13:0): UNIT ATTENTION asc:29,0
sd6(ahc1:13:0): Power on, reset, or bus device reset occurred
===

The machine either crashes at this point, or keeps running.
If it crashes, the crashdump is of very little help. The stack trace
is very random and the only clue it offers is that it died doing
something with NFS.

If it keeps running, this disk goes into the "NOT READY" state I've
reported before. (Thanks Peter, but I haven't had the time to try
your hook. :<) Sometimes it will come back if I do a
"scsi -r -f /dev/rsd6c", sometimes it will say "device not
configured". When I reboot the machine, it usually comes back but
sometimes it will die in fsck saying disk is not ready (usually the
same disk).

I thought about using Peter's hook or writing a program to monitor
syslog and issuing a reprobe but if the machine is crashing before it
goes into the "NOT READY" state, it's not going to help much.

The disks identify themselves as:

===
ahc1: target 8 using 16Bit transfers
ahc1: target 8 synchronous at 10.0MHz, offset = 0x8
ahc1: target 8 Tagged Queuing Device
(ahc1:8:0): "IBM OEM DCHS09Y 2424" type 0 fixed SCSI 2
sd1(ahc1:8:0): Direct-Access 8689MB (17796077 512 byte sectors)
===

Do you have any idea what's going on? Does this sound like a firmware
bug? Do you think you can find the problem if I give you access to
the machine?
Satoshi

From owner-freebsd-scsi Thu Nov 13 07:30:24 1997
Message-ID: <19971113162914.44822@interface-business.de>
Date: Thu, 13 Nov 1997 16:29:14 +0100
From: J Wunsch
To: scsi@freebsd.org
Subject: CD-ROM / AHA2940 problems
Reply-To: Joerg Wunsch

After obtaining:

(ahc1:6:0): "TOSHIBA CD-ROM XM-5701TA 0167" type 5 removable SCSI 2
cd0(ahc1:6:0): CD-ROM can't get the size

...i get the following error whenever i start to play a CD in
workman:

Nov 13 16:16:38 ida /kernel.ddb: cd0(ahc1:6:0): SCB 0x1 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
Nov 13 16:16:38 ida /kernel.ddb: SEQADDR = 0x4 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
Nov 13 16:16:39 ida /kernel.ddb: cd0(ahc1:6:0): Queueing an Abort SCB
Nov 13 16:16:39 ida /kernel.ddb: cd0(ahc1:6:0): no longer in timeout

The CD starts to play nevertheless, so i'm not sure whether the error
is benign or not.
(There is no other activity on this SCSI bus at that time; the disks
are on a different bus.) I can read a CD-ROM without problems.

This is with 2.2-stable (built somewhere in September, as it looks),
and with an AHA2940.

Also, this drive doesn't respond to the audio channel volume settings
specified in the CD audio mode page (the previous Plextor drive worked
fine here). Is this the advantage of new technology now?

-- 
J"org Wunsch                                   Unix support engineer
joerg_wunsch@interface-business.de   http://www.interface-business.de/~j

From owner-freebsd-scsi Thu Nov 13 10:32:22 1997
To: scsi@FreeBSD.ORG
Subject: Re: CD-ROM / AHA2940 problems
References: <19971113162914.44822@interface-business.de>
From: dag-erli@ifi.uio.no (Dag-Erling Coidan Smørgrav)
Date: 13 Nov 1997 19:32:01 +0100
In-Reply-To: J Wunsch's message of Thu, 13 Nov 1997 16:29:14 +0100

J Wunsch writes:
> After obtaining:
>
> (ahc1:6:0): "TOSHIBA CD-ROM XM-5701TA 0167" type 5 removable SCSI 2
> cd0(ahc1:6:0): CD-ROM can't get the size

AFAIK this message is normal; it means that there's
no data CD in the drive.

> ...i get the following error whenever i start to play a CD in
> workman:
> [...]
> This is with 2.2-stable (built somewhere in September, as it looks),
> and with an AHA2940.

I have the exact same CD-ROM and the exact same SCSI adapter
(actually, I have both an AHA2940 and an AHA2940UW) and have not
experienced any of the problems you mention. I run FreeBSD 2.2.1R.

I have heard from other people that AHA2940 performance has been
seriously reduced in 2.2.5R, but have not had the opportunity to
observe this myself.

-- 
* Finrod (INTJ) * Unix weenie * dag-erli@ifi.uio.no * cellular +47-92835919 *
RFC1123: "Be liberal in what you accept, and conservative in what you send"

From owner-freebsd-scsi Thu Nov 13 11:55:28 1997
From: Ian Pallfreeman
Message-Id: <199711131954.TAA03123@tharg.eu.org>
Subject: Re: AHC / SCSI Problem?
In-Reply-To: <199711131520.KAA09225@ussenterprise.ufp.org> from Leo Bicknell at "Nov 13, 97 10:20:08 am"
To: bicknell@ufp.org
Date: Thu, 13 Nov 1997 19:54:28 +0000 (GMT)
Cc: scsi@freebsd.org
Reply-To: ip@mcc.ac.uk

On freebsd-hackers, Leo Bicknell wrote:

> We've been having some problems here with FreeBSD,
> Adaptec 2940's, and various disk drives.
> [...]
> ahc0 rev 1 int a irq 9 on pci0:18
> ahc0: aic7860 Single Channel, SCSI Id=7, 3 SCBs
> ahc0 waiting for scsi devices to settle
> (ahc0:0:0): "MICROP 4221-09 1128RQAV RQAV" type 0 fixed SCSI 2
> sd0(ahc0:0:0): Direct-Access 1955MB (4004219 512 byte sectors)
> sd1(ahc0:1:0): SCB 0x2 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
> SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa
> sd1(ahc0:1:0): Queueing an Abort SCB
> sd1(ahc0:1:0): Abort Message Sent
> sd1(ahc0:1:0): SCB 2 - Abort Completed.
> sd1(ahc0:1:0): no longer in timeout

FWIW, I'm also having incredible problems with a similar setup. The
Micropolis disks seem fine with an NCR controller, and the Adaptecs
are OK with a pile of old Sun disks.

Symptoms include those you describe, and worse: fsck bombs with an
intermittent SIGFPE, a ``make world'' scribbles bits of garbage into
the binary of ``make'' (the first thing it builds). Frankly, it's
driving me nuts.

I would expect that if there were gotchas with Micropolis disks and
Adaptec controllers, the folks on freebsd-scsi would know. I'm CC'ing
a copy of this there (not to -hackers) in the hope of reassurance or
mocking laughter, either of which might provide some guidance. :-)

Ian.
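
The timeout reports quoted in the messages above all share the same two-line layout, so they are easy to pick out of syslog mechanically. What follows is only a rough sketch in Python, assuming the message format stays exactly as quoted in this thread; the names TIMEOUT_RE, REG_RE and parse_timeout are made up for illustration and are not part of any FreeBSD tool. It expects the text of a single report.

```python
import re

# First line of a report, e.g.
#   sd1(ahc0:1:0): SCB 0x2 - timed out while idle, LASTPHASE == 0x1, SCSISIGI == 0x0
# The layout is taken from the messages quoted in this thread (an assumption,
# not a documented format).
TIMEOUT_RE = re.compile(
    r"(?P<dev>\w+)\((?P<ctrl>\w+):(?P<target>\d+):(?P<lun>\d+)\): "
    r"SCB (?P<scb>0x[0-9a-fA-F]+) - timed out while idle, "
    r"LASTPHASE == (?P<lastphase>0x[0-9a-fA-F]+), "
    r"SCSISIGI == (?P<scsisigi>0x[0-9a-fA-F]+)"
)

# Second line of a report: "SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa"
REG_RE = re.compile(r"(SEQADDR|SCSISEQ|SSTAT0|SSTAT1) = (0x[0-9a-fA-F]+)")


def parse_timeout(text):
    """Parse one "timed out while idle" report; return a dict, or None if absent."""
    match = TIMEOUT_RE.search(text)
    if match is None:
        return None
    info = {
        "dev": match.group("dev"),
        "controller": match.group("ctrl"),
        "target": int(match.group("target")),
        "lun": int(match.group("lun")),
        "scb": int(match.group("scb"), 16),
        "lastphase": int(match.group("lastphase"), 16),
        "scsisigi": int(match.group("scsisigi"), 16),
    }
    # Register values follow the first line; a syslog prefix in between is harmless.
    for name, value in REG_RE.findall(text[match.end():]):
        info[name.lower()] = int(value, 16)
    return info


if __name__ == "__main__":
    sample = (
        "sd1(ahc0:1:0): SCB 0x2 - timed out while idle, "
        "LASTPHASE == 0x1, SCSISIGI == 0x0\n"
        "SEQADDR = 0x5 SCSISEQ = 0x12 SSTAT0 = 0x5 SSTAT1 = 0xa\n"
    )
    print(parse_timeout(sample))
```

Because the pattern is searched rather than anchored, the same function should also accept the syslog-prefixed form ("Nov 13 16:16:38 ida /kernel.ddb: cd0(ahc1:6:0): ..."). Collecting the target IDs this way makes it easier to see whether the timeouts cluster on the low-priority IDs of a bus.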

From owner-freebsd-scsi Fri Nov 14 00:51:16 1997
Message-ID: <19971114093938.KX10347@uriah.heep.sax.de>
Date: Fri, 14 Nov 1997 09:39:38 +0100
From: j@uriah.heep.sax.de (J Wunsch)
To: scsi@FreeBSD.ORG
Subject: Re: CD-ROM / AHA2940 problems
References: <19971113162914.44822@interface-business.de>
Reply-To: joerg_wunsch@uriah.heep.sax.de (Joerg Wunsch)
In-Reply-To: <xzpk9eclk66.fsf@bafur.ifi.uio.no>; from Dag-Erling Coidan Smørgrav on Nov 13, 1997 19:32:01 +0100

As Dag-Erling Coidan Smørgrav wrote:

> > (ahc1:6:0): "TOSHIBA CD-ROM XM-5701TA 0167" type 5 removable SCSI 2
> > cd0(ahc1:6:0): CD-ROM can't get the size
>
> AFAIK this message is normal; it means that there's no data CD in the
> drive.

Sure (although it tells the same even if a medium's in the drive, but
let's ignore this, it's not important).

> I have the exact same CD-ROM and the exact same SCSI adapter
> (actually, I have both an AHA2940 and an AHA2940UW) and have not
> experienced any of the problems you mention.
> I run FreeBSD 2.2.1R.

Interesting. Does this mean that even the audio volume controls
(e.g. workman's slider, or the `vol' command in cdcontrol(1)) do work
for you? This would mean my XM5701 has broken firmware.

I forgot to mention in the other posting that the Toshiba was on the
same bus with an Archive Python DAT drive which is known to be rather
fragile on the SCSI bus, at least with the ahc driver (but ISTR
reported problems with other drivers as well). After a couple of bus
device resets for the Toshiba, i was totally unable to work with the
tape drive anymore (the driver complained about reconnects with no SCB
and other bogus things), so i quickly discarded the Toshiba, and
swapped my old Plextor single-speed drive back in.

-- 
cheers, J"org

joerg_wunsch@uriah.heep.sax.de -- http://www.sax.de/~joerg/ -- NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

From owner-freebsd-scsi Sat Nov 15 21:12:50 1997
Message-Id: <199711160511.WAA24691@pluto.plutotech.com>
To: harold barker Hbarker
cc: hackers@FreeBSD.org, scsi@FreeBSD.org, aic7xxx@FreeBSD.org
Subject: Re: AHC / SCSI UPDATE
Date: Sat, 15 Nov 1997 22:10:52 -0700
From: "Justin T. Gibbs"

Sorry for not responding sooner, but I don't read this list regularly
anymore...
>If the person responsible for the code in question will email me, i will
>ship/open for login a machine that exhibits the problem.

That would be me, but I do believe that I have a system here that
exhibits the same problem you are having. When I have a fix for this
machine, I might take you up on your offer if it doesn't seem to work
with your equipment.

Here's a little info about what we (Ken Merry and myself) have
determined about the problem so far.

System:
	P6-233 256k cache
	2940UW (SCSI ID 7)
	1 X PLEXTOR CD-ROM PX-4XCS 1.04 (SCSI ID 4)
	2 X QUANTUM XP34550W LXY4 (SCSI IDs 0 and 1)

How to repeat:
	Run concurrent I/O to all 3 devices at the same time.

Symptom:
	After a varying period of time, disk 0 or 1 stops performing
	reselections for its outstanding I/O. This eventually results
	in a timeout, usually with the controller in an "idle" state.

Using a SCSI bus analyzer, we've looked at the transactions on the bus
that lead up to this state. No protocol errors were discovered. What
we did find, however, was a disturbing pattern of disconnections and
reconnections from the CDROM drive. The Plextor seems to perform
disconnections "often enough" to allow other targets to perform a
reselection, but unfortunately seems to partake in the next
arbitration phase if it has a task to continue. Since the arbitration
algorithm breaks "ties" based on the SCSI ID (from highest to lowest
priority 7->0, 15->8), this effectively gives the CD drive the bus for
as long as it wants it. Since the CD drive only handles a single task
at a time, one would think that there would be plenty of time that the
CD was idle and not wanting the bus. Unfortunately, it seems that the
SCSI system/aic7xxx driver is fast enough to process a command
completion for the CD drive, set up a new command to send, and
participate in the next arbitration phase.
As the controller has the highest priority ID on the bus, this again
"starves" the drives and opens the possibility for the CD drive to
start requesting the bus. In the end, what I believe is happening is
that the drive exhausts its "reconnect attempt" count, and decides not
to attempt to contact the initiator again. In the case of an Atlas
II, if the initiator selects the drive (say to send an abort or abort
tag message), the drive starts making reconnection attempts again and
the wedge is cleared. Other drives may not behave as nicely.

So, what can be done about this? I'm currently looking through the
SCSI II and III specs to determine what the standard has to say about
reconnect attempt failures and how to properly deal with them. It may
be that the SCSI layer/Adaptec driver can take actions that will work
on most devices. For a more immediate fix, I suggest experimenting
with:

1) Swapping the IDs on your devices so that hard drives have higher
   arbitration priority on the bus. The Adaptec BIOS will still find
   your disks in the proper order for you to boot even if you stick
   your CDROM or tape drive's IDs down before the hard disks.

2) Playing with the settings in the Disconnect-Reconnect mode page
   (page 0x02). Try setting the "Disconnect Time Limit" variable to
   something other than 0. This is the time, in units of 100
   microseconds, the device waits after disconnecting before
   participating in arbitration.

For many of you, I would expect solution 1 to work just fine. For
those of you with lots of disks on a single chain (even if you don't
have a tape or cdrom drive), you will probably have to try solution 2.
Remember that it's not really the type of device that matters, but the
possibility of starvation. If you have lots of concurrent I/O going
on to multiple disks on a single chain, you can still experience this
problem (Hi Satoshi!).

More information when it becomes available.

--
Justin
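
The starvation argument above can be made concrete with a very small toy model. The sketch below is in Python and only encodes the arbitration order stated in the message (7->0, then 15->8); everything else is an assumption made for illustration: time is cut into equal "rounds", the initiator is left out, and each device simply sits out a fixed number of rounds after winning, as a crude stand-in for its own work time or for a non-zero Disconnect Time Limit. The names PRIORITY_ORDER, arbitrate and simulate are hypothetical.

```python
# Arbitration order on a wide bus as described above: ID 7 wins over
# everything, then 6 ... 0, then 15 ... 8.
PRIORITY_ORDER = [7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8]


def arbitrate(contenders):
    """Return the SCSI ID that wins arbitration among the given IDs."""
    if not contenders:
        raise ValueError("no device is arbitrating")
    return min(contenders, key=PRIORITY_ORDER.index)


def simulate(delays, rounds):
    """Count arbitration wins per device in a toy model.

    delays maps SCSI ID -> number of rounds the device sits out after it
    wins (hypothetical stand-in for work time or Disconnect Time Limit).
    """
    wins = {dev: 0 for dev in delays}
    ready_at = {dev: 0 for dev in delays}
    for now in range(rounds):
        contenders = [dev for dev in delays if ready_at[dev] <= now]
        if not contenders:
            continue  # bus stays free this round
        winner = arbitrate(contenders)
        wins[winner] += 1
        ready_at[winner] = now + 1 + delays[winner]
    return wins


if __name__ == "__main__":
    # CD-ROM at ID 4 re-arbitrates immediately; disks at IDs 1 and 0.
    print(simulate({4: 0, 1: 2, 0: 2}, 9))
    # Suggestion 1: disks moved above the CD-ROM in priority.
    print(simulate({6: 2, 5: 2, 0: 0}, 9))
    # Suggestion 2: original IDs, but the CD-ROM now waits before re-arbitrating.
    print(simulate({4: 2, 1: 2, 0: 2}, 9))
```

In this model the first configuration lets the device at ID 4 win every round while IDs 1 and 0 never get the bus, and either of the two suggested changes spreads the wins evenly. A real bus is far messier (the initiator at ID 7 competes as well, and transfers are not of equal length), so this only shows why the ID order and the re-arbitration delay matter, not how a particular drive will behave.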