From owner-freebsd-scsi Tue Jul 21 07:06:43 1998
Return-Path:
Received: (from majordom@localhost)
    by hub.freebsd.org (8.8.8/8.8.8) id HAA25448
    for freebsd-scsi-outgoing; Tue, 21 Jul 1998 07:06:43 -0700 (PDT)
    (envelope-from owner-freebsd-scsi@FreeBSD.ORG)
Received: from Kitten.mcs.com (Kitten.mcs.com [192.160.127.90])
    by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id HAA25437
    for ; Tue, 21 Jul 1998 07:06:36 -0700 (PDT)
    (envelope-from karl@Mars.mcs.net)
Received: from Mars.mcs.net (karl@Mars.mcs.net [192.160.127.85])
    by Kitten.mcs.com (8.8.7/8.8.2) with ESMTP id JAA14736;
    Tue, 21 Jul 1998 09:06:18 -0500 (CDT)
Received: (from karl@localhost)
    by Mars.mcs.net (8.8.7/8.8.2) id JAA16374;
    Tue, 21 Jul 1998 09:06:17 -0500 (CDT)
Message-ID: <19980721090617.59118@mcs.net>
Date: Tue, 21 Jul 1998 09:06:17 -0500
From: Karl Denninger
To: Willem Jan Withagen
Cc: scsi@FreeBSD.ORG
Subject: Re: AHC errors under load.
References: <199807202153.VAA16546@digi.digiware.nl>
    <19980720175223.53615@mcs.net> <199807211148.NAA05663@surf.IAE.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Mailer: Mutt 0.84
In-Reply-To: <199807211148.NAA05663@surf.IAE.nl>; from Willem Jan Withagen
    on Tue, Jul 21, 1998 at 01:48:08PM +0200
Sender: owner-freebsd-scsi@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

On Tue, Jul 21, 1998 at 01:48:08PM +0200, Willem Jan Withagen wrote:
> In article <19980720175223.53615@mcs.net> you write:
> >This is VERY bad.  You now have trash in at least one file (if you're
> >lucky), or a wrecked directory structure (if you're unlucky).
> >
> >Find the cause.
> >
> >Like immediately.
> >
> >Termination problems can cause this problem, and it should be treated with
> >EXTREME severity.
>
> Well it's not that serious, since I was only running bonnie and iozone
> on that volume, as part of a stress-test.
> I guess it failed.

Uh, yeah.

> You say termination? Could it be funcky firmware?

It could be.
It could also be a bad (out-of-spec) cable, a bad terminator, or a bad
device (i.e., the data buffers and/or drivers on the device itself or the
adapter are defective).

> But I'm still wondering what the messages mean??
> Is it because a device did not respond? Did the controller forget about
> things? .....
>
> --WjW

This usually means that the drive did not strobe the data off the bus
within the time window allowed, and as such the adapter "overran".

SCSI is a clocked-bus architecture operating on rather strict timing
requirements.  If you do not acknowledge the data that is placed on the
bus (indicating that you latched it, and the adapter is now free to
release the data lines) this can happen.

Now as it turns out this may happen because either (1) the "data is valid"
signal never was recognized by the other device, (2) the "data is ACKd"
signal never got back to the adapter, (3) the cable is too long, and as a
result the signal didn't propagate to the other end in time, or (4) a
reflection in the cable, caused by improper impedance, bad termination, or
an excessive "stub" length, was interpreted as one of the above signals
when it really hadn't been presented.

What happens in that case is that the adapter and/or device gets confused,
and you get this kind of error.

The BITCH is that on a non-CAM system, the number of aborted transactions
is sometimes unknown.  This means that you could miss one or more writes
to the disk.  I've seen *major* disk corruption come out of this kind of
error during testing.

CAM fixes a lot of the retry semantics, but I don't think it's possible to
be *certain* that all the data got where it is supposed to be, and got
there intact.

SCSI parity is supposed to catch data errors (if you EVER get a parity
error on the SCSI bus you have real trouble; find the reason and fix it
immediately!)
but parity is by definition a single-bit-error detector: it catches any
odd number of flipped bits and misses any even number, so it will fail
roughly half the time to detect a double-bit (or more) error.  If you're
seeing these errors there is a possibility that you are also taking
undetected data errors and the data on your disk is trash!

There is also a note in one of the recent kernel source files that a
peculiarity in Rev B chipsets of certain Adaptec products can be tickled
by code of specific revisions, and that this kind of error could manifest
as a result (it is claimed to be fixed in the most recent code branch).

I've seen this error while qualifying disk subsystems before, and
generally it means the cables, terminators, and/or drive electronics are
not up to snuff.

One potential fix (which you probably won't like) is to cut back the
transfer rate in the adapter (you can do this from the "setup" screen,
which you can enter during the boot sequence; hit CTRL-A when prompted).

That is really a hack, as what you're doing by slowing down the bus is
increasing the timing window during which a DAV or DACK signal is
considered to be valid.  If you're trying to run with a cable which is too
long but everything else is ok, however, this will usually work - at the
cost of some performance.

--
-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly / All Lines K56Flex/DOV
                             | NEW! Corporate ISDN Prices dropped by up to 50%!
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

> >> I got the message below, some more junk, and then the AHC resets the bus.
> >> But the system is still up and running.
> >> First time I eve see this.
> >> Can anybody explain what happened?
>
> >>
> >> hobby kernel log messages:
> >> > sd2(ahc0:9:0): data overrun of 16777215 bytes detected in Data-Out phase. Tag == 0x9. Forcing a retry.
> >> > sd2(ahc0:9:0): Have seen Data Phase.
> >> >     Length = 65536. NumSGs = 15.
> >> > sg[0] - Addr 0x3326000 : Length 4096
> >> > sg[1] - Addr 0x627000 : Length 8192
> >> > sg[2] - Addr 0x3ea9000 : Length 4096
> >> > sg[3] - Addr 0x3baa000 : Length 4096
> >> > sg[4] - Addr 0x206b000 : Length 4096
> >> > sg[5] - Addr 0x242c000 : Length 4096
> >> > sg[6] - Addr 0x396d000 : Length 4096
>
>
> --
> Internet Access Eindhoven BV., voice: +31-40-2 393 393, data: +31-40-2 606 606
> P.O. 928, 5600 AX Eindhoven, The Netherlands
> Full Internet connectivity for only fl 12.95 a month.
> Call now, and login as 'new'.
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-scsi" in the body of the message
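
A note on the parity remark in the reply above: a single parity bit flags
any odd number of flipped bits and misses any even number, which is where
the "fails about half the time" figure for multi-bit errors comes from.
The following is only an illustrative sketch, not code from any driver.
It assumes a narrow bus with 8 data bits plus one odd-parity bit, and all
function names are made up for the example.

```python
from itertools import combinations

WIDTH = 9  # assumption: 8 data bits plus 1 parity bit (narrow bus)


def ones(value: int) -> int:
    """Count the 1 bits in value."""
    return bin(value).count("1")


def with_odd_parity(byte: int) -> int:
    """Append a parity bit (bit 8) so the 9-bit word has an odd number of 1s."""
    parity = 1 - (ones(byte) % 2)
    return byte | (parity << 8)


def parity_ok(word: int) -> bool:
    """Receiver-side check: a valid word has an odd number of 1 bits."""
    return ones(word) % 2 == 1


def missed_fraction(byte: int, flips: int) -> float:
    """Fraction of all error patterns with exactly `flips` flipped bits
    that still pass the parity check (i.e. go undetected)."""
    word = with_odd_parity(byte)
    patterns = list(combinations(range(WIDTH), flips))
    missed = sum(
        1 for bits in patterns
        if parity_ok(word ^ sum(1 << b for b in bits))
    )
    return missed / len(patterns)


def missed_fraction_multibit(byte: int) -> float:
    """Fraction of all error patterns with two or more flipped bits
    that pass the parity check."""
    word = with_odd_parity(byte)
    masks = [m for m in range(1, 1 << WIDTH) if ones(m) >= 2]
    missed = sum(1 for m in masks if parity_ok(word ^ m))
    return missed / len(masks)


if __name__ == "__main__":
    for flips in range(1, 5):
        print(flips, "flipped bits, undetected fraction:",
              missed_fraction(0xA5, flips))
    print("two or more flipped bits:", missed_fraction_multibit(0xA5))
```

Under these assumptions every 1-bit and 3-bit error is caught, every 2-bit
and 4-bit error slips through, and over all patterns of two or more
flipped bits the undetected share is 255 out of 502, a little over half.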