FreeBSD Mail Archives

Date:      Tue, 21 Jul 1998 09:06:17 -0500
From:      Karl Denninger  <karl@mcs.net>
To:        Willem Jan Withagen <wjw@surf.IAE.nl>
Cc:        scsi@FreeBSD.ORG
Subject:   Re: AHC errors under load.
Message-ID:  <19980721090617.59118@mcs.net>
In-Reply-To: <199807211148.NAA05663@surf.IAE.nl>; from Willem Jan Withagen on Tue, Jul 21, 1998 at 01:48:08PM +0200
References:  <199807202153.VAA16546@digi.digiware.nl> <19980720175223.53615@mcs.net> <199807211148.NAA05663@surf.IAE.nl>

On Tue, Jul 21, 1998 at 01:48:08PM +0200, Willem Jan Withagen wrote:
> In article <19980720175223.53615@mcs.net> you write:
> >This is VERY bad.  You now have trash in at least one file (if you're
> >lucky), or a wrecked directory structure (if you're unlucky).
> >
> >Find the cause.  
> >
> >Like immediately.  
> >
> >Termination problems can cause this problem, and it should be treated with
> >EXTREME severity.
> 
> Well it's not that serious, since I was only running bonnie and iozone
> on that volume, as part of a stress-test.
> I guess it failed.

Uh, yeah.

> You say termination? Could it be funky firmware?

It could be.  It could also be a bad (out of spec) cable, a bad terminator,
or a bad device (ie: the data buffers and/or drivers on the device itself or
the adapter are defective).

> But I'm still wondering what the messages mean??
> Is it because a device did not respond? Did the controller forget about
> things? .....
> 
> --WjW

This usually means that the drive did not strobe the data off the bus within
the time window allowed, and as such the adapter "overran".  

SCSI is a clocked-bus architecture operating on rather strict timing
requirements.  If you do not acknowledge the data that is placed on the 
bus (indicating that you latched it, and the adapter is now free to 
release the data lines) this can happen.

Now as it turns out this may happen because either (1) the "data is valid"
signal never was recognized by the other device, (2) the "data is ACKd"
signal never got back to the adapter, (3) the cable is too long, and as a
result the signal didn't propagate to the other end in time, or (4) due 
to a reflection in the cable caused by either improper impedance, bad 
termination or an excessive "stub" length one or more reflected signals 
was interpreted as one of the above signals when it really hadn't been 
presented.

What happens in that case is that the adapter and/or device gets confused,
and you get this kind of error.
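
The four cases above can be modelled crudely as follows.  This is a toy
sketch, not real SCSI signalling: the function name, its parameters and the
result strings are all invented for illustration, and a real bus has no
such tidy boolean flags.

```python
def transfer_outcome(valid_seen: bool, ack_returned: bool,
                     round_trip_ns: float, window_ns: float,
                     reflection_seen: bool = False) -> str:
    """Classify one byte transfer in a toy data-valid/acknowledge handshake.

    All parameters are hypothetical stand-ins for electrical behaviour.
    """
    if reflection_seen:
        # Case (4): a reflected edge is mistaken for a real signal.
        return "confused: phantom signal"
    if not valid_seen:
        # Case (1): the "data is valid" signal was never recognized.
        return "overrun: data-valid never recognized"
    if not ack_returned:
        # Case (2): the acknowledge never got back to the adapter.
        return "overrun: acknowledge never got back"
    if round_trip_ns > window_ns:
        # Case (3): the cable is too long for the timing window.
        return "overrun: signal arrived too late"
    return "ok"


print(transfer_outcome(True, True, round_trip_ns=40.0, window_ns=100.0))
print(transfer_outcome(True, True, round_trip_ns=120.0, window_ns=100.0))
```

The point of the model is only that three quite different electrical faults
all look the same from the adapter's side: the handshake did not complete
inside the window.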

The BITCH is that on a non-CAM system, the number of aborted transactions
is sometimes unknown.  This means that you could miss one or more writes
to the disk.  I've seen *major* disk corruption come out of this kind
of error during testing.

CAM fixes a lot of the retry semantics, but I don't think it's possible to 
be *certain* that all the data got where it is supposed to be, and got there 
intact.  SCSI parity is supposed to catch data errors (if you EVER get a 
parity error on the SCSI bus you have real trouble; find the reason and fix 
it immediately!) but parity is by definition a single-bit-error detector: 
it catches any odd number of flipped bits in a byte and misses any even 
number, so a double-bit error always passes and random multi-bit damage 
slips through about half the time.  If
you're seeing these errors there is a possibility that you are also
taking undetected data errors and the data on your disk is trash!
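
The parity limitation is easy to demonstrate.  The following is a minimal
sketch, assuming odd parity computed over a single data byte (as on a narrow
SCSI bus); the helper names and the sample byte are made up for the example.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the parity bit that makes the total number of 1 bits odd."""
    ones = bin(byte & 0xFF).count("1")
    return 0 if ones % 2 == 1 else 1


def parity_ok(byte: int, parity: int) -> bool:
    """True if the received byte plus its parity bit has an odd count of 1s."""
    return (bin(byte & 0xFF).count("1") + parity) % 2 == 1


sent = 0b10110010                # arbitrary sample byte
parity = odd_parity_bit(sent)

one_flip = sent ^ 0b00000100     # one bit corrupted in transit
two_flips = sent ^ 0b00010100    # two bits corrupted in transit

single_detected = not parity_ok(one_flip, parity)
double_detected = not parity_ok(two_flips, parity)

print("single-bit error detected:", single_detected)
print("double-bit error detected:", double_detected)
```

The single-bit error changes the count of 1 bits from odd to even and is
caught; the double-bit error leaves the count odd, so the corrupted byte is
accepted as good.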

There is also a note in one of the recent kernel source files that a 
peculiarity in Rev B chipsets of certain Adaptec products has some kind 
of problem which can be tickled by code of specific revisions, and that 
this kind of problem could manifest as a result (it is claimed to be fixed 
in the most recent code branch).

I've seen this error while qualifying disk subsystems before, and generally
it means the cables, terminators, and/or drive electronics are not up to
snuff.  One potential fix (which you probably won't like) is to cut back 
the transfer rate in the Adapter (you can do this from the "setup" screen 
which is enterable during the boot sequence; hit CTRL-A when prompted).

That is really a hack, as what you're doing by slowing down the bus is 
increasing the timing window during which a DAV or DACK signal is
considered to be valid.  If you're trying to run with a cable which is 
too long but everything else is ok, however, this will usually work - 
at the cost of some performance.
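
A rough feel for how much slack a lower rate buys can be had from the
arithmetic below.  It is a back-of-the-envelope sketch only: the transfer
period is simply 1/rate, while the 5 ns per metre propagation figure and the
3 m cable length are assumptions chosen for illustration, not measurements
of any particular cable.

```python
ASSUMED_NS_PER_METRE = 5.0   # assumption: typical order of magnitude for cable


def period_ns(rate_mhz: float) -> float:
    """Length of one transfer period in nanoseconds at the given rate."""
    return 1000.0 / rate_mhz


def delay_fraction(rate_mhz: float, cable_m: float) -> float:
    """Fraction of one transfer period used up by one-way cable delay."""
    return cable_m * ASSUMED_NS_PER_METRE / period_ns(rate_mhz)


for rate in (20.0, 10.0, 5.0):
    print(f"{rate:4.1f} MHz: period {period_ns(rate):5.1f} ns, "
          f"3 m cable uses {delay_fraction(rate, 3.0):.0%} of it")
```

Halving the rate doubles the period, so the same cable delay eats half as
much of the window; that is all the "fix" amounts to.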

--
-- 
Karl Denninger (karl@MCS.Net)| MCSNet - Serving Chicagoland and Wisconsin
http://www.mcs.net/          | T1's from $600 monthly / All Lines K56Flex/DOV
			     | NEW! Corporate ISDN Prices dropped by up to 50%!
Voice: [+1 312 803-MCS1 x219]| EXCLUSIVE NEW FEATURE ON ALL PERSONAL ACCOUNTS
Fax:   [+1 312 803-4929]     | *SPAMBLOCK* Technology now included at no cost

> >> I got the message below, some more junk, and then the AHC resets the bus.
> >> But the system is still up and running.
> >> First time I've ever seen this.
> >> Can anybody explain what happened?
> 
> >> 
> >> hobby kernel log messages:
> >> > sd2(ahc0:9:0): data overrun of 16777215 bytes detected in Data-Out phase.  Tag == 0x9.  Forcing a retry.
> >> > sd2(ahc0:9:0): Have seen Data Phase.  Length = 65536.  NumSGs = 15.
> >> > sg[0] - Addr 0x3326000 : Length 4096
> >> > sg[1] - Addr 0x627000 : Length 8192
> >> > sg[2] - Addr 0x3ea9000 : Length 4096
> >> > sg[3] - Addr 0x3baa000 : Length 4096
> >> > sg[4] - Addr 0x206b000 : Length 4096
> >> > sg[5] - Addr 0x242c000 : Length 4096
> >> > sg[6] - Addr 0x396d000 : Length 4096
> 
> 
> -- 
> Internet Access Eindhoven BV.,  voice: +31-40-2 393 393, data: +31-40-2 606 606
> P.O. 928, 5600 AX Eindhoven, The Netherlands
> Full Internet connectivity for only fl 12.95 a month.
> Call now, and login as 'new'.
> 
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-scsi" in the body of the message


Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?19980721090617.59118>
