From owner-freebsd-hackers  Sun Sep 23 14:13: 9 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from freebsd.dk (fw-rl0.freebsd.dk [212.242.86.114])
	by hub.freebsd.org (Postfix) with ESMTP id 0B71137B435
	for <freebsd-hackers@FreeBSD.ORG>; Sun, 23 Sep 2001 14:13:00 -0700 (PDT)
Received: (from sos@localhost)
	by freebsd.dk (8.11.3/8.11.3) id f8NLCsE42136;
	Sun, 23 Sep 2001 23:12:54 +0200 (CEST)
	(envelope-from sos)
From: Søren Schmidt <sos@freebsd.dk>
Message-Id: <200109232112.f8NLCsE42136@freebsd.dk>
Subject: Re: Problems with many ATA drives
In-Reply-To: <200109231643.JAA09454@hokkshideh.jetcafe.org> "from Dave Hayes
 at Sep 23, 2001 09:43:25 am"
To: Dave Hayes <dave@jetcafe.org>
Date: Sun, 23 Sep 2001 23:12:54 +0200 (CEST)
Cc: freebsd-hackers@FreeBSD.ORG
Reply-To: sos@freebsd.dk
X-Mailer: ELM [version 2.4ME+ PL88 (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=ISO-8859-1
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG

It seems Dave Hayes wrote:
> 
> ad1: READ command timeout tag=0 serv=0 - resetting
> ata0: resetting devices .. done
> ad1a: hard error reading fsbn 5068879 (ad1 bn 5068879; cn 315 tn 133 sn 
> 25)ad1a: hard error reading fsbn 5068879 (ad1 bn 5068879; cn 315 tn 133 sn 25) 
> status=59 error=40
> 
> I notice 3 out of 11 drives produce this error, so far one on each
> controller (ruling out a specific controller issue). I didn't want to
> just assume the failure rate of 80GB IDE drives is that large, so
> I'm asking this list for its opinion:
> 
> a) Is this a bug or consequence of software drivers? (see
> bug kern/17592)
> 
> b) Or is it just that IDE drives are cheap and fail this much?
> 
> Relevant data from dmesg:
> 
> atapci0: <Promise ATA100 controller> port 0xb000-0xb00f,0xb400-0xb403,0xb800-0x
> b807,0xd000-0xd003,0xd400-0xd407 mem 0xf5800000-0xf5803fff irq 6 at device 
> 10.0 on pci2
> ata2: at 0xd400 on atapci0
> ata3: at 0xb800 on atapci0
> atapci1: <Promise ATA100 controller> port 0x9400-0x940f,0x9800-0x9803,0xa000-0x
> a007,0xa400-0xa403,0xa800-0xa807 mem 0xf5000000-0xf5003fff irq 9 at device 
> 11.0 on pci2
> ata4: at 0xa800 on atapci1
> ata5: at 0xa000 on atapci1
> ...
> atapci2: <Intel ICH2 ATA100 controller> port 0x8800-0x880f at device 31.1 on 
> pci0
> ata0: at 0x1f0 irq 14 on atapci2
> ata1: at 0x170 irq 15 on atapci2
> ...
> ad0: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata0-master UDMA100
> ad1: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata0-slave UDMA100
> ad2: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata1-master UDMA100
> ad3: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata1-slave UDMA100
> ad4: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata2-master WDMA2
> ad5: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata2-slave WDMA2
> ad6: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata3-master WDMA2
> ad7: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata3-slave WDMA2
> ad8: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata4-master WDMA2
> ad9: 78167MB <Maxtor 4W080H6> [158816/16/63] at ata4-slave WDMA2
> 
> Yes, we know that the "WDMA2" is happening; this state proved to be
> independent of a drive failing. It has to do with 10 drives in a tower 
> and cable lengths... =(

Hmm, first off, the error above looks very much like a genuine media
error on the disks. Is the bad spot always the same, or is it random?
Also, do the 3 bad ones produce the error regardless of which
controller they are put on? I assume that it's always the same 3
drives that are failing, right?
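
For reference, the "status=59 error=40" pair in the quoted log can be
decoded bit by bit. The sketch below is illustrative only: it assumes the
driver prints both registers in hexadecimal, and it uses the conventional
ATA status/error bit names (some of which were renamed between ATA
revisions). The table and function names are made up for this example.

```python
# Conventional ATA status register bits (names vary slightly by ATA revision).
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error; see the error register
}

# Conventional ATA error register bits.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error (UDMA transfers)
    0x40: "UNC",   # uncorrectable data error
    0x20: "MC",    # media changed
    0x10: "IDNF",  # sector ID not found
    0x08: "MCR",   # media change request
    0x04: "ABRT",  # command aborted
    0x02: "NM",    # no media
    0x01: "AMNF",  # address mark not found
}


def decode(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in names.items() if value & mask]


# Assumption: the log values are hex strings, as in "status=59 error=40".
status = decode(int("59", 16), STATUS_BITS)
error = decode(int("40", 16), ERROR_BITS)
print("status:", status)
print("error: ", error)
```

Under these assumptions the error register has only UNC set and ICRC is
clear, which fits a media error rather than a transfer error on the cable.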

Oh, and you should take cable length seriously. Remember that you only
get ICRC errors (which the ATA driver retries) in UDMA33 and above;
at WDMA2 speed there is *NO* CRC check at all (the HW doesn't
support it), so you won't know when your data has been corrupted :)
So thinking that you solved the problem by going to WDMA2 mode is
extremely dangerous: you are just hiding the problem, as data
corruption will very likely still happen when you use off-spec
cabling.
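
To make the difference concrete, here is a toy model of a transfer with
and without a CRC check. It is a sketch under stated assumptions, not the
ATA driver's logic and not a bit-exact model of the UDMA wire protocol:
it uses Python's binascii.crc_hqx (a CRC-16 with the CCITT polynomial)
as a stand-in, the seed value is only illustrative, and flip_bit() and
receive() are hypothetical helpers.

```python
import binascii


def crc16(data: bytes) -> int:
    # Stand-in CRC-16 (CCITT polynomial); the seed is illustrative only.
    return binascii.crc_hqx(data, 0x4ABA)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate cable noise by inverting one bit of the payload."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def receive(sent: bytes, received: bytes, crc_checked: bool) -> str:
    """Return what the receiving side would do with the block."""
    if crc_checked and crc16(sent) != crc16(received):
        return "retry"      # UDMA-like: mismatch detected, transfer retried
    return "accepted"       # WDMA2-like: nothing to compare, data taken as-is


sector = bytes(range(256)) * 2        # one 512-byte sector of test data
damaged = flip_bit(sector, 1234)      # a single bit corrupted in transit

print("with CRC:   ", receive(sector, damaged, crc_checked=True))
print("without CRC:", receive(sector, damaged, crc_checked=False))
```

In the checked case the damaged block is caught and retried; in the
unchecked case the same damaged block is accepted without any error being
reported.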

-Søren
