From: Søren Schmidt
Message-Id: <200109232112.f8NLCsE42136@freebsd.dk>
Subject: Re: Problems with many ATA drives
In-Reply-To: <200109231643.JAA09454@hokkshideh.jetcafe.org> "from Dave Hayes at Sep 23, 2001 09:43:25 am"
To: Dave Hayes
Date: Sun, 23 Sep 2001 23:12:54 +0200 (CEST)
Cc: freebsd-hackers@FreeBSD.ORG
Reply-To: sos@freebsd.dk

It seems Dave Hayes wrote:
>
> ad1: READ command timeout tag=0 serv=0 - resetting
> ata0: resetting devices .. done
> ad1a: hard error reading fsbn 5068879 (ad1 bn 5068879; cn 315 tn 133 sn
> 25)ad1a: hard error reading fsbn 5068879 (ad1 bn 5068879; cn 315 tn 133 sn 25)
> status=59 error=40
>
> I notice 3 out of 11 drives produce this error, so far one on each
> controller (ruling out a specific controller issue). I didn't want to
> just assume the failure rate of 80GB IDE drives is that large, so
> I'm asking this list for its opinion:
>
> a) Is this a bug or consequence of software drivers? (see
>    bug kern/17592)
>
> b) Or is it just that IDE drives are cheap and fail this much?
>
> Relevant data from dmesg:
>
> atapci0: port 0xb000-0xb00f,0xb400-0xb403,0xb800-0xb807,0xd000-0xd003,0xd400-0xd407 mem 0xf5800000-0xf5803fff irq 6 at device 10.0 on pci2
> ata2: at 0xd400 on atapci0
> ata3: at 0xb800 on atapci0
> atapci1: port 0x9400-0x940f,0x9800-0x9803,0xa000-0xa007,0xa400-0xa403,0xa800-0xa807 mem 0xf5000000-0xf5003fff irq 9 at device 11.0 on pci2
> ata4: at 0xa800 on atapci1
> ata5: at 0xa000 on atapci1
> ...
> atapci2: port 0x8800-0x880f at device 31.1 on pci0
> ata0: at 0x1f0 irq 14 on atapci2
> ata1: at 0x170 irq 15 on atapci2
> ...
> ad0: 78167MB [158816/16/63] at ata0-master UDMA100
> ad1: 78167MB [158816/16/63] at ata0-slave UDMA100
> ad2: 78167MB [158816/16/63] at ata1-master UDMA100
> ad3: 78167MB [158816/16/63] at ata1-slave UDMA100
> ad4: 78167MB [158816/16/63] at ata2-master WDMA2
> ad5: 78167MB [158816/16/63] at ata2-slave WDMA2
> ad6: 78167MB [158816/16/63] at ata3-master WDMA2
> ad7: 78167MB [158816/16/63] at ata3-slave WDMA2
> ad8: 78167MB [158816/16/63] at ata4-master WDMA2
> ad9: 78167MB [158816/16/63] at ata4-slave WDMA2
>
> Yes, we know that the "WDMA2" is happening, this state proved to be
> independent of a drive failing. It has to do with 10 drives in a tower
> and cable lengths... =(

Hmm, first off, the error above looks very much like a genuine media
error on the disks. Is the bad spot always the same, or random?
Also, do the 3 bad ones produce the error regardless of which
controller they are put on? I assume that it's always the same 3 drives
that are failing, right?
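
For anyone reading the log above, the "status=59 error=40" pair can be
decoded bit by bit. This is only a minimal sketch in Python. It assumes
the two values are printed in hexadecimal and that the bits follow the
standard ATA status and error register layout; the table names and the
decode() helper are made up for illustration and are not taken from the
FreeBSD ATA driver.

```python
# Assumed bit names, following the standard ATA register layout.
# These tables are illustrative; they are not copied from the FreeBSD driver.
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error, see the error register
}

ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error (Ultra DMA transfers only)
    0x40: "UNC",   # uncorrectable data error
    0x20: "MC",    # media changed
    0x10: "IDNF",  # sector ID not found
    0x08: "MCR",   # media change requested
    0x04: "ABRT",  # command aborted
    0x02: "NM",    # no media
    0x01: "AMNF",  # address mark not found (obsolete)
}


def decode(value, table):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    # Values from the log, assumed to be hexadecimal.
    print("status=59:", decode(0x59, STATUS_BITS))
    print("error=40: ", decode(0x40, ERROR_BITS))
```

Under those assumptions, error=40 decodes to UNC (uncorrectable data)
with the ICRC bit clear, which fits a media error rather than a
transfer error on the cable.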

Oh, and you should take cable length seriously. Remember that you only
get ICRC errors (which the ATA driver retries) on UDMA33 and above; at
WDMA2 speed there is *NO* CRC check at all (the HW doesn't support
that), so you won't know when your data has been corrupted :)
So thinking that you solved the problem by going to WDMA2 mode is
extremely dangerous: you are just hiding the problem, as data
corruption will very likely still happen when you use off-spec cabling.

-Søren

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message