Date: Wed, 18 Apr 2007 10:51:56 -0500
From: "Rick C. Petty"
To: Ståle Kristoffersen
Cc: fs@freebsd.org
Subject: Re: ZFS + replacing failing hard-drive.

On Wed, Apr 18, 2007 at 04:41:03PM +0200, Ståle Kristoffersen wrote:
> >
> > I don't think you do. This appears to be a bug in the ata driver
> > which ZFS is particularly good at triggering.
>
> I first noticed the problems running UFS on the first partition, and I have
> tried the drive on all of the following controllers:
> atapci0: port 0xcf00-0xcf7f mem 0xfddff000-0xfddff07f,0xfddf8000-0xfddfbfff irq 19 at device 0.0 on pci4
> atapci1: port 0xaf00-0xaf07,0xae00-0xae03,0xad00-0xad07,0xac00-0xac03,0xab00-0xab0f mem 0xfd9fe000-0xfd9fffff irq 17 at device 0.0 on pci6
> atapci2: port 0xfa00-0xfa07,0xf900-0xf903,0xf800-0xf807,0xf700-0xf703,0xf600-0xf60f,0xf500-0xf50f irq 19 at device 31.2 on pci0
> atapci3: port 0xf300-0xf307,0xf200-0xf203,0xf100-0xf107,0xf000-0xf003,0xef00-0xef0f,0xee00-0xee0f irq 19 at device 31.5 on pci0
>
> Same problem on all. And to support my theory that the disk was bad, the new
> disk does not behave badly, even after a zpool scrub.

That doesn't prove the disk was/is "bad". Here I'm using the word "bad" to
mean the disk has had at least one non-recoverable failure (i.e. a bad area
on the platter surface was discovered and the drive was unable to remap it).
Given how new SATA300 drives are, it's unlikely (although possible) that the
drive is "bad"/defective.

> > BTW, the message you show is harmless: see where it says "retrying"?
> > No need to worry until it says "FAILURE - WRITE_DMA timed out".
>
> Just had a quick peek in the logs and did not find any of them the last
> time, but I do get them:
> Apr 13 21:17:14 fs kernel: ad14: FAILURE - WRITE_DMA48 timed out LBA=719378349
> Apr 13 21:22:23 fs kernel: ad14: FAILURE - WRITE_DMA48 status=51 error=10 LBA=719341415

I've occasionally seen a drive dropped without any DMA timeout being reported
first, and DMA timeouts often *don't* drop the drive at all. The latter case
is fine, because I can stop the disk activity and tell gvinum to start the
disk again, but the former confounds me: it's never been reproducible, so I
couldn't track it down. It could also just be a syslog issue.
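The distinction drawn above, harmless "retrying" messages versus "FAILURE" messages, is easy to apply mechanically when sifting through kernel logs. Below is a minimal Python sketch, assuming log lines shaped like the excerpt quoted above. `classify_ata_log` is a hypothetical helper, not a FreeBSD tool, and the exact wording of retry messages is an assumption (only the keyword "retrying" comes from this thread). It cannot catch the case described above where a drive is dropped with no timeout logged at all.

```python
import re
from collections import defaultdict

# Hypothetical helper, not part of FreeBSD. It matches ata kernel messages
# shaped like the excerpt above, e.g.
#   "ad14: FAILURE - WRITE_DMA48 timed out LBA=719378349"
ATA_MSG = re.compile(r"\b(?P<dev>ad\d+): (?P<kind>[A-Z]+) - (?P<cmd>\w+)(?P<rest>.*)$")
LBA_RE = re.compile(r"LBA=(\d+)")


def classify_ata_log(lines):
    """Return {disk: {"failure": [...], "retry": [...]}} of (command, LBA) pairs.

    Lines whose keyword is FAILURE go in "failure"; other matching lines
    that mention "retrying" go in "retry". Everything else is ignored.
    """
    report = defaultdict(lambda: {"failure": [], "retry": []})
    for line in lines:
        m = ATA_MSG.search(line)
        if m is None:
            continue  # not an ata driver message this pattern recognizes
        lba = LBA_RE.search(m.group("rest"))
        entry = (m.group("cmd"), int(lba.group(1)) if lba else None)
        if m.group("kind") == "FAILURE":
            report[m.group("dev")]["failure"].append(entry)
        elif "retrying" in m.group("rest"):
            report[m.group("dev")]["retry"].append(entry)
    return dict(report)


if __name__ == "__main__":
    sample = [
        # Made-up retry line; its wording is assumed, not taken from a real log.
        "ad14: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=719378349",
        "Apr 13 21:17:14 fs kernel: ad14: FAILURE - WRITE_DMA48 timed out LBA=719378349",
        "Apr 13 21:22:23 fs kernel: ad14: FAILURE - WRITE_DMA48 status=51 error=10 LBA=719341415",
    ]
    for dev, hits in classify_ata_log(sample).items():
        print(f"{dev}: {len(hits['failure'])} failure(s), {len(hits['retry'])} retry message(s)")
    # prints "ad14: 2 failure(s), 1 retry message(s)"
```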
> Another issue is that even though all the drives support SATA300, and all
> the controllers do as well, they still come up as SATA150 (except one).
> (And yeah, I have removed that jumper.)
> ad8: 305245MB at ata4-master SATA300
> ad10: 381554MB at ata5-master SATA150
> ad14: 305245MB at ata7-master SATA150
> ad15: 305245MB at ata7-slave SATA150
> ad16: 305245MB at ata8-master SATA150

I've noticed this behavior on certain controllers (Intel in particular).
Which drives correspond to which controller cards?

-- 
Rick C. Petty