Date: Fri, 25 Jan 2008 08:29:40 -0800
From: Jeremy Chadwick
To: Joe Peterson
Cc: freebsd-stable@freebsd.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
Message-ID: <20080125162940.GA38494@eos.sc1.parodius.com>
In-Reply-To: <479A0731.6020405@skyrush.com>

On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
> I've seen mention of this kind of issue before, but I never saw a
> solution, except that someone reported that a certain version of 6.x
> seemed to make it go away - accounts of this problem are a bit vague. I
> am running 7.0-RC1, and I am seeing the errors periodically, and I am
> wondering if this is a known issue. Note that smartctl does not report
> errors logged and gives a "PASSED" to the drive. I am running at
> UDMA100 ATA. Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem.
It's very obvious when it's just one disk reporting DMA errors. You use
ZFS, so chances are you have more than one disk in a pool/volume --
there's no indication ad1, ad4, ad6, etc. are failing, so this seems to
indicate something specific to ad0.

Manufacturers pick very passive (non-aggressive) thresholds for error
conditions on disks, so disks which are failing very commonly show
"PASSED" during SMART analysis. To make matters worse, most users I
know read SMART stats incorrectly (they're easy to misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info
* Relevant dmesg output that indicates what kind of ATA controller these
  disks are attached to. Start with output from 'ad0:' and work
  backwards. For example, ad0 on this machine is using an Intel ICH6
  controller:

  atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
  ata0: on atapci0
  ad0: 238475MB at ata0-master SATA150

Other stuff:

SMART stats which are labelled "Offline" are only updated when a short
or long offline test is performed. Have you tried using
"smartctl -t short /dev/ad0" and "smartctl -t long /dev/ad0" to see if
any of the raw values in the far right column increment?

Have you tried using "zpool scrub" on the ZFS pool, then "zpool status"
to see if the READ/WRITE/CHKSUM counters increment or if the "scrub"
line states there were errors?

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing the board with the same
  model, other times going with a completely different vendor; this
  implies weird implementation issues or BIOS problems)
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |