From owner-freebsd-stable@FreeBSD.ORG  Fri Jan 25 16:29:41 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0CA9F16A420
	for <freebsd-stable@freebsd.org>; Fri, 25 Jan 2008 16:29:41 +0000 (UTC)
	(envelope-from jdc@parodius.com)
Received: from mx01.sc1.parodius.com (mx01.sc1.parodius.com [72.20.106.3])
	by mx1.freebsd.org (Postfix) with ESMTP id 0B2EC13C4F3
	for <freebsd-stable@freebsd.org>; Fri, 25 Jan 2008 16:29:40 +0000 (UTC)
	(envelope-from jdc@parodius.com)
Received: by mx01.sc1.parodius.com (Postfix, from userid 1000)
	id D99F11CC079; Fri, 25 Jan 2008 08:29:40 -0800 (PST)
Date: Fri, 25 Jan 2008 08:29:40 -0800
From: Jeremy Chadwick <koitsu@FreeBSD.org>
To: Joe Peterson <joe@skyrush.com>
Message-ID: <20080125162940.GA38494@eos.sc1.parodius.com>
References: <479A0731.6020405@skyrush.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <479A0731.6020405@skyrush.com>
User-Agent: Mutt/1.5.16 (2007-06-09)
Cc: freebsd-stable@freebsd.org
Subject: Re: "ad0: TIMEOUT - WRITE_DMA" type errors with 7.0-RC1
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 25 Jan 2008 16:29:41 -0000

On Fri, Jan 25, 2008 at 08:58:41AM -0700, Joe Peterson wrote:
> I've seen mention of this kind of issue before, but I never saw a
> solution, except that someone reported that a certain version of 6.x
> seemed to make it go away - accounts of this problem are a bit vague.  I
> am running 7.0-RC1, and I am seeing the errors periodically, and I am
> wondering if this is a known issue.  Note that smartctl does not report
> errors logged and gives a "PASSED" to the drive.  I am running at
> UDMA100 ATA.  Also, if it matters, I am using ZFS.

What you've shown is usually the sign of a disk-related problem.  It's
very obvious when it's just one disk reporting DMA errors.  You use ZFS,
so chances are you have more than one disk in a pool/volume -- there's
no indication ad1, ad4, ad6, etc. are failing, so this seems to indicate
something specific to ad0.

Manufacturers pick very lenient (non-aggressive) thresholds for error
conditions on disks, so a disk which is actively failing will very
commonly still show "PASSED" during SMART analysis.  To make matters
worse, most users I know read SMART stats incorrectly (they're easy to
misinterpret).

Can you please provide output of the following:

* smartctl -a /dev/ad0
* atacontrol cap ad0
* atacontrol info <ata0, ata1, etc. -- any controller used by ZFS>
* Relevant dmesg output that indicates what kind of ATA controller
  these disks are attached to.  Start with output from 'ad0:' and
  work backwards.  For example, ad0 on this machine is using an Intel
  ICH6 controller:
  atapci0: <Intel ICH6 SATA150 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 31.2 on pci0
  ata0: <ATA channel 0> on atapci0
  ad0: 238475MB <WDC WD2500KS-00MJB0 02.01C03> at ata0-master SATA150
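
If you'd rather not dig through dmesg by hand, the "start at 'ad0:' and
work backwards" step can be sketched in Python roughly as below.  This is
only a sketch: the function name controller_chain is made up for
illustration, and it assumes the dmesg lines look like the three shown
above (disk "at ataN-...", channel "on atapciN").  Other controllers or
drivers may print something different.

```python
import re


def controller_chain(dmesg_text, disk="ad0"):
    """Return the dmesg lines for a disk, its ATA channel, and its controller.

    Assumes lines shaped like the example above; returns a shorter list
    if any link in the chain cannot be found.
    """
    lines = [line.strip() for line in dmesg_text.splitlines()]

    def first_line_for(device):
        # First line that starts with "<device>:", or None if absent.
        return next((l for l in lines if l.startswith(device + ":")), None)

    chain = []
    disk_line = first_line_for(disk)
    if disk_line is None:
        return chain
    chain.append(disk_line)

    # "ad0: ... at ata0-master SATA150" -> channel name "ata0"
    match = re.search(r" at (ata\d+)-", disk_line)
    if not match:
        return chain
    channel_line = first_line_for(match.group(1))
    if channel_line is None:
        return chain
    chain.append(channel_line)

    # "ata0: <ATA channel 0> on atapci0" -> controller name "atapci0"
    match = re.search(r" on (\w+)$", channel_line)
    if not match:
        return chain
    controller_line = first_line_for(match.group(1))
    if controller_line is not None:
        chain.append(controller_line)
    return chain
```

Feed it the contents of /var/run/dmesg.boot (or saved "dmesg" output) and
it returns the disk, channel, and controller lines in that order.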

Other stuff:

SMART stats which are labelled "Offline" are only updated when a short
or long offline test is performed.  Have you tried running "smartctl -t
short /dev/ad0" and "smartctl -t long /dev/ad0" to see if any of the raw
values in the far-right column (RAW_VALUE) increment?
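
To make the before/after comparison less error-prone, here is a rough
Python sketch that diffs the raw values between two saved copies of the
attribute table (e.g. from "smartctl -A /dev/ad0" before and after a
test).  The function names are made up for illustration, and it assumes
the usual 10-column attribute table with RAW_VALUE last; only the leading
integer of each raw value is compared.

```python
import re


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from smartctl attribute-table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and begin with a numeric ID;
        # this skips the header row and everything outside the table.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # RAW_VALUE may carry trailing text; keep only the leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            attrs[fields[1]] = int(match.group())
    return attrs


def changed_attributes(before_text, after_text):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    old = parse_smart_attributes(before_text)
    new = parse_smart_attributes(after_text)
    return {
        name: (old[name], new[name])
        for name in new
        if name in old and new[name] != old[name]
    }
```

Save the smartctl output to a file before the test and again after it
completes, then pass both files' contents to changed_attributes().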

Have you tried running "zpool scrub" on the ZFS pool, then "zpool status"
to see if the READ/WRITE/CKSUM counters increment or if the "scrub" line
states there were errors?
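
If the pool has many devices, a small Python sketch like the one below
can pick out the rows of saved "zpool status" output whose counters are
non-zero.  The function name is made up for illustration, and it assumes
the usual five-column config rows (NAME STATE READ WRITE CKSUM); it does
not interpret the "scrub" line.

```python
import re

# A counter is a plain number, optionally with a decimal part and unit suffix.
COUNTER = re.compile(r"\d+(\.\d+)?[KMGTP]?")


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows have exactly five columns: NAME STATE READ WRITE CKSUM.
        if len(fields) != 5:
            continue
        counters = fields[2:5]
        # Skips the header row and any other line without numeric counters.
        if not all(COUNTER.fullmatch(c) for c in counters):
            continue
        if any(c != "0" for c in counters):
            flagged[fields[0]] = tuple(counters)
    return flagged
```

An empty result only means the counters are all zero; it says nothing
about whether the scrub itself reported errors.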

Other things which have fixed problems in the past for others:

* BIOS updates
* Change of motherboards (sometimes replacing the board with the same
  model, other times going with a completely different vendor, which
  implies weird implementation issues or BIOS problems)
* Changing SATA cables
* Getting a larger power supply (usually when lots of disks are involved)

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |