From owner-freebsd-current@FreeBSD.ORG  Fri Feb 18 18:47:35 2005
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8BE5416A4CE
	for <freebsd-current@freebsd.org>;
	Fri, 18 Feb 2005 18:47:35 +0000 (GMT)
Received: from gromit.dlib.vt.edu (gromit.dlib.vt.edu [128.173.49.29])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 01CB143D45
	for <freebsd-current@freebsd.org>;
	Fri, 18 Feb 2005 18:47:35 +0000 (GMT)
	(envelope-from paul@gromit.dlib.vt.edu)
Received: from zappa.Chelsea-Ct.Org (pool-151-199-113-125.roa.east.verizon.net
	[151.199.113.125])
	by gromit.dlib.vt.edu (8.13.1/8.13.1) with ESMTP id j1IIlVBt095877
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Fri, 18 Feb 2005 13:47:32 -0500 (EST)
	(envelope-from paul@gromit.dlib.vt.edu)
Received: from zappa.Chelsea-Ct.Org (localhost.Chelsea-Ct.Org [127.0.0.1])
	by zappa.Chelsea-Ct.Org (8.13.1/8.13.1) with ESMTP id j1IIlQg5001203
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Fri, 18 Feb 2005 13:47:26 -0500 (EST)
	(envelope-from paul@gromit.dlib.vt.edu)
Received: (from paul@localhost)
	by zappa.Chelsea-Ct.Org (8.13.1/8.13.1/Submit) id j1IIlPD5001202;
	Fri, 18 Feb 2005 13:47:25 -0500 (EST)
	(envelope-from paul@gromit.dlib.vt.edu)
X-Authentication-Warning: zappa.Chelsea-Ct.Org: paul set sender to
	paul@gromit.dlib.vt.edu using -f
From: Paul Mather <paul@gromit.dlib.vt.edu>
To: freebsd-current@freebsd.org
In-Reply-To: <20050218175318.3681916A4CF@hub.freebsd.org>
References: <20050218175318.3681916A4CF@hub.freebsd.org>
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Date: Fri, 18 Feb 2005 13:47:25 -0500
Message-Id: <1108752445.1105.34.camel@zappa.Chelsea-Ct.Org>
Mime-Version: 1.0
X-Mailer: Evolution 2.0.3 FreeBSD GNOME Team Port 
cc: Reid Linnemann <lreid@a.cs.okstate.edu>
Subject: Re: ad WRITE_DMA timing out frequently
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Feb 2005 18:47:35 -0000

On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann"
<lreid@cs.okstate.edu> wrote:

> I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
> usually just sits in the corner and runs services, but lately I've
> come
> home form work or woken up to find that it is completely unresponsive,
> and I have to hard reset the machine. It happens at least once a day,
> and it's becoming more and more frequent. When I look at the console,
> I
> always have the same 4 messages before the failure:
> 
> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>  kernel: ad0: FAILURE - WRITE_DMA timed out
> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error
> = 5
> 
> It seems to me that a sector on the disk might be dead in the ad0s1d
> slice (/var), but I want to be certain before I take further steps
> that
> the behavior I'm experiencing is positively unrelated to the migration
> to 6-CURRENT.
> 
> I started poking around /var to see if anything was amiss, and I found
> that mail messages are being stacked up in /var/spool/clientmqueue,
> even
> though nothing should be using the msp queue (I've redirected periodic
> outputs to logfiles).  In the last daily run mailed to root in
> January,
> I found records in the submit queue that looked like this:
> 
> j0EDINHh049826     2489 Fri Jan 14 07:18 MAILER-DAEMON
>                  (Deferred: Permission denied)
> 
> There were nearly 500 of them.
> 
> Even after redirecting periodic output to logs and clearing out the
> client mail queue, this continues to happen, and I have a hunch that
> it
> may be related to the WRITE_DMA timeouts, as it's the only weird
> behavior I can see on /var. If anyone can help me shed some light on
> this, I'd appreciate it. I've had 2 IDE drives die in this machine
> already, I'm going to be severely depressed if I've killed a third.

The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
since somewhere in the 5.2.1--5.3 release range.  (It's been so long now
that I don't remember whether it first started plaguing me in 5.2.1 or
5.3.  I do know for definite I never got this problem in 5.1 and it only
crept in during an "upgrade.")

Like you, this has been happening more frequently with 6-CURRENT for me.
As in your case, I come to find the machine completely unresponsive
(though still pingable) and I have to hard reset the machine.  I'm
finding this is now happening roughly every other day on average for the
past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb
11 09:03:49 EST 2005).

In my case, I'm using geom_mirror to mirror two drives.  The "TIMEOUT -
WRITE_DMA" involves the geom_mirror metadata sector on one of the two
drives, but not always the same one (sometimes it is ad0, sometimes
ad2).  The net result is to cause the drive in question to be removed
from the mirror.  Disappointingly, rather than carry on in degraded
fashion, lately the system seems eventually to seize up as you describe.
It doesn't seem to seize up immediately, because I notice an entry
in /var/log/messages after the error but before the required hard reset
reboot:

Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out
Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5).
Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected.
Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started.
Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153
Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800
Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer
[[forced reboot]]
Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel
Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project.
Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved.
Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005


I get this problem on 6-CURRENT and also RELENG_5.  The RELENG_5 system
has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA"
occurs I lose the associated drive and plexes from the configuration.
The problem does not happen as often on the RELENG_5 system, as it does
on HEAD, at least nowadays it doesn't.

I run smartctl on the systems, and none of the drives report any errors,
and the "WORST" values recorded are nowhere near close to their
respective failure thresholds.

In my case, I have one area of commonality.  Between the three different
systems on which I've experienced this problem, all use the Intel PIIX4
ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives.  So,
I'm wondering if there is something about this particular combination
that gives rise to this annoying problem?

I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system
and have never experienced this (or any other) problem.  However, it is
running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's
impossible to tell if it's 4.11-STABLE or the VIA ATA controller
contributing to the stability in that case.

I don't think I have a hardware problem.  The same setup ran fine under
earlier 5.x releases.  But, somewhere, this issue crept in (I remember
threads on freebsd-current about it), and recently it appears to be
getting worse (at least for me).  Also, unfortunately for me,
geom_mirror used to roll with the punches when I lost a drive through
this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-(

Cheers,

Paul.
-- 
e-mail: paul@gromit.dlib.vt.edu

"Without music to decorate it, time is just a bunch of boring production
 deadlines or dates by which bills must be paid."
        --- Frank Vincent Zappa