From owner-freebsd-current@FreeBSD.ORG  Fri Feb 18 19:09:53 2005
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A0F1A16A4CE
	for <freebsd-current@freebsd.org>;
	Fri, 18 Feb 2005 19:09:53 +0000 (GMT)
Received: from csa.cs.okstate.edu (a.cs.okstate.edu [139.78.113.1])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 3AC2143D1F
	for <freebsd-current@freebsd.org>;
	Fri, 18 Feb 2005 19:09:53 +0000 (GMT)
	(envelope-from lreid@a.cs.okstate.edu)
Received: by csa.cs.okstate.edu (Postfix, from userid 601)
	id E8EF2A063E; Fri, 18 Feb 2005 13:09:52 -0600 (CST)
To: freebsd-current@freebsd.org
Received: from 164.58.79.196 (auth. user lreid@a.cs.okstate.edu)
          by cs.okstate.edu with HTTP; Fri, 18 Feb 2005 13:09:52 -0600
X-IlohaMail-Blah: lreid@a.cs.okstate.edu
X-IlohaMail-Method: mail() [mem]
X-IlohaMail-Dummy: moo
X-Mailer: IlohaMail/0.8.12 (On: cs.okstate.edu)
From: "Reid Linnemann" <lreid@cs.okstate.edu>
Bounce-To: "Reid Linnemann" <lreid@cs.okstate.edu>
Errors-To: "Reid Linnemann" <lreid@cs.okstate.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Message-Id: <20050218190952.E8EF2A063E@csa.cs.okstate.edu>
Date: Fri, 18 Feb 2005 13:09:52 -0600 (CST)
Subject: Re: ad WRITE_DMA timing out frequently
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 18 Feb 2005 19:09:53 -0000


On 2/18/2005, "Paul Mather" <paul@gromit.dlib.vt.edu> wrote:

>On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann"
><lreid@cs.okstate.edu> wrote:
>
>> I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It
>> usually just sits in the corner and runs services, but lately I've
>> come home from work or woken up to find that it is completely
>> unresponsive, and I have to hard reset the machine. It happens at
>> least once a day, and it's becoming more and more frequent. When I
>> look at the console, I always have the same 4 messages before the
>> failure:
>>
>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599
>> kernel: ad0: FAILURE - WRITE_DMA timed out
>> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error = 5
>>
>> It seems to me that a sector on the disk might be dead in the ad0s1d
>> partition (/var), but before I take further steps I want to be
>> certain that the behavior I'm experiencing is unrelated to the
>> migration to 6-CURRENT.
>>
>> I started poking around /var to see if anything was amiss, and I found
>> that mail messages are being stacked up in /var/spool/clientmqueue,
>> even though nothing should be using the msp queue (I've redirected
>> periodic outputs to logfiles).  In the last daily run mailed to root
>> in January, I found records in the submit queue that looked like this:
>>
>> j0EDINHh049826     2489 Fri Jan 14 07:18 MAILER-DAEMON
>>                  (Deferred: Permission denied)
>>
>> There were nearly 500 of them.
>>
>> Even after redirecting periodic output to logs and clearing out the
>> client mail queue, this continues to happen, and I have a hunch that
>> it may be related to the WRITE_DMA timeouts, as it's the only weird
>> behavior I can see on /var. If anyone can help me shed some light on
>> this, I'd appreciate it. I've had 2 IDE drives die in this machine
>> already, and I'm going to be severely depressed if I've killed a
>> third.
>
>The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me
>since somewhere in the 5.2.1--5.3 release range.  (It's been so long now
>that I don't remember whether it first started plaguing me in 5.2.1 or
>5.3.  I do know for definite I never got this problem in 5.1 and it only
>crept in during an "upgrade.")
>
>Like you, this has been happening more frequently with 6-CURRENT for me.
>As in your case, I come to find the machine completely unresponsive
>(though still pingable) and I have to hard reset the machine.  I'm
>finding this is now happening roughly every other day on average for the
>past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb
>11 09:03:49 EST 2005).
>
>In my case, I'm using geom_mirror to mirror two drives.  The "TIMEOUT -
>WRITE_DMA" involves the geom_mirror metadata sector on one of the two
>drives, but not always the same one (sometimes it is ad0, sometimes
>ad2).  The net result is to cause the drive in question to be removed
>from the mirror.  Disappointingly, rather than carry on in degraded
>fashion, lately the system seems eventually to seize up as you describe.
>It doesn't seem to seize up immediately, because I notice an entry
>in /var/log/messages after the error but before the required hard reset
>reboot:
>
>Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
>Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679
>Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out
>Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5).
>Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected.
>Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started.
>Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153
>Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800
>Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer
>[[forced reboot]]
>Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel
>Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project.
>Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
>Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved.
>Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005
>
>
>I get this problem on 6-CURRENT and also RELENG_5.  The RELENG_5 system
>has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA"
>occurs I lose the associated drive and plexes from the configuration.
>The problem does not happen as often on the RELENG_5 system as it does
>on HEAD, at least not nowadays.
>
>I run smartctl on the systems, and none of the drives report any errors,
>and the "WORST" values recorded are nowhere near close to their
>respective failure thresholds.
>
>In my case, I have one area of commonality.  Between the three different
>systems on which I've experienced this problem, all use the Intel PIIX4
>ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives.  So,
>I'm wondering if there is something about this particular combination
>that gives rise to this annoying problem?
>
>I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system
>and have never experienced this (or any other) problem.  However, it is
>running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's
>impossible to tell if it's 4.11-STABLE or the VIA ATA controller
>contributing to the stability in that case.
>
>I don't think I have a hardware problem.  The same setup ran fine under
>earlier 5.x releases.  But, somewhere, this issue crept in (I remember
>threads on freebsd-current about it), and recently it appears to be
>getting worse (at least for me).  Also, unfortunately for me,
>geom_mirror used to roll with the punches when I lost a drive through
>this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-(
>
>Cheers,
>
>Paul.
>--
>e-mail: paul@gromit.dlib.vt.edu
>
>"Without music to decorate it, time is just a bunch of boring production
> deadlines or dates by which bills must be paid."
>        --- Frank Vincent Zappa

The disk I am using is an IBM as well:

smartctl -a output yields this info on the device:

Device Model:     IBM-DPTA-372050
Serial Number:    JMYJM131600
Firmware Version: P76OA30A
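
On Paul's point about the "WORST" values being nowhere near their thresholds: a rough way to check that margin from the attribute table is sketched below. It assumes the usual column layout of the smartctl attribute table (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, ...). The function name smart_margins is my own, and the two sample rows are made up for illustration, not taken from either of our drives.

```python
def smart_margins(table_text):
    """Return {attribute_name: WORST - THRESH} for each attribute row."""
    margins = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row and any other text are skipped.
        if len(fields) >= 10 and fields[0].isdigit():
            name = fields[1]
            worst = int(fields[4])
            thresh = int(fields[5])
            margins[name] = worst - thresh
    return margins


# Made-up rows for illustration only (NOT output from the drives in this thread).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   060    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
"""

if __name__ == "__main__":
    for name, margin in smart_margins(SAMPLE).items():
        print(f"{name}: WORST is {margin} above THRESH")
```

A margin that is small or negative would point at the drive itself; large margins on every attribute, as Paul describes, make a dying disk less likely but do not rule it out.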

My ATA controller is a VIA 82C686A as well.  I have run everything from
FreeBSD 4.3 up to 6-CURRENT on this controller without issue until now.
So I think we can assume that the problem was introduced in 5 and
carried on through 6. I think I recall bumping into this with a Western
Digital 10 gig disk a while back on 5.3-STABLE. I was under school
pressure then and just dropped the drive out completely when I started
getting a hung system and ad0 messages. I'll plug it back in as a slave
this weekend and run smartctl on it.
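
In the meantime, to see whether the timeouts keep landing on the same LBA or wander around the disk, something like the following could tally the messages. This is only a sketch: the regular expression is written against the lines quoted in this thread (I am assuming a singular "retry" form can also appear), and the name tally_timeouts is my own.

```python
import re
from collections import Counter

# Matches the timeout lines quoted in this thread, e.g.
# "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599"
# The "retry" singular alternative is an assumption.
TIMEOUT_RE = re.compile(
    r"\b(?P<dev>ad\d+): TIMEOUT - (?P<cmd>\w+) retrying "
    r"\((?P<left>\d+) retr(?:y|ies) left\) LBA=(?P<lba>\d+)"
)


def tally_timeouts(lines):
    """Count timeout messages per (device, command, LBA)."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            key = (m.group("dev"), m.group("cmd"), int(m.group("lba")))
            counts[key] += 1
    return counts


# Lines copied from the messages earlier in this thread.
SAMPLE = [
    "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599",
    "ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599",
    "kernel: ad0: FAILURE - WRITE_DMA timed out",
    "Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679",
    "Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679",
]

if __name__ == "__main__":
    for (dev, cmd, lba), n in sorted(tally_timeouts(SAMPLE).items()):
        print(f"{dev} {cmd} LBA={lba}: {n} timeout(s)")
```

Feeding it the lines from /var/log/messages instead of SAMPLE should show whether one sector is the repeat offender (which would suggest the disk) or the LBAs vary from one hang to the next (which would point more toward the driver or controller).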