From owner-freebsd-current@FreeBSD.ORG Fri Feb 18 18:47:35 2005 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 8BE5416A4CE for ; Fri, 18 Feb 2005 18:47:35 +0000 (GMT) Received: from gromit.dlib.vt.edu (gromit.dlib.vt.edu [128.173.49.29]) by mx1.FreeBSD.org (Postfix) with ESMTP id 01CB143D45 for ; Fri, 18 Feb 2005 18:47:35 +0000 (GMT) (envelope-from paul@gromit.dlib.vt.edu) Received: from zappa.Chelsea-Ct.Org (pool-151-199-113-125.roa.east.verizon.net [151.199.113.125]) by gromit.dlib.vt.edu (8.13.1/8.13.1) with ESMTP id j1IIlVBt095877 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 18 Feb 2005 13:47:32 -0500 (EST) (envelope-from paul@gromit.dlib.vt.edu) Received: from zappa.Chelsea-Ct.Org (localhost.Chelsea-Ct.Org [127.0.0.1]) by zappa.Chelsea-Ct.Org (8.13.1/8.13.1) with ESMTP id j1IIlQg5001203 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 18 Feb 2005 13:47:26 -0500 (EST) (envelope-from paul@gromit.dlib.vt.edu) Received: (from paul@localhost) by zappa.Chelsea-Ct.Org (8.13.1/8.13.1/Submit) id j1IIlPD5001202; Fri, 18 Feb 2005 13:47:25 -0500 (EST) (envelope-from paul@gromit.dlib.vt.edu) X-Authentication-Warning: zappa.Chelsea-Ct.Org: paul set sender to paul@gromit.dlib.vt.edu using -f From: Paul Mather To: freebsd-current@freebsd.org In-Reply-To: <20050218175318.3681916A4CF@hub.freebsd.org> References: <20050218175318.3681916A4CF@hub.freebsd.org> Content-Type: text/plain Content-Transfer-Encoding: 7bit Date: Fri, 18 Feb 2005 13:47:25 -0500 Message-Id: <1108752445.1105.34.camel@zappa.Chelsea-Ct.Org> Mime-Version: 1.0 X-Mailer: Evolution 2.0.3 FreeBSD GNOME Team Port cc: Reid Linnemann Subject: Re: ad WRITE_DMA timing out frequently X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 18 Feb 2005 18:47:35 -0000 On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann" wrote: > I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It > usually just sits in the corner and runs services, but lately I've > come > home form work or woken up to find that it is completely unresponsive, > and I have to hard reset the machine. It happens at least once a day, > and it's becoming more and more frequent. When I look at the console, > I > always have the same 4 messages before the failure: > > ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599 > ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599 > kernel: ad0: FAILURE - WRITE_DMA timed out > kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error > = 5 > > It seems to me that a sector on the disk might be dead in the ad0s1d > slice (/var), but I want to be certain before I take further steps > that > the behavior I'm experiencing is positively unrelated to the migration > to 6-CURRENT. > > I started poking around /var to see if anything was amiss, and I found > that mail messages are being stacked up in /var/spool/clientmqueue, > even > though nothing should be using the msp queue (I've redirected periodic > outputs to logfiles). In the last daily run mailed to root in > January, > I found records in the submit queue that looked like this: > > j0EDINHh049826 2489 Fri Jan 14 07:18 MAILER-DAEMON > (Deferred: Permission denied) > > There were nearly 500 of them. > > Even after redirecting periodic output to logs and clearing out the > client mail queue, this continues to happen, and I have a hunch that > it > may be related to the WRITE_DMA timeouts, as it's the only weird > behavior I can see on /var. If anyone can help me shed some light on > this, I'd appreciate it. I've had 2 IDE drives die in this machine > already, I'm going to be severely depressed if I've killed a third. The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me since somewhere in the 5.2.1--5.3 release range. (It's been so long now that I don't remember whether it first started plaguing me in 5.2.1 or 5.3. I do know for definite I never got this problem in 5.1 and it only crept in during an "upgrade.") Like you, this has been happening more frequently with 6-CURRENT for me. As in your case, I come to find the machine completely unresponsive (though still pingable) and I have to hard reset the machine. I'm finding this is now happening roughly every other day on average for the past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005). In my case, I'm using geom_mirror to mirror two drives. The "TIMEOUT - WRITE_DMA" involves the geom_mirror metadata sector on one of the two drives, but not always the same one (sometimes it is ad0, sometimes ad2). The net result is to cause the drive in question to be removed from the mirror. Disappointingly, rather than carry on in degraded fashion, lately the system seems eventually to seize up as you describe. It doesn't seem to seize up immediately, because I notice an entry in /var/log/messages after the error but before the required hard reset reboot: Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679 Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679 Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5). Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected. Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started. Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153 Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800 Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer [[forced reboot]] Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project. Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved. Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005 I get this problem on 6-CURRENT and also RELENG_5. The RELENG_5 system has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA" occurs I lose the associated drive and plexes from the configuration. The problem does not happen as often on the RELENG_5 system, as it does on HEAD, at least nowadays it doesn't. I run smartctl on the systems, and none of the drives report any errors, and the "WORST" values recorded are nowhere near close to their respective failure thresholds. In my case, I have one area of commonality. Between the three different systems on which I've experienced this problem, all use the Intel PIIX4 ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives. So, I'm wondering if there is something about this particular combination that gives rise to this annoying problem? I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system and have never experienced this (or any other) problem. However, it is running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's impossible to tell if it's 4.11-STABLE or the VIA ATA controller contributing to the stability in that case. I don't think I have a hardware problem. The same setup ran fine under earlier 5.x releases. But, somewhere, this issue crept in (I remember threads on freebsd-current about it), and recently it appears to be getting worse (at least for me). Also, unfortunately for me, geom_mirror used to roll with the punches when I lost a drive through this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-( Cheers, Paul. -- e-mail: paul@gromit.dlib.vt.edu "Without music to decorate it, time is just a bunch of boring production deadlines or dates by which bills must be paid." --- Frank Vincent Zappa