From owner-freebsd-current@FreeBSD.ORG Fri Feb 18 19:09:53 2005 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id A0F1A16A4CE for ; Fri, 18 Feb 2005 19:09:53 +0000 (GMT) Received: from csa.cs.okstate.edu (a.cs.okstate.edu [139.78.113.1]) by mx1.FreeBSD.org (Postfix) with ESMTP id 3AC2143D1F for ; Fri, 18 Feb 2005 19:09:53 +0000 (GMT) (envelope-from lreid@a.cs.okstate.edu) Received: by csa.cs.okstate.edu (Postfix, from userid 601) id E8EF2A063E; Fri, 18 Feb 2005 13:09:52 -0600 (CST) To: freebsd-current@freebsd.org Received: from 164.58.79.196 (auth. user lreid@a.cs.okstate.edu) by cs.okstate.edu with HTTP; Fri, 18 Feb 2005 13:09:52 -0600 X-IlohaMail-Blah: lreid@a.cs.okstate.edu X-IlohaMail-Method: mail() [mem] X-IlohaMail-Dummy: moo X-Mailer: IlohaMail/0.8.12 (On: cs.okstate.edu) From: "Reid Linnemann" Bounce-To: "Reid Linnemann" Errors-To: "Reid Linnemann" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Message-Id: <20050218190952.E8EF2A063E@csa.cs.okstate.edu> Date: Fri, 18 Feb 2005 13:09:52 -0600 (CST) Subject: Re: ad WRITE_DMA timing out frequently X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 18 Feb 2005 19:09:53 -0000 On 2/18/2005, "Paul Mather" wrote: >On Fri, 18 Feb 2005 09:03:35 -0600 (CST), "Reid Linnemann" > wrote: > >> I've recently brought a machine up from 5.3-STABLE to 6-CURRENT. It >> usually just sits in the corner and runs services, but lately I've >> come >> home form work or woken up to find that it is completely unresponsive, >> and I have to hard reset the machine. It happens at least once a day, >> and it's becoming more and more frequent. When I look at the console, >> I >> always have the same 4 messages before the failure: >> >> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599 >> ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2085599 >> kernel: ad0: FAILURE - WRITE_DMA timed out >> kernel: g_vfs_done():ad0s1d[WRITE(offset=52772864, length=16384)]error >> = 5 >> >> It seems to me that a sector on the disk might be dead in the ad0s1d >> slice (/var), but I want to be certain before I take further steps >> that >> the behavior I'm experiencing is positively unrelated to the migration >> to 6-CURRENT. >> >> I started poking around /var to see if anything was amiss, and I found >> that mail messages are being stacked up in /var/spool/clientmqueue, >> even >> though nothing should be using the msp queue (I've redirected periodic >> outputs to logfiles). In the last daily run mailed to root in >> January, >> I found records in the submit queue that looked like this: >> >> j0EDINHh049826 2489 Fri Jan 14 07:18 MAILER-DAEMON >> (Deferred: Permission denied) >> >> There were nearly 500 of them. >> >> Even after redirecting periodic output to logs and clearing out the >> client mail queue, this continues to happen, and I have a hunch that >> it >> may be related to the WRITE_DMA timeouts, as it's the only weird >> behavior I can see on /var. If anyone can help me shed some light on >> this, I'd appreciate it. I've had 2 IDE drives die in this machine >> already, I'm going to be severely depressed if I've killed a third. > >The "TIMEOUT - WRITE_DMA" issue has been a recurring problem for me >since somewhere in the 5.2.1--5.3 release range. (It's been so long now >that I don't remember whether it first started plaguing me in 5.2.1 or >5.3. I do know for definite I never got this problem in 5.1 and it only >crept in during an "upgrade.") > >Like you, this has been happening more frequently with 6-CURRENT for me. >As in your case, I come to find the machine completely unresponsive >(though still pingable) and I have to hard reset the machine. I'm >finding this is now happening roughly every other day on average for the >past week since my last system rebuild (FreeBSD 6.0-CURRENT #0: Fri Feb >11 09:03:49 EST 2005). > >In my case, I'm using geom_mirror to mirror two drives. The "TIMEOUT - >WRITE_DMA" involves the geom_mirror metadata sector on one of the two >drives, but not always the same one (sometimes it is ad0, sometimes >ad2). The net result is to cause the drive in question to be removed >from the mirror. Disappointingly, rather than carry on in degraded >fashion, lately the system seems eventually to seize up as you describe. >It doesn't seem to seize up immediately, because I notice an entry >in /var/log/messages after the error but before the required hard reset >reboot: > >Feb 18 05:24:38 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679 >Feb 18 05:24:43 zappa kernel: ad2: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=49981679 >Feb 18 05:24:43 zappa kernel: ad2: FAILURE - WRITE_DMA timed out >Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Cannot update metadata on disk ad2 (error=5). >Feb 18 05:24:43 zappa kernel: GEOM_MIRROR: Device raid1: provider ad2 disconnected. >Feb 18 09:46:35 zappa named[349]: zone ./IN: Transfer started. >Feb 18 09:46:35 zappa named[349]: transfer of './IN' from 128.9.0.107#53: connected using 192.168.1.25#64153 >Feb 18 09:46:37 zappa named[349]: zone ./IN: transferred serial 2005021800 >Feb 18 09:46:37 zappa named[349]: transfer of './IN' from 128.9.0.107#53: end of transfer >[[forced reboot]] >Feb 18 11:48:46 zappa syslogd: kernel boot file is /boot/kernel/kernel >Feb 18 11:48:46 zappa kernel: Copyright (c) 1992-2005 The FreeBSD Project. >Feb 18 11:48:46 zappa kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 >Feb 18 11:48:46 zappa kernel: The Regents of the University of California. All rights reserved. >Feb 18 11:48:46 zappa kernel: FreeBSD 6.0-CURRENT #0: Fri Feb 11 09:03:49 EST 2005 > > >I get this problem on 6-CURRENT and also RELENG_5. The RELENG_5 system >has a geom_vinum mirrored setup, and when the "TIMEOUT - WRITE_DMA" >occurs I lose the associated drive and plexes from the configuration. >The problem does not happen as often on the RELENG_5 system, as it does >on HEAD, at least nowadays it doesn't. > >I run smartctl on the systems, and none of the drives report any errors, >and the "WORST" values recorded are nowhere near close to their >respective failure thresholds. > >In my case, I have one area of commonality. Between the three different >systems on which I've experienced this problem, all use the Intel PIIX4 >ATA controller and the same IBM-DJNA-352500/J51OA30K hard drives. So, >I'm wondering if there is something about this particular combination >that gives rise to this annoying problem? > >I do use the same IBM-DJNA-352500/J51OA30K hard drives in another system >and have never experienced this (or any other) problem. However, it is >running 4.11-STABLE and has a "VIA 82C686 ATA66 controller", so it's >impossible to tell if it's 4.11-STABLE or the VIA ATA controller >contributing to the stability in that case. > >I don't think I have a hardware problem. The same setup ran fine under >earlier 5.x releases. But, somewhere, this issue crept in (I remember >threads on freebsd-current about it), and recently it appears to be >getting worse (at least for me). Also, unfortunately for me, >geom_mirror used to roll with the punches when I lost a drive through >this "TIMEOUT - WRITE_DMA" problem, but now it doesn't. :-( > >Cheers, > >Paul. >-- >e-mail: paul@gromit.dlib.vt.edu > >"Without music to decorate it, time is just a bunch of boring production > deadlines or dates by which bills must be paid." > --- Frank Vincent Zappa The disk I am using is an IBM as well: smartctl -a output yields this info on the device: Device Model: IBM-DPTA-372050 Serial Number: JMYJM131600 Firmware Version: P76OA30A My ATAPI controller is a VIA 82C686A as well. I have been running FreeBSD 4.3 up to 6-CURRENT with this controller without issue until now too. So I think we can assume that the problem was introduced in 5 and carried on through 6. I think I recall bumping into this with a Western Digital 10 gig disk a while back on 5.3-STABLE. I was under school pressure then and just dropped the drive out completely when I started getting a hung system and ad0 messages. I'll plug it back in as a slave this weekend and run smartctl on it.