Date: Sun, 4 Oct 2009 12:23:58 +0200 (CEST)
Message-Id: <200910041023.n94ANw9i008836@lurza.secnetix.de>
From: Oliver Fromme
To: freebsd-questions@FreeBSD.ORG, idmc_vivr@intgdev.com
Subject: Re: Every 12-hrs -- "ad0: TIMEOUT - WRITE DMA"

This is a reply to a very old thread.  I decided to reply because

1. nobody has mentioned the real cause of the problem yet (some
   answers were misleading or even outright wrong),
2. I've experienced the same problem in the past few weeks, and
3.
my findings might be useful for other people who are googling
   for the symptoms (like me) and stumble across this thread.

The drive in question seems to be very popular, especially in low-end
private servers and home machines.  It is very reliable; I still have
these and similar ones in production.  The drive of mine that exhibited
the problem recently is this:

ad0: 24405MB at ata0-master UDMA66

It is powering a small server running DNS, SMTP, WWW and other things
for several private domains.  The load is very low most of the time.

Now for the actual problem:

V.I.Victor wrote:
 > For the last 4-days, our (otherwise OK) 5.4-RELEASE machine has been
 > reporting:
 > 
 > Feb 12 12:08:05 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
 > Feb 13 00:08:51 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
 > Feb 13 12:09:38 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2963331
 > Feb 14 00:10:24 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2705947
 > Feb 14 12:11:09 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2706335
 > Feb 15 00:12:02 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
 > Feb 15 12:12:57 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=139839
 > Feb 16 00:13:50 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
 > Feb 16 12:14:36 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
 > 
 > The system was created Jan 08 and, prior to the above, the ad0: timeout had
 > only been reported twice:
 > 
 > Jan 25 11:43:34 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=17920255
 > Feb  6 11:59:42 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
 > [...]
 > ad0: 14664MB [29795/16/63] at ata0-master UDMA66

First of all: the disk is *not* dying.  "SMART" won't reveal anything.
This behaviour is perfectly normal for IBM-DJNA-3* type disks.  When
those disks are used in continuous operation (24/7), they will go into
an automatic maintenance mode after 6 days.
This is a kind of short self-test and recalibration to ensure reliable
continuous operation.  It is repeated after another 6 days, ad
infinitum.  Note that there are exactly 12 days between your Jan 25 and
Feb 6 incidents, and exactly 6 days between the Feb 6 and Feb 12
incidents.  An automatic maintenance run on Jan 31 apparently finished
successfully without a timeout message.

Normally the drive waits until it detects an idle period, then performs
the maintenance, then continues normal operation.  Maintenance mode
involves a short spin-down / spin-up cycle.  However, if the drive
receives a command during the spin-down, it aborts maintenance mode,
spins up (which takes a few seconds and may look like a "timeout" to
the operating system), performs the command, and RETRIES MAINTENANCE
AFTER 12 HOURS.  That is where your timeout messages every 12 hours
come from.

This is not in any way harmful.  Eventually the maintenance will
succeed (i.e. an idle period will be long enough for it to finish), and
then you won't get timeout messages anymore for at least 6 days.

You mentioned that the problem appeared (and disappeared) when you set
the machine's clock.  This is easy to explain, too.  The hard disk has
its own clock, which is not synchronized with the system clock; it
starts counting from zero when the disk is powered up.  By changing the
system clock, you shift the offset between it and the drive's clock.
That means that periodic activity will happen at different times,
relative to the drive's clock.  Such periodic activity includes cron
jobs and other things.  For example, sendmail's queue runner wakes up
every 30 minutes by default, and many other daemons also perform
periodic activity.  Any of that can happen to start in the middle of
the idle period that the drive chose to use for its maintenance, thus
interrupting the maintenance, as described above.
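By the way, the 12-hour retry interval and the 6-day maintenance cycle
can be checked directly against the quoted syslog timestamps.  A quick
sketch (Python; the year 2005 is assumed arbitrarily here, because
syslog lines don't record the year):

```python
from datetime import datetime

def parse(stamp, year=2005):
    # syslog timestamps omit the year; one is assumed only for parsing.
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# Timestamps of the "ad0: TIMEOUT - WRITE_DMA" messages quoted above.
retries = [parse(s) for s in (
    "Feb 12 12:08:05", "Feb 13 00:08:51", "Feb 13 12:09:38",
    "Feb 14 00:10:24", "Feb 14 12:11:09", "Feb 15 00:12:02",
    "Feb 15 12:12:57", "Feb 16 00:13:50", "Feb 16 12:14:36",
)]

# Consecutive retries are almost exactly 12 hours apart.
for a, b in zip(retries, retries[1:]):
    hours = (b - a).total_seconds() / 3600
    print(f"{a:%b %d %H:%M} -> {b:%b %d %H:%M}: {hours:.2f} h")

# The two earlier incidents line up with the 6-day cycle:
# Jan 25 -> Feb 6 is 12 days (two cycles), Feb 6 -> Feb 12 is 6 days.
incidents = [parse(s) for s in
             ("Jan 25 11:43:34", "Feb 6 11:59:42", "Feb 12 12:08:05")]
for a, b in zip(incidents, incidents[1:]):
    print(f"{a:%b %d} -> {b:%b %d}: {(b - a).days} days")
```

Every consecutive pair of retries comes out at roughly 12 hours (the
extra minute or so per interval is just the delay until the next write
triggered the retry), and the incident dates land on exact multiples of
the 6-day cycle.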
If the offset between the system clock and the drive's clock changes,
chances are that such periodic activity will happen at different times
from the drive's point of view, so the likelihood that the drive can
complete its maintenance changes (for better or worse).

Unfortunately there is no way to configure or disable that maintenance
mode.  The only way to somewhat control it is to periodically force a
spin-down (the "standby" ATA command) when you know that the drive is
idle.  This usually requires unmounting the file systems, though,
because otherwise you can't guarantee that they will be idle for long
enough.

You can read IBM's official documentation here:

http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A7900618DED/$file/djna_sp.pdf

If that link doesn't work anymore, google for this:
"OEM HARD DISK DRIVE SPECIFICATIONS for DJNA-3xxxxx"

The maintenance mode is described in chapter 10.12 (page 99).

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"I started using PostgreSQL around a month ago, and the feeling is
similar to the switch from Linux to FreeBSD in '96 -- 'wow!'."
        -- Oddbjorn Steffensen