Date: Sun, 4 Oct 2009 12:23:58 +0200 (CEST)
From: Oliver Fromme <olli@lurza.secnetix.de>
To: freebsd-questions@FreeBSD.ORG, idmc_vivr@intgdev.com
Subject: Re: Every 12-hrs -- "ad0: TIMEOUT - WRITE DMA"
Message-ID: <200910041023.n94ANw9i008836@lurza.secnetix.de>
In-Reply-To: <W204334961192261140120631@webmail15>

This is a reply to a very old thread. I decided to reply because

 1. nobody has mentioned the real cause of the problem yet
    (some answers were misleading or even outright wrong),
 2. I've experienced the same problem in the past few weeks,
 3. my findings might be useful for other people who are
    googling for the symptoms (like me) and stumble across
    this thread.

The drive in question seems to be very popular, especially
in low-end private servers and home machines. It is very
reliable; I still have these and similar ones in production.

The drive of mine that exhibited the problem recently is this:

ad0: 24405MB <IBM DJNA-352500 J51OA30K> at ata0-master UDMA66

It is powering a small server running DNS, SMTP, WWW and
other things for several private domains. The load is very
low, most of the time.

Now for the actual problem:

V.I.Victor <idmc_vivr@intgdev.com> wrote:
> For the last 4-days, our (otherwise OK) 5.4-RELEASE machine has been
> reporting:
>
> Feb 12 12:08:05 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
> Feb 13 00:08:51 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2701279
> Feb 13 12:09:38 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2963331
> Feb 14 00:10:24 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2705947
> Feb 14 12:11:09 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2706335
> Feb 15 00:12:02 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
> Feb 15 12:12:57 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=139839
> Feb 16 00:13:50 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
> Feb 16 12:14:36 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=131391
>
> The system was created Jan 08 and, prior to the above, the ad0: timeout had
> only been reported twice:
>
> Jan 25 11:43:34 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=17920255
> Feb 6 11:59:42 : ad0: TIMEOUT - WRITE_DMA retrying (2 retries left) LBA=2832383
> [...]
> ad0: 14664MB <IBM-DJNA-351520/J56OA30K> [29795/16/63] at ata0-master UDMA66

First of all: The disk is *not* dying. "SMART" won't reveal
anything. The behaviour is perfectly normal for IBM-DJNA-3*
type disks.

When those disks are used in continuous operation (24/7),
they will go into automatic maintenance mode after 6 days.
This is kind of a short self-test and recalibration to ensure
reliable continuous operation. It will be repeated after
another 6 days ad infinitum.

Note that there are exactly 12 days between your Jan 25 and
Feb 6 incidents, and exactly 6 days between Feb 6 and Feb 12
incidents. An automatic maintenance on Jan 31 apparently
finished successfully without a timeout message.

Normally the drive will wait until it detects an idle period,
then perform the maintenance, then continue normal operation.
Maintenance mode involves a short spin down / spin up cycle.

However, if the drive receives a command during spin down,
it will abort maintenance mode, spin up (which takes a few
seconds and might cause a "timeout" to the operating system),
then perform the command, and RETRY MAINTENANCE AFTER 12 HOURS.

So that's where your timeout messages every 12 hours come
from. This is not in any way harmful. Eventually the
maintenance will succeed (i.e. the idle period is long enough
to finish), then you won't get timeout messages anymore for
at least 6 days.

You mentioned that the problem appeared (and disappeared)
when you set the machine's clock. This is easy to explain,
too. The hard disk has its own clock which is not
synchronized with the system clock. It starts counting from
zero when the disk is powered up. By changing the system's
clock, you shift the offset between it and the drive's clock.
That means that periodic activity will happen at different
times, relative to the drive's clock.

Such periodic activity includes cron jobs and other things.
For example, sendmail's queue runner wakes up every 30
minutes by default.
Many other daemons also perform periodic activity. All of
that can happen to start in the middle of the idle period
that the drive chose to use for its maintenance, thus
interrupting maintenance, as described above.

If the offset between the system's clock and the drive's
clock changes, chances are that such periodic activity will
happen at different times, from the point of view of the
drive, so the likelihood that the drive can complete its
maintenance changes (for better or worse).

Unfortunately there is no way to configure or disable that
maintenance mode. The only way to somewhat control it is
to periodically enforce a spin-down ("standby" ATA command)
when you know that the drive is idle. This usually requires
unmounting the filesystems, though, because otherwise you
can't guarantee that they will be idle for long enough.

You can read IBM's official documentation here:

http://www.hitachigst.com/tech/techlib.nsf/techdocs/85256AB8006A31E587256A7900618DED/$file/djna_sp.pdf

If that link doesn't work anymore, google for this:
"OEM HARD DISK DRIVE SPECIFICATIONS for DJNA-3xxxxx"

The maintenance mode is described in chapter 10.12 (page 99).

Best regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Commercial register: Registergericht Muenchen, HRA 74606, Management:
secnetix Verwaltungsgesellsch. mbH, Commercial register: Registergericht
München, HRB 125758, Managing directors: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD services, products and more: http://www.secnetix.de/bsd

"I started using PostgreSQL around a month ago, and the feeling is
similar to the switch from Linux to FreeBSD in '96 -- 'wow!'."
        -- Oddbjorn Steffensen
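
The interval pattern described in this message (a 6-day maintenance cycle and 12-hour retries) can be checked against the quoted log timestamps. The following Python sketch is only an illustration and not part of the original analysis. The year is an assumption, because syslog timestamps omit it, and the names `ASSUMED_YEAR`, `parse` and `intervals_in_half_days` are made up for this example.

```python
from datetime import datetime, timedelta

# syslog lines carry no year, so this value is an assumption.
# Any year yields the same gaps for these Jan/Feb timestamps.
ASSUMED_YEAR = 2006

# Timestamps of the timeout messages quoted in the message, in order.
STAMPS = [
    "Jan 25 11:43:34",
    "Feb 6 11:59:42",
    "Feb 12 12:08:05",
    "Feb 13 00:08:51",
    "Feb 13 12:09:38",
    "Feb 14 00:10:24",
    "Feb 14 12:11:09",
    "Feb 15 00:12:02",
    "Feb 15 12:12:57",
    "Feb 16 00:13:50",
    "Feb 16 12:14:36",
]


def parse(stamp, year=ASSUMED_YEAR):
    """Parse a syslog-style timestamp (English month names assumed)."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def intervals_in_half_days(stamps):
    """Return the gap between consecutive timeouts in units of 12 hours."""
    times = [parse(s) for s in stamps]
    half_day = timedelta(hours=12)
    return [(b - a) / half_day for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    for stamp, gap in zip(STAMPS[1:], intervals_in_half_days(STAMPS)):
        print(f"{stamp:>16}  gap = {gap:6.3f} x 12h  (about {round(gap)} x 12h)")
```

Under these assumptions the gaps come out close to 24, 12, and then 1 in units of 12 hours. That matches the 12-day and 6-day spacing of the early incidents and the 12-hour retries afterwards, with each gap slightly longer than a whole multiple.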