Date: Mon, 27 Oct 2008 09:08:28 -0700
From: Jeremy Chadwick
To: Vaclav Haisman
Cc: freebsd-stable@freebsd.org
Subject: Re: Short SMART check causes disk op timeouts
Message-ID: <20081027160828.GA24496@icarus.home.lan>
In-Reply-To: <4905951B.2050602@sh.cvut.cz>

On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav
Haisman wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> Hi,
> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
> enabled SMART checking using the smartmontools as usual for the disk
> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
> is that each time the test runs I get messages like the following in
> /var/log/messages:
>
> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
> left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
> retries left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
> LBA=836986454
> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
> length=16384)]error = 5
>
> And the SMART test results log on the disk contains line like this:
>
> # 1 Short offline Interrupted (host reset) 00% 297
> -

First and foremost, your smartd.conf -s flags above conflict. Your long
offline test will never get run on Sunday; the short will run first, and
the long won't ever start (because the short is already running). I
would recommend telling the short test to run only on days 1-6 (smartd
numbers weekdays 1-7, Monday through Sunday), leaving Sunday solely for
the long test. (I noticed this because the "Interrupted" test above
indicates a short test was interrupted and not a long.)

Second, your short offline test runs at 0300, but the errors you're
seeing are at 0454 in the morning. A short offline test does not take 2
hours to run -- it takes between 2 and 10 minutes -- unless the system
is also in the middle of doing a lot of I/O, in which case the short
test will be suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning ("periodic daily"), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.
These also perform a lot of disk I/O, so it's possible that on Sunday
specifically the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test. In most
cases, it's safe for SMART tests (short and long) to be run while the
machine is operational and disk I/O requests are being performed.

When an I/O request comes in and the disk is in the middle of performing
a SMART test, the drive has to stop the SMART test (i.e. "suspend" it),
complete the I/O request, then resume the SMART test. The FreeBSD ATA
layer has a 5 second timeout on I/O requests; if it doesn't receive an
acknowledgement back from the controller (disk) within 5 seconds, it'll
report a timeout on whatever operation it was performing. I'm thinking
the disk gets stuck in a "do the offline test, no wait, stop, there's an
I/O request, okay, it's done, continue the test, no wait, stop, there's
another I/O" loop.

Another possibility is that your drive really *does* have a bad block at
LBA 836986454, that one of those cron/periodic jobs is what's noticing
it, and that upon noticing a bad block, the drive more or less aborts
the SMART test to perform internal remapping of the block.

To confirm this, you would need to boot the SeaTools utilities from DOS
or from a CD (see Seagate's site) and run a full sector scan (NOT the
"quick" test). This takes a few hours. Assuming it comes back clean,
then my above claim of the offline test taking too long to suspend is
probably the case. Possibly this is a firmware bug in the drive -- you
might consider mailing Seagate about this problem, although I doubt
their Tier 1 support will understand what the issue is.

Is the block number always the same? Do you only see this error on
Sundays? These are two questions which might help narrow things down.

> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> kernel.
>
> Now, does the timeout cause loss of any data? Is there anything besides
> disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for? :-) If so, you'd know that running them periodically
is more or less silly (IMHO).

If you're trying to accomplish a cheap version of disk scrubbing, e.g.
scanning the entire disk for bad blocks and reporting them or having
them automatically remapped by the drive, consider using
sysutils/diskcheckd, which was made for this purpose. However, be aware
of a problem I've run into with it (still needs someone clueful to
figure out why this happens):

http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853

I do not advocate the use of periodic offline tests on disks, especially
at such aggressive intervals (daily). In fact, I don't even know why
Bruce added that option to smartd. There are only a few attributes in
SMART which get updated on offline tests, so I fail to see the point.

You shouldn't be doing what you're doing, IMHO. If you want to do these
tests once every 2 weeks or once a month, that'd be a better idea. Stick
with the short test, and do it during a time when disk I/O is very low
(try something like 7am on a Saturday). Don't go with 2am if your
system/environment honours Daylight Saving Time, because that could
cause the test to run twice.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
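
As a rough illustration of the smartd.conf scheduling conflict discussed in the reply, the sketch below checks which test types a `-s` pattern selects for a given hour. It assumes smartd compares the regex against strings of the form `T/MM/DD/d/HH`, with `d` running from 1 (Monday) to 7 (Sunday), and it uses Python's `re` module in place of the POSIX regex matching smartd itself performs, so treat it as an approximation. The `due_tests` helper and the `REVISED` pattern are hypothetical names introduced here; `REVISED` is just one way to restrict the short test to Monday through Saturday.

```python
import re
from datetime import datetime

# Pattern from the original smartd.conf line quoted in the thread.
ORIGINAL = r"(S/../.././03|L/../../7/03)"
# Hypothetical revision: short test Monday-Saturday, long test on Sunday.
REVISED = r"(S/../../[1-6]/03|L/../../7/03)"


def due_tests(schedule, when):
    """Return the test types ('L', 'S') whose T/MM/DD/d/HH string matches."""
    pattern = re.compile(schedule)
    due = []
    for test_type in ("L", "S"):
        # isoweekday(): 1 = Monday ... 7 = Sunday, the numbering assumed above.
        candidate = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if pattern.fullmatch(candidate):
            due.append(test_type)
    return due


sunday = datetime(2008, 10, 26, 3, 0)  # the Sunday on which the errors were logged
monday = datetime(2008, 10, 27, 3, 0)

print("original, Sunday 03h:", due_tests(ORIGINAL, sunday))
print("revised,  Sunday 03h:", due_tests(REVISED, sunday))
print("original, Monday 03h:", due_tests(ORIGINAL, monday))
print("revised,  Monday 03h:", due_tests(REVISED, monday))
```

With the original pattern, both the short and the long test match at 03h on Sunday, which is the overlap the reply points out; with the revised pattern only the long test matches on Sunday, and the short test still matches on the other six days. Which of two matching tests smartd actually starts is not modelled here.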