Date: Mon, 27 Oct 2008 18:22:03 +0100
From: Vaclav Haisman <v.haisman@sh.cvut.cz>
Cc: freebsd-stable@freebsd.org
Subject: Re: Short SMART check causes disk op timeouts
Message-ID: <4905F8BB.3080302@sh.cvut.cz>
In-Reply-To: <20081027160828.GA24496@icarus.home.lan>
References: <4905951B.2050602@sh.cvut.cz> <20081027160828.GA24496@icarus.home.lan>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>>
>> Hi,
>> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
>> enabled SMART checking using the smartmontools as usual for the disk
>> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
>> is that each time the test runs I get messages like the following in
>> /var/log/messages:
>>
>> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
>> left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
>> retries left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
>> LBA=836986454
>> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
>> length=16384)]error = 5
>>
>> And the SMART test results log on the disk contains line like this:
>>
>> # 1 Short offline Interrupted (host reset) 00% 297
>> -
>
> First and foremost, your above smartd.conf -s flags are conflicting.
> Your long offline test will never get run on Sunday; the short will run
> first, and the long won't ever start (because the short is already
> running). I would recommend telling the short test to run only between
> days 0-6, leaving Sunday solely for the long test. (I noticed this
> because the above "Interrupted" test indicates a short test was
> interrupted and not a long).

Thanks, I have not noticed the overlap at all.

>
> Second, your short offline test runs at 0300, but the errors you're
> seeing are at 0454 in the morning. A short offline test does not
> take 2 hours to run -- they take between 2-10 minutes -- unless the
> system is also in the middle of doing a lot of I/O, in which case the
> short test will be suspended.
>
> There are cronjobs (specifically periodic jobs) that run starting at
> 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> This could possibly extend the length of the short test until 0454.
>
> Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
> perform a lot of disk I/O, so it's possible that on Sunday specifically
> the short SMART test gets pushed back quite some time.
>
> Third, the DMA timeouts you're seeing are possibly caused by the drive
> taking too long when internally suspending the SMART test.
>
> In most cases, it's safe for SMART tests (short and long) to be run
> while the machine is operational, and disk I/O requests are being
> performed. When an I/O request comes and the disk is in the middle of
> performing a SMART test, the drive has to stop the SMART test (e.g.
> "suspend" it), complete the I/O request, then resume the SMART test.
>
> The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> doesn't receive an acknowledgement back from the controller (disk)
> within 5 seconds, it'll report a timeout on whatever operation it was
> performing. I'm thinking the disk gets stuck in a "do the offline
> test, no wait stop there's an I/O request, okay its done continue the
> test, no way stop there's another I/O" loop.

Can I make the timeout higher? For the sake of elimination.

>
> Another possibility is that your drive really *does* have a bad block at
> LBA 836986454, and that one of those cron/periodic jobs is what's
> noticing it, and that upon noticing a bad block, the drive more or less
> aborts the SMART test to perform internal remapping of the block.
>
> To confirm this, you would need to boot the SeaTools utilities from DOS
> or from a CD (see Seagate's site) and run a full sector scan (NOT the
> "quick" test). This takes a few hours. Assuming it comes back clean,
> then my above claim of the offline test taking too long to suspend is
> probably the case.
>
> Possibly this is a firmware bug in the drive -- you might consider
> mailing Seagate about this problem, although I'm doubting their Tier 1
> support will understand what the issue is.
>
> Is the block number always the same? Do you only see this error on
> Sundays? These are two questions which might help narrow things down.

Nope, the LBA is always different and I see it in the logs once every day.

>
>> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>> kernel.
>>
>> Now, does the timeout cause loss of any data? Is there anything besides
>> disabling the testing that I can do about it?
>
> Do you understand what short and long offline tests actually do and what
> they're used for? :-) If so, you'd know that running them periodically
> is more or less silly (IMHO).

I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

>
> If you're trying to accomplish a cheap version of disk scrubbing, e.g.
> scanning the entire disk for bad blocks and report them or have them
> automatically remapped by the drive, consider using sysutils/diskcheckd,
> which was made for this purpose. However, be aware of a problem I've
> run into with it (still needs someone clueful to figure out why this
> happens):
> http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853
>
> I do not advocate the use of periodic offline tests on disks, especially
> at such aggressive intervals (daily). In fact, I don't even know why
> Bruce added that option to smartd. There are only a few attributes in
> SMART which get updated on offline tests, so I cease to see the point.
>
> You shouldn't be doing what you're doing, IMHO. If you want to do
> these tests once every 2 weeks or once a month, that'd be a better idea.
> Stick with the short test, and do it during a time when disk I/O is
> very low (try something like 7am on a Saturday). Don't go with 2am
> if your system/environment honours Daylight Saving Time, because that
> could cause the test to run twice.

Ok, I am taking the advice and I have set longer intervals of checking.
Thanks for such extensive answer.

- --
VH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (FreeBSD)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iFYEAREIAAYFAkkF+LoACgkQhQBMvHf/WHmX3ADfTosXsJI0wAKl1MT7PCvBpmOm
WnK9GavuuFsptwDgnjD0+tLGkZ2EEXjiXnvN/6wkz+wMWPCXYcHpGQ==
=oDRL
-----END PGP SIGNATURE-----