Date: Mon, 27 Oct 2008 20:39:19 +0100
From: Miroslav Lachman <000.fbsd@quip.cz>
To: Jeremy Chadwick <koitsu@FreeBSD.org>
Cc: Vaclav Haisman <v.haisman@sh.cvut.cz>, freebsd-stable@freebsd.org
Subject: Re: Short SMART check causes disk op timeouts
Message-ID: <490618E7.8000905@quip.cz>
In-Reply-To: <20081027175337.GA27175@icarus.home.lan>
References: <4905951B.2050602@sh.cvut.cz> <20081027160828.GA24496@icarus.home.lan> <4905F8BB.3080302@sh.cvut.cz> <20081027175337.GA27175@icarus.home.lan>
Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
>
>>Jeremy Chadwick wrote:
>>
>>>On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>>>Second, your short offline test runs at 0300, but the errors you're
>>>seeing are at 0454 in the morning. A short offline test does not
>>>take 2 hours to run -- they take between 2-10 minutes -- unless the
>>>system is also in the middle of doing a lot of I/O, in which case the
>>>short test will be suspended.
>>>
>>>There are cronjobs (specifically periodic jobs) that run starting at
>>>0301 in the morning ("periodic daily"), and many of those are I/O bound.
>>>This could possibly extend the length of the short test until 0454.
>>>
>>>Weekly periodic jobs run at 0415 in the morning, on Sundays. These also
>>>perform a lot of disk I/O, so it's possible that on Sunday specifically
>>>the short SMART test gets pushed back quite some time.
>>>
>>>Third, the DMA timeouts you're seeing are possibly caused by the drive
>>>taking too long when internally suspending the SMART test.
>>>
>>>In most cases, it's safe for SMART tests (short and long) to be run
>>>while the machine is operational, and disk I/O requests are being
>>>performed. When an I/O request comes and the disk is in the middle of
>>>performing a SMART test, the drive has to stop the SMART test (e.g.
>>>"suspend" it), complete the I/O request, then resume the SMART test.
>>>
>>>The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
>>>doesn't receive an acknowledgement back from the controller (disk)
>>>within 5 seconds, it'll report a timeout on whatever operation it was
>>>performing. I'm thinking the disk gets stuck in a "do the offline
>>>test, no wait stop there's an I/O request, okay its done continue the
>>>test, no way stop there's another I/O" loop.
>>
>>Can I make the timeout higher? For the sake of elimination.
>
>
> You will have to make modifications to the ata(4) driver code, and
> rebuild+reinstall your kernel.
>
> There is a patch from the FreeNAS folks which turns the command timeout
> value into a sysctl for tuning, but that patch has not been brought into
> FreeBSD (any version) at this time. You can find it referenced below
> (see one of the "Workarounds" sections). You will probably have to
> apply the patch "by hand" rather than blindly using patch < patchfile,
> because the ATA code has changed since the patch was created.
>
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
>
>>>Another possibility is that your drive really *does* have a bad block at
>>>LBA 836986454, and that one of those cron/periodic jobs is what's
>>>noticing it, and that upon noticing a bad block, the drive more or less
>>>aborts the SMART test to perform internal remapping of the block.
>>>
>>>To confirm this, you would need to boot the SeaTools utilities from DOS
>>>or from a CD (see Seagate's site) and run a full sector scan (NOT the
>>>"quick" test). This takes a few hours. Assuming it comes back clean,
>>>then my above claim of the offline test taking too long to suspend is
>>>probably the case.
>>>
>>>Possibly this is a firmware bug in the drive -- you might consider
>>>mailing Seagate about this problem, although I'm doubting their Tier 1
>>>support will understand what the issue is.
>>>
>>>Is the block number always the same? Do you only see this error on
>>>Sundays? These are two questions which might help narrow things down.
>>
>>Nope, the LBA is always different and I see it in the logs once every day.
>
>
> Okay, so that greatly diminishes the possibility of it being a bad
> block. I'd still advocate running SeaTools on the disk to ensure
> everything is 100% okay (re: "sake of elimination"); chances are it will
> pass with flying colours.
>
>
>>>>This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>>>>kernel.
>>>>
>>>>Now, does the timeout cause loss of any data? Is there anything besides
>>>>disabling the testing that I can do about it?
>>>
>>>Do you understand what short and long offline tests actually do and what
>>>they're used for? :-) If so, you'd know that running them periodically
>>>is more or less silly (IMHO).
>>
>>I do not, not completely :) I think I have just copied the settings from
>>somewhere and only just tweaked it a bit whenever I have added a disk.
>
>
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision. I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.

It looks like a slightly modified example from smartd.conf.sample:

# First (primary) ATA/IDE hard disk. Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)

I am using a similar config without problems:

/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/04) -t -I 194

Miroslav Lachman
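
For anyone unsure when a given `-s` schedule actually fires: as far as I understand the smartd.conf documentation, the expression is matched against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with Monday=1, hour). The Python sketch below only illustrates that matching. It is not smartd's code, the function name `scheduled_tests` is made up for this example, and it assumes the whole string must match the expression.

```python
import re
from datetime import datetime

# Schedule taken from the /dev/ad4 line quoted above.
AD4_SCHEDULE = r"(S/../.././01|L/../../(3|6)/05)"


def scheduled_tests(schedule, when):
    """Return the test types (L, S, C, O) that the schedule selects at `when`.

    Hypothetical helper: builds the T/MM/DD/d/HH string for each test type
    and checks it against the regular expression (assumed full match).
    """
    selected = []
    for test_type in ("L", "S", "C", "O"):
        # isoweekday() numbers Monday as 1 and Sunday as 7.
        candidate = "{}/{:%m}/{:%d}/{}/{:%H}".format(
            test_type, when, when, when.isoweekday(), when
        )
        if re.fullmatch(schedule, candidate):
            selected.append(test_type)
    return selected


if __name__ == "__main__":
    # Monday 27 Oct 2008, 01:00: the daily short test is due.
    print(scheduled_tests(AD4_SCHEDULE, datetime(2008, 10, 27, 1)))
    # Wednesday 29 Oct 2008, 05:00: the long test is due.
    print(scheduled_tests(AD4_SCHEDULE, datetime(2008, 10, 29, 5)))
```

With the /dev/ad4 schedule this gives a short test every day in the 01:00 hour and a long test on Wednesdays and Saturdays in the 05:00 hour, and nothing at 03:00 when "periodic daily" starts.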