Date:      Mon, 27 Oct 2008 20:39:19 +0100
From:      Miroslav Lachman <000.fbsd@quip.cz>
To:        Jeremy Chadwick <koitsu@FreeBSD.org>
Cc:        Vaclav Haisman <v.haisman@sh.cvut.cz>, freebsd-stable@freebsd.org
Subject:   Re: Short SMART check causes disk op timeouts
Message-ID:  <490618E7.8000905@quip.cz>
In-Reply-To: <20081027175337.GA27175@icarus.home.lan>
References:  <4905951B.2050602@sh.cvut.cz>	<20081027160828.GA24496@icarus.home.lan>	<4905F8BB.3080302@sh.cvut.cz> <20081027175337.GA27175@icarus.home.lan>

Jeremy Chadwick wrote:

> On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
> 
>>Jeremy Chadwick wrote:
>>
>>>On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>>>Second, your short offline test runs at 0300, but the errors you're
>>>seeing are at 0454 in the morning.  A short offline test does not
>>>take 2 hours to run -- they take between 2-10 minutes -- unless the
>>>system is also in the middle of doing a lot of I/O, in which case the
>>>short test will be suspended.
>>>
>>>There are cronjobs (specifically periodic jobs) that run starting at
>>>0301 in the morning ("periodic daily"), and many of those are I/O bound.
>>>This could possibly extend the length of the short test until 0454.
>>>
>>>Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
>>>perform a lot of disk I/O, so it's possible that on Sunday specifically
>>>the short SMART test gets pushed back quite some time.
>>>
>>>Third, the DMA timeouts you're seeing are possibly caused by the drive
>>>taking too long when internally suspending the SMART test.
>>>
>>>In most cases, it's safe for SMART tests (short and long) to be run
>>>while the machine is operational, and disk I/O requests are being
>>>performed.  When an I/O request comes and the disk is in the middle of
>>>performing a SMART test, the drive has to stop the SMART test (e.g.
>>>"suspend" it), complete the I/O request, then resume the SMART test.
>>>
>>>The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
>>>doesn't receive an acknowledgement back from the controller (disk)
>>>within 5 seconds, it'll report a timeout on whatever operation it was
>>>performing.  I'm thinking the disk gets stuck in a "do the offline
>>>test, no wait stop there's an I/O request, okay it's done continue the
>>>test, no way stop there's another I/O" loop.
>>
>>Can I make the timeout higher? For the sake of elimination.
> 
> 
> You will have to make modifications to the ata(4) driver code, and
> rebuild+reinstall your kernel.
> 
> There is a patch from the FreeNAS folks which turns the command timeout
> value into a sysctl for tuning, but that patch has not been brought into
> FreeBSD (any version) at this time.  You can find it referenced below
> (see one of the "Workarounds" sections).  You will probably have to
> apply the patch "by hand" rather than blindly using patch < patchfile,
> because the ATA code has changed since the patch was created.
> 
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
> 
> 
>>>Another possibility is that your drive really *does* have a bad block at
>>>LBA 836986454, and that one of those cron/periodic jobs is what's
>>>noticing it, and that upon noticing a bad block, the drive more or less
>>>aborts the SMART test to perform internal remapping of the block.
>>>
>>>To confirm this, you would need to boot the SeaTools utilities from DOS
>>>or from a CD (see Seagate's site) and run a full sector scan (NOT the
>>>"quick" test).  This takes a few hours.  Assuming it comes back clean,
>>>then my above claim of the offline test taking too long to suspend is
>>>probably the case.
>>>
>>>Possibly this is a firmware bug in the drive -- you might consider
>>>mailing Seagate about this problem, although I'm doubting their Tier 1
>>>support will understand what the issue is.
>>>
>>>Is the block number always the same?  Do you only see this error on
>>>Sundays?  These are two questions which might help narrow things down.
>>
>>Nope, the LBA is always different and I see it in the logs once every day.
> 
> 
> Okay, so that greatly diminishes the possibility of it being a bad
> block.  I'd still advocate running SeaTools on the disk to ensure
> everything is 100% okay (re: "sake of elimination"); chances are it will
> pass with flying colours.
> 
> 
>>>>This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>>>>kernel.
>>>>
>>>>Now, does the timeout cause loss of any data? Is there anything besides
>>>>disabling the testing that I can do about it?
>>>
>>>Do you understand what short and long offline tests actually do and what
>>>they're used for?  :-)  If so, you'd know that running them periodically
>>>is more or less silly (IMHO).
>>
>>I do not, not completely :) I think I have just copied the settings from
>>somewhere and only just tweaked it a bit whenever I have added a disk.
> 
> 
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision.  I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.

It looks like a slightly modified version of this example from smartd.conf.sample:

# First (primary) ATA/IDE hard disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)

I am using a similar config without problems:

/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/04) -t -I 194
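
For anyone unsure how to read the -s argument: as far as I understand the smartd.conf man page, smartd builds a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour) and starts a test of type T when the regular expression matches it. The Python sketch below only imitates that matching so a schedule can be checked by hand; it is not smartd's code. The function name due_tests is made up for illustration, and matching against the whole string is my assumption about smartd's behaviour.

```python
import re
from datetime import datetime

# Test type letters as I recall them from smartd.conf(5):
# L = long, S = short, C = conveyance, O = offline (assumption, check your man page).
TEST_TYPES = "LSCO"

def due_tests(schedule, when):
    """Return the test types whose T/MM/DD/d/HH stamp matches the schedule regex.

    Hypothetical helper: it mimics the documented matching rule, nothing more.
    """
    due = []
    for test_type in TEST_TYPES:
        # isoweekday() gives 1 for Monday through 7 for Sunday.
        stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, stamp):
            due.append(test_type)
    return due

if __name__ == "__main__":
    ad4 = r"(S/../.././01|L/../../(3|6)/05)"
    print(due_tests(ad4, datetime(2008, 10, 27, 1)))  # a Monday, 01:xx
    print(due_tests(ad4, datetime(2008, 10, 29, 5)))  # a Wednesday, 05:xx
```

Read this way, the ad4 line asks for a short test every day in the 01:00 hour and a long test on Wednesdays and Saturdays in the 05:00 hour.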

Miroslav Lachman
