Date:      Tue, 14 Oct 2003 11:12:39 -0700
From:      DavidB <david@whatistruth.net>
To:        freebsd-stable@freebsd.org
Subject:   Re: ATA failure with 4.6.2 & 250GB drive?
Message-ID:  <3F8C3C97.3050405@whatistruth.net>

Kevin Oberman wrote:

>> Date: Tue, 14 Oct 2003 09:55:54 +0100
>> From: Scott Mitchell <scott+freebsd@fishballoon.org>
>> Sender: owner-freebsd-stable@freebsd.org
>>
>> On Mon, Oct 13, 2003 at 10:09:10AM +0100, Scott Mitchell wrote:
>>
>>> Hi all,
>>>
>>> Just installed a Maxtor 250GB PATA drive in one of our servers, to be used
>>> as a backup staging area.  This was actually a replacement for an identical
>>> drive that appeared to have died after a month of service.
>>>
>>> Anyway, 2 days after this drive was installed I start seeing this in the
>>> daily logs:
>>>
>>>
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) trying PIO mode
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>
>>>
>>> ...
>>
>>
>> OK, swapped out the cable (from an 80- to 40-wire one, as it happened,
>> although that should make no difference on a UDMA33 controller).  Same
>> errors appeared again while the backups were running.
>>
>> Some more information on how this drive is being used - we're dumping two
>> vinum RAID5 volumes onto it, one local and one remote, writing to the
>> backup disk over NFS.  Both dumps kick off at 0300, with the remote one
>> finishing at 0305 last night.  The first ATA error appeared in the logs at
>> 0325, while the local backup was still running.  The last error was logged
>> at 0355, but the backup itself didn't finish until nearly 0500.
>>
>> Anyone have any more ideas on how to diagnose this?  It does occur to me
>> that the daily periodic run also kicks off at 0301 but that is usually all
>> done before 0330.
>
>
>
> It's a real drive problem, but possibly not a terminal one. (I had the
> same issue on one of my drives a few months ago and it's fine now.)
>
> The problem is that the system is getting an error trying to read this
> area of the disk. It's an unmapped bunch of bad blocks. The system
> gets an unrecoverable error trying to read these blocks and that is
> what you see reported. Since it can't read "good" data, it does not
> relocate the bad data, but just leaves it there and reports errors
> every time it tries to read the data.
>
> First, any files containing data stored in these blocks are probably
> toast. Or, at least garbled. Sorry.
>
> The fix/workaround is to move the file(s) involved so that the damaged
> blocks are marked free and relocated to spare space on the drive. You
> can try to figure out just which file(s) use those blocks. There
> might even be a reasonable way to do this...I just don't know what it
> is.
>
> Another "fix" is to simply copy the drive onto another and then copy it
> back. dd(1) will do the trick as will dump/restore. (I'd suggest the
> dump/restore to copy the data out and dd to copy it back if the disks
> have identical geometries.) Once the data is restored to the original
> disk, the bad blocks will have been re-directed by the drive and will
> no longer trouble you.
>
> Modern disks are pretty smart at error recovery, but some failures are
> too sudden for the drive to be able to deal with them without losing
> data. 
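
For reference, the "status=59 error=40" pair in the quoted log lines can
be decoded bit by bit. The sketch below assumes the driver prints both
values in hexadecimal and uses the bit names from the ATA status and
error register definitions; the function names are made up for
illustration. Under those assumptions the error value is the UNC
(uncorrectable data) bit, which fits Kevin's reading of an unrecoverable
read error.

```python
import re

# Bit names from the ATA status and error register definitions.
STATUS_BITS = {
    0x80: "BSY",   # device busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete
    0x08: "DRQ",   # data request
    0x04: "CORR",  # corrected data
    0x02: "IDX",   # index
    0x01: "ERR",   # error, see the error register
}

ERROR_BITS = {
    0x80: "ICRC",   # interface CRC error (bad block on older drives)
    0x40: "UNC",    # uncorrectable data error
    0x20: "MC",     # media changed
    0x10: "IDNF",   # sector ID not found
    0x08: "MCR",    # media change requested
    0x04: "ABRT",   # command aborted
    0x02: "TK0NF",  # track 0 not found
    0x01: "AMNF",   # address mark not found
}


def decode(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in names.items() if value & bit]


def decode_log_line(line):
    """Hypothetical helper: pull status/error out of one log line.

    Assumes both numbers are hexadecimal. Returns None if the line
    carries no status/error pair.
    """
    match = re.search(r"status=([0-9a-fA-F]+) error=([0-9a-fA-F]+)", line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    error = int(match.group(2), 16)
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)


if __name__ == "__main__":
    sample = "(ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40"
    print(decode_log_line(sample))
```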


Regarding a fix:

I had similar read error messages not long ago when dumping to tape, and
wondered what they could mean. So I went to the hard drive
manufacturer's website and downloaded a DOS tool to scan and repair the
hard drive.

Just to note an issue:  I had two boot disks, one to check the Hitachi
(HGST) drive in my laptop and one for the Western Digital, which was the
drive of concern. For some reason I ran the Hitachi utility on the WD
first, which turned out to be a good thing: it reported bad blocks but
wouldn't fix them because it recognized that it wasn't their drive. Then
I ran the WD utility from the boot disk I had created for it. It did the
scan and reported NO issues. I checked its logs to see if it had logged
fixing any problems, and the logs said the drive had no issues.
Just to double check I re-ran the HGST tool, and this time it didn't find
any bad blocks.  Hmm. Those knuckle-heads at Western Digital made the
utility fix the bad blocks silently.  I find this underhanded, because
you might not find out that a disk is going bad until it is totally
failing. I wonder if this helps get a drive past the warranty before it
completely fails.
[You know that when 1: bad blocks show up, 2: you clean 'em up, 3: return
to step 1, the drive will die in the near future.]

So you can usually get utilities from the manufacturer (at least WD,
HGST, and Seagate) that do some subset of: turning S.M.A.R.T. on and
off, exercising the hard drive, scanning for errors, repairing errors,
low-level formatting, ...

That is, if you don't mind booting from a DOS boot disk to run the tool.
WD's site is a little confusing about which one to grab.  But note, as I
found out, that some manufacturers might silently repair certain issues.
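
If booting DOS is not an option, one rough way to find which files sit
on the unreadable blocks (the step Kevin says he doesn't know a good way
to do) is to read every file on the affected filesystem and note the
ones that fail with an I/O error. This is only a sketch: the function
name and the /backup mount point are made up for illustration, data
already in the buffer cache may be served without touching the disk, and
bad blocks in free space or metadata won't show up this way.

```python
import errno
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every regular file under root; return paths that hit EIO."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, FIFOs, devices and anything else unusual.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    # Read and discard the contents in chunks.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                # EIO is what a failed disk read normally surfaces as.
                # Other errors (permissions etc.) are ignored here.
                if exc.errno == errno.EIO:
                    bad.append(path)
    return bad


if __name__ == "__main__":
    # "/backup" is a hypothetical mount point for the affected slice.
    for path in find_unreadable_files("/backup"):
        print(path)
```

Any file this lists is a candidate for the move or restore-from-backup
treatment described in the quoted message.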

Hope this helps,
David


