Date: Tue, 14 Oct 2003 09:02:14 -0700
From: "Kevin Oberman" <oberman@es.net>
To: Scott Mitchell <scott+freebsd@fishballoon.org>
Cc: freebsd-stable@freebsd.org
Subject: Re: ATA failure with 4.6.2 & 250GB drive?
Message-ID: <20031014160214.C51925D07@ptavv.es.net>
In-Reply-To: Message from Scott Mitchell <scott+freebsd@fishballoon.org> <20031014085554.GC84877@llama.fishballoon.org>

> Date: Tue, 14 Oct 2003 09:55:54 +0100
> From: Scott Mitchell <scott+freebsd@fishballoon.org>
> Sender: owner-freebsd-stable@freebsd.org
>
> On Mon, Oct 13, 2003 at 10:09:10AM +0100, Scott Mitchell wrote:
> > Hi all,
> >
> > Just installed a Maxtor 250GB PATA drive in one of our servers, to be used
> > as a backup staging area. This was actually a replacement for an identical
> > drive that appeared to have died after a month of service.
> >
> > Anyway, 2 days after this drive was installed I start seeing this in the
> > daily logs:
> >
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) trying PIO mode
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > > ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943 (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
> > ...
>
> OK, swapped out the cable (from an 80- to 40-wire one, as it happened,
> although that should make no difference on a UDMA33 controller). Same
> errors appeared again while the backups were running.
>
> Some more information on how this drive is being used - we're dumping two
> vinum RAID5 volumes onto it, one local and one remote, writing to the
> backup disk over NFS. Both dumps kick off at 0300, with the remote one
> finishing at 0305 last night. The first ATA error appeared in the logs at
> 0325, while the local backup was still running. The last error was logged
> at 0355, but the backup itself didn't finish until nearly 0500.
>
> Anyone have any more ideas on how to diagnose this? It does occur to me
> that the daily periodic run also kicks off at 0301 but that is usually all
> done before 0330.

It's a real drive problem, but possibly not a terminal one.
(I had the same issue on one of my drives a few months ago and it's fine
now.)

The problem is that the system is getting an error trying to read this
area of the disk. It's an unmapped bunch of bad blocks. The system gets an
unrecoverable error trying to read these blocks and that is what you see
reported. Since it can't read "good" data, it does not relocate the bad
data, but just leaves it there and reports errors every time it tries to
read the data.

First, any files containing data stored in these blocks are probably
toast. Or at least garbled. Sorry.

The fix/workaround is to move the file(s) involved so that the damaged
blocks are marked free and relocated to spare space on the drive.

You can try to figure out just which file(s) use those blocks. There might
even be a reasonable way to do this... I just don't know what it is.

Another "fix" is to simply copy the drive onto another and then copy it
back. dd(1) will do the trick, as will dump/restore. (I'd suggest
dump/restore to copy the data out and dd to copy it back if the disks
have identical geometries.) Once the data is restored to the original
disk, the bad blocks will have been redirected by the drive and will no
longer trouble you.

Modern disks are pretty smart at error recovery, but some failures are too
sudden for the drive to be able to deal with them without losing data.

-- 
R. Kevin Oberman, Network Engineer
Energy Sciences Network (ESnet)
Ernest O. Lawrence Berkeley National Laboratory (Berkeley Lab)
E-mail: oberman@es.net			Phone: +1 510 486-8634
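
On the question of working out which file(s) sit on the unreadable blocks:
one rough approach, sketched here as an assumption rather than anything
taken from the thread, is to read every file on the affected filesystem
from start to finish and note the ones where the read fails. The function
name find_unreadable_files and the 1 MB chunk size are made up for
illustration. It only finds files whose data blocks fail to read; it does
not map a block number from the log to a file, and data already held in
the OS buffer cache may be returned without touching the disk.

```python
import errno
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Read every regular file under root and collect the ones that fail.

    Returns a list of (path, byte_offset, errno_name) tuples, where
    byte_offset is how many bytes were read successfully before the error.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Only regular files: skip symlinks, devices, FIFOs and sockets.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            offset = 0
            try:
                # buffering=0 avoids Python-level buffering; the OS may
                # still cache data on its own.
                with open(path, "rb", buffering=0) as fh:
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        offset += len(chunk)
            except OSError as exc:
                # A failing disk read normally surfaces as EIO.
                err_name = errno.errorcode.get(exc.errno, str(exc.errno))
                failures.append((path, offset, err_name))
    return failures
```

Calling find_unreadable_files("/backup") (a hypothetical mount point for
the backup disk) returns the paths that could not be read in full, along
with roughly how far into each file the read got. Those are the candidates
to restore from another copy.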