From owner-freebsd-stable@FreeBSD.ORG Tue Oct 14 11:12:42 2003
Message-ID: <3F8C3C97.3050405@whatistruth.net>
Date: Tue, 14 Oct 2003 11:12:39 -0700
From: DavidB
To: freebsd-stable@freebsd.org
Subject: Re: ATA failure with 4.6.2 & 250GB drive?

Kevin Oberman wrote:

>> Date: Tue, 14 Oct 2003 09:55:54 +0100
>> From: Scott Mitchell
>> Sender: owner-freebsd-stable@freebsd.org
>>
>> On Mon, Oct 13, 2003 at 10:09:10AM +0100, Scott Mitchell wrote:
>>
>>> Hi all,
>>>
>>> Just installed a Maxtor 250GB PATA drive in one of our servers, to be used
>>> as a backup staging area. This was actually a replacement for an identical
>>> drive that appeared to have died after a month of service.
>>>
>>> Anyway, 2 days after this drive was installed I start seeing this in the
>>> daily logs:
>>>
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) trying PIO mode
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>> ad1s1e: hard error reading fsbn 850845887 of 425422912-425422943
>>>> (ad1s1 bn 850845887; cn 52962 tn 180 sn 17) status=59 error=40
>>>
>>> ...
>>
>> OK, swapped out the cable (from an 80- to 40-wire one, as it happened,
>> although that should make no difference on a UDMA33 controller). Same
>> errors appeared again while the backups were running.
>>
>> Some more information on how this drive is being used - we're dumping two
>> vinum RAID5 volumes onto it, one local and one remote, writing to the
>> backup disk over NFS. Both dumps kick off at 0300, with the remote one
>> finishing at 0305 last night. The first ATA error appeared in the logs at
>> 0325, while the local backup was still running. The last error was logged
>> at 0355, but the backup itself didn't finish until nearly 0500.
>>
>> Anyone have any more ideas on how to diagnose this? It does occur to me
>> that the daily periodic run also kicks off at 0301 but that is usually all
>> done before 0330.
>
> It's a real drive problem, but possibly not a terminal one. (I had the
> same issue on one of my drives a few months ago and it's fine now.)
>
> The problem is that the system is getting an error trying to read this
> area of the disk. It's an unmapped bunch of bad blocks. The system
> gets an unrecoverable error trying to read these blocks and that is
> what you see reported.
> Since it can't read "good" data, it does not
> relocate the bad data, but just leaves it there and reports errors
> every time it tries to read the data.
>
> First, any files containing data stored in these blocks are probably
> toast. Or, at least garbled. Sorry.
>
> The fix/workaround is to move the file(s) involved so that the damaged
> blocks are marked free and relocated to spare space on the drive. You
> can try to figure out just which file(s) use those blocks. There
> might even be a reasonable way to do this...I just don't know what it
> is.
>
> Another "fix" is to simply copy the drive onto another and then copy it
> back. dd(1) will do the trick as will dump/restore. (I'd suggest the
> dump/restore to copy the data out and dd to copy it back if the disks
> have identical geometries.) Once the data is restored to the original
> disk, the bad blocks will have been re-directed by the drive and will
> no longer trouble you.
>
> Modern disks are pretty smart at error recovery, but some failures are
> too sudden for the drive to be able to deal with them without losing
> data.

Regarding a fix: I had similar read error messages not long ago when
dumping to tape, and wondered what they could mean. So I went to the
hard drive manufacturer's website and downloaded a DOS tool to
scan/repair the hard drive.

Just to note an issue: I had one boot disk to check the Hitachi (HGST)
drive in my laptop and one for the Western Digital, which was the drive
of concern. For some reason I used the Hitachi utility on the WD, which
was a good thing, because it reported bad blocks but wouldn't fix them,
since it recognized that it wasn't their drive. Then I ran the WD
utility from the boot disk I had created for it (this is why it was a
good thing): it did the scan and reported NO issues. I checked its logs
to see if it had logged fixing any problems. The utility's logs said
the drive had no issues.
Just to double check, I re-ran the HGST tool and this time it didn't
find any bad blocks. Hmm. Those knuckle-heads at Western Digital made
the utility fix the bad blocks silently. I find this underhanded
because you might not find out a disk is going bad until it is totally
failing. Hmm, I wonder if this helps get it past the warranty before
the drive completely fails. [Ya know, when 1: bad blocks show up,
2: you clean 'em up, 3: return to step 1, the drive will die the death
in the near future.]

So you can usually get utilities from the manufacturer {at least WD,
HGST, and Seagate} to do some subset of: turn S.M.A.R.T. on and off,
exercise the hard drive, scan for errors, repair errors, low-level
format, .... That is, if you don't mind booting from a DOS boot disk to
run the tool. With WD it is a little confusing which one to grab. But
note, as I found out, some manufacturers might silently repair certain
issues.

Hope this helps,
David
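
P.S. For anyone trying to read the "status=59 error=40" part of the
quoted log lines, here is a minimal Python sketch of how those values
might be decoded. It is not taken from the FreeBSD ata driver. It
assumes the two numbers are printed in hexadecimal and uses the classic
ATA register bit names; the function names (decode_bits, chs_to_block)
are made up for illustration. The 255-head, 63-sector geometry is also
an assumption, although it does reproduce the block number in the log
from the cn/tn/sn values when sn is counted from zero.

```python
# Classic ATA status register bits, highest bit first.
STATUS_BITS = {
    0x80: "BSY",    # device busy
    0x40: "DRDY",   # device ready
    0x20: "DF",     # device fault
    0x10: "DSC",    # seek complete
    0x08: "DRQ",    # data request
    0x04: "CORR",   # corrected data
    0x02: "IDX",    # index
    0x01: "ERR",    # error; details are in the error register
}

# Classic ATA error register bits, highest bit first.
ERROR_BITS = {
    0x80: "ICRC/BBK",  # interface CRC error / bad block (older drives)
    0x40: "UNC",       # uncorrectable data error
    0x20: "MC",        # media changed
    0x10: "IDNF",      # sector ID not found
    0x08: "MCR",       # media change requested
    0x04: "ABRT",      # command aborted
    0x02: "TK0NF",     # track 0 not found
    0x01: "AMNF",      # address mark not found
}


def decode_bits(value, names):
    """Return the names of all bits set in value, highest bit first."""
    return [name for mask, name in names.items() if value & mask]


def chs_to_block(cn, tn, sn, heads=255, sectors=63):
    """Convert cylinder/track/sector to a block number.

    Assumes a 255 x 63 geometry and a zero-based sector number,
    neither of which is stated in the log itself.
    """
    return (cn * heads + tn) * sectors + sn


if __name__ == "__main__":
    # Values from the log, read as hex (an assumption).
    print(decode_bits(0x59, STATUS_BITS))  # ['DRDY', 'DSC', 'DRQ', 'ERR']
    print(decode_bits(0x40, ERROR_BITS))   # ['UNC']
    print(chs_to_block(52962, 180, 17))    # 850845887
```

If the hex assumption holds, error=40 is the UNC (uncorrectable data)
bit, which fits Kevin's description of an unrecoverable read error.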