From owner-freebsd-hackers@FreeBSD.ORG Sat Dec 18 17:09:02 2004
Message-ID: <41C46426.3090900@elischer.org>
Date: Sat, 18 Dec 2004 09:08:54 -0800
From: Julian Elischer
To: Peter Jeremy
cc: freebsd-hackers@freebsd.org
cc: Gary Corcoran
Subject: Re: Multiple hard disk failures - coincidence ?
References: <41C3D62D.7000808@comcast.net> <20041218091739.GC97121@cirb503493.alcatel.com.au>
In-Reply-To: <20041218091739.GC97121@cirb503493.alcatel.com.au>
List-Id: Technical Discussions relating to FreeBSD

Peter Jeremy wrote:

> On Sat, 2004-Dec-18 02:03:09 -0500, Gary Corcoran wrote:
>
>>I've just had *THREE* Maxtor 250GB hard disk failures on my
>>FreeBSD 4.10 server within a matter of days. One I could
>>attribute to actual failure. Two made me suspicious. Three
>>has me wondering if this is some software problem... (or
>>a conspiracy (just kidding) ;-) )
>
>
> Seems unlikely that faulty server software could cause a disk failure.
> One possibility is that your power supply is a bit stressed and the
> supply rails are out of tolerance. The other possibility is that the
> drives are overheating. Higher density drives will be more sensitive
> to both heat and dirty power.
>
>
>> I suppose it
>>is possible these errors may have shown up more than a week or
>>two ago, because my windows machines, reaching them via samba,
>>haven't shown any problems until today, and of course with almost
>>750GB of data, it's not all accessed over a short time span.
>
>
> My approach to this is to add a line similar to
>     dd if=/dev/ad0 of=/dev/null bs=32k
> for each disk into /etc/daily.local (or /etc/weekly.local or whatever).
> This ensures that the disks are readable on a regular basis.
>
>
>>P.S. I *can't* be the first person to run into this problem:
>>When one gets a "hard error" reported for a certain block number,
>>how does one find out exactly *which* file or directory is now
>>unreadable? With hundreds of thousands of megabytes on one disk,
>>a manual search is not practical - somebody must have written a
>>program to 'backtrack' a block number to a particular file name
>>- no?
>

I generally do a

    tar cf /dev/null /mountpoint

We have some tools that do the reverse: they tell you what blocks are
in a file. It should be possible to modify fsck to do the inverse,
something like:

    fsck -n --findblocks 234234,56546,2342342

>
> I know I've done this in the past but I don't recall exactly how.
> About all you can do is search through the inode list for the
> relevant blocks and then map the inode numbers to file names.
>
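
For what it's worth, the file-level pass can be made a little more
explicit than the tar run over the mountpoint. The following is only a
rough sketch, not an existing tool: check_readable and scan_tree are
names made up here for illustration. It reads every regular file on one
filesystem with dd and prints the names of the files that could not be
read in full. It assumes file names contain no newlines, and it will
not notice a bad block that sits in free space or in filesystem
metadata.

```shell
#!/bin/sh
# Sketch only: check_readable and scan_tree are hypothetical helper
# names, not system utilities.

# Read one file from start to end and discard the data.
# Returns non-zero if dd could not open or fully read the file.
check_readable() {
    dd if="$1" of=/dev/null bs=32k 2>/dev/null
}

# Walk a directory tree without crossing into other filesystems
# (-xdev) and print every regular file that fails the read check.
# Assumption: file names do not contain newline characters.
scan_tree() {
    find "$1" -xdev -type f -print | while IFS= read -r f; do
        if ! check_readable "$f"; then
            printf 'unreadable: %s\n' "$f"
        fi
    done
}
```

With the functions loaded into a shell, "scan_tree /mountpoint" would
list the suspect files, one per line. Like the dd line for
/etc/daily.local, it reads the whole filesystem once, so on a disk of
this size it is something to run overnight.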