From: Peter Jeremy
Date: Sat, 18 Dec 2004 20:17:39 +1100
To: Gary Corcoran
Cc: freebsd-hackers@freebsd.org
Subject: Re: Multiple hard disk failures - coincidence ?

On Sat, 2004-Dec-18 02:03:09 -0500, Gary Corcoran wrote:
>I've just had *THREE* Maxtor 250GB hard disk failures on my
>FreeBSD 4.10 server within a matter of days.  One I could
>attribute to actual failure.  Two made me suspicious.  Three
>has me wondering if this is some software problem... (or
>a conspiracy (just kidding) ;-) )

It seems unlikely that faulty server software could cause a disk
failure.  One possibility is that your power supply is a bit stressed
and the supply rails are out of tolerance.  The other possibility is
that the drives are overheating.  Higher density drives will be more
sensitive to both heat and dirty power.

> I suppose it
>is possible these errors may have shown up more than a week or
>two ago, because my windows machines, reaching them via samba,
>haven't shown any problems until today, and of course with almost
>750GB of data, it's not all accessed over a short time span.

My approach to this is to add a line similar to
	dd if=/dev/ad0 of=/dev/null bs=32k
for each disk into /etc/daily.local (or /etc/weekly.local or whatever).
This verifies on a regular basis that every block on each disk can
still be read.

>P.S. I *can't* be the first person to run into this problem:
>When one gets a "hard error" reported for a certain block number,
>how does one find out exactly *which* file or directory is now
>unreadable?  With hundreds of thousands of megabytes on one disk,
>a manual search is not practical - somebody must have written a
>program to 'backtrack' a block number to a particular file name
>- no?

I know I've done this in the past but I don't recall exactly how.
About all you can do is search through the inode list for the relevant
blocks and then map the inode numbers to file names.

-- 
Peter Jeremy
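
As a rough sketch of the periodic read check described above (this is not the
original poster's script), the dd line could be wrapped in a small loop that
names the disk that failed.  The function names read_check and check_disks,
and the device names in the usage comment, are assumptions made for
illustration only.

```shell
#!/bin/sh
# Hypothetical helper for /etc/daily.local: read each listed disk from
# start to end and report any disk that dd could not read completely.

# read_check DEVICE
# Returns 0 if dd read the whole device, non-zero otherwise.
read_check() {
    dd if="$1" of=/dev/null bs=32k 2>/dev/null
}

# check_disks DEVICE...
# Prints one line per device that failed; returns 1 if any failed.
check_disks() {
    failed=0
    for disk in "$@"; do
        if ! read_check "$disk"; then
            echo "read check FAILED: $disk"
            failed=1
        fi
    done
    return $failed
}

# Example call (device names are assumptions; adjust to the local system):
#   check_disks /dev/ad0 /dev/ad1
```

A full read of a large disk takes a long time and competes with other I/O,
so a weekly run may be a more practical choice than a daily one.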