From owner-freebsd-questions@FreeBSD.ORG  Mon Nov 12 16:17:58 2007
Date: Mon, 12 Nov 2007 11:14:16 -0500
From: Jerry McAllister <jerrymc@msu.edu>
To: David Newman <dnewman@networktest.com>
Message-ID: <20071112161416.GB98697@gizmo.acns.msu.edu>
References: <4736593E.1090905@networktest.com>
	<64c038660711102109x2ea186afjdd219292d8eed700@mail.gmail.com>
	<47372644.4060201@networktest.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <47372644.4060201@networktest.com>
User-Agent: Mutt/1.4.2.2i
Cc: freebsd-questions@freebsd.org
Subject: Re: dealing with a failing drive

On Sun, Nov 11, 2007 at 07:56:52AM -0800, David Newman wrote:

> On 11/10/07 9:09 PM, Modulok wrote:
> >>> I'd welcome suggestions on how (or whether) to try to revive a SCSI
> > drive that's failing.
> > 
> > It depends on how valuable the data on the array is, and more
> > importantly, how much funding you have at your disposal to fix the
> > problem. If it were me, I would set aside the bad disk, connect a new
> > disk to the card and re-synchronize the array. (Assuming one of the
> > members still retains a good copy of the data.) Afterwards I would
> > destroy, or toss the existing disk in the trash can (depending on the
> > sensitivity of the data stored on it.)
> 
> Thanks for your reply.
> 
> An update: After doing what you suggest (leaving in the "good" disk,
> adding a new disk, RAID rebuilding) I still got soft write errors --
> with *either one* of the disks I tried.
> 
> Then I tried putting both disks in an identical server and they came up
> fine, no read or write errors.
> 
> Ergo, the RAID controller is bad and the disks may be OK.

Probably not.
Generally, if the RAID controller is bad, you will see errors
all over and not in just one place, though I suppose it is possible.
Check what it reports as error locations and see whether they
move around at all.
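
One way to make that comparison, as a minimal sketch only: it assumes
you have already noted down the reported sector numbers from each test
pass by hand. The function name and the input format are hypothetical
and are not something the controller or FreeBSD provides.

```python
def compare_error_runs(runs):
    """Split reported error sectors into stable and wandering sets.

    runs: a list of iterables of sector numbers, one iterable per
    test pass (hypothetical input, collected by hand).
    Returns (stable, wandering): sectors reported in every pass, and
    sectors reported in some passes but not all.
    """
    sets = [set(r) for r in runs]
    if not sets:
        return set(), set()
    stable = set.intersection(*sets)
    wandering = set.union(*sets) - stable
    return stable, wandering


if __name__ == "__main__":
    # Made-up sector numbers, for illustration only.
    stable, wandering = compare_error_runs([[100, 200], [100, 300]])
    print("same place every pass:", sorted(stable))
    print("moving around:", sorted(wandering))
```

Errors that stay in the stable set point more toward the media or the
data itself; errors that mostly land in the wandering set look more
like a controller, cable, or bus problem.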

A soft error is usually one that can be corrected within the limits
of rereads and any error correction that the system is using.  It
may be that the error was introduced while the problems with the old
disk were occurring, so that an error was written onto the other,
supposedly good disk and then mirrored to the new disk.  Errors can
be preserved by mirroring too.

Having said that, I don't know where this error is from.  Try reading
the data that is in the spot getting the error, rewriting it, and then
reading it back from the new location.   It is pretty hard to figure out
and specifically rewrite one certain block on modern systems because
the drive itself maps sector numbers to physical locations.   Although
you would expect the same sector number to be in the same place from
one write to the next, if that sector gets remapped due to an error,
then it will actually be at a different physical location the next time
and you don't really prove anything.   But it is worth experimenting
with if you want.

You can dd from and to any sector on the partition by carefully
using skip counts (seek counts on the output side) and block counts.
But you have to figure out the location (sector number) first.
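
As a rough sketch of how the skip/seek and count values line up, here
is a small helper that only builds the dd command lines and does not
run anything. It assumes 512-byte sectors; the device name /dev/da0s1a,
the scratch file name, and the function names are hypothetical, so
substitute your own and double-check before running a write.

```python
def dd_read_args(device, sector, count=1, sector_size=512,
                 outfile="sector.bin"):
    """Build a dd command that copies `count` sectors, starting at
    `sector`, from the device into a scratch file.
    skip= counts input blocks of size bs= to pass over before reading."""
    if sector < 0 or count < 1:
        raise ValueError("sector must be >= 0 and count >= 1")
    return ["dd", f"if={device}", f"of={outfile}",
            f"bs={sector_size}", f"skip={sector}", f"count={count}"]


def dd_write_args(device, sector, count=1, sector_size=512,
                  infile="sector.bin"):
    """Build a dd command that writes the scratch file back to the
    same sector. seek= counts output blocks of size bs= to pass over
    before writing."""
    if sector < 0 or count < 1:
        raise ValueError("sector must be >= 0 and count >= 1")
    return ["dd", f"if={infile}", f"of={device}",
            f"bs={sector_size}", f"seek={sector}", f"count={count}"]


if __name__ == "__main__":
    # Hypothetical device and sector number, for illustration only.
    print(" ".join(dd_read_args("/dev/da0s1a", 12345)))
    print(" ".join(dd_write_args("/dev/da0s1a", 12345)))
```

With bs set to the sector size, the skip or seek value is simply the
sector number relative to the start of the partition you name.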

Good luck,

////jerry

> 
> >>> Is there some other way to:
> >>> b)monitor the health of disks on a Compaq controller so it doesn't
> > get to this point to begin with?
> > 
> > There are various tools out there that attempt to 'monitor' the
> > condition of disk drives to try and predict when failure is imminent.
> > For valuable data, it is safer to set up a mirror and simply toss out
> > bad disks as they fail. For extremely valuable data use a 3 disk
> > array. With a 3 disk setup you will still be covered in the event that
> > an additional disk craps out during the re-sync.
> > 
> > To quote Google's article on disk failure, regarding SMART:
> 
> Right, I've heard it said that "SMART isn't."
> 
> Nonetheless, I'd appreciate any suggestions to monitor the health of
> disks -- and RAID controllers too -- on HP Proliant servers running FreeBSD.
> 
> thanks again.
> 
> dn
> 
> 