From owner-freebsd-questions@FreeBSD.ORG Thu Jun 19 10:11:59 2008
From: "Daniel Eriksson" <daniel_k_eriksson@telia.com>
To: freebsd-questions@freebsd.org
Cc: ryan.coleman@cwis.biz
Subject: RE: "Fixing" a RAID
Date: Thu, 19 Jun 2008 11:02:14 +0200
Message-ID: <4F9C9299A10AE74E89EA580D14AA10A61A1947@royal64.emp.zapto.org>
In-Reply-To: <2812.71.63.150.244.1213842028.squirrel@www.pictureprints.net>

Ryan Coleman wrote:

> Jun  4 23:02:28 testserver kernel: ar0: 715425MB RAID5 (stripe 64 KB) status: READY
> Jun  4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
> Jun  4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
> Jun  4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
> Jun  4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
> Jun  4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
> ...

My guess is that the rebuild failure is due to unreadable sectors on one (or more) of the original three drives.

I recently had this happen to me on an 8 x 1 TB RAID-5 array on a Highpoint RocketRAID 2340 controller. For some unknown reason two drives developed unreadable sectors within hours of each other. To make a long story short, the way I "fixed" this was to:

1. Used a tool I got from Highpoint tech support to re-initialize the array information (so the array was no longer marked as broken).

2. Unplugged both drives and hooked them up to another computer using a regular SATA controller.

3. Put one of the drives through a complete "recondition" cycle (a).

4. Put the other drive through a partial "recondition" cycle (b).

5. Hooked both drives up to the 2340 controller again. The BIOS immediately marked the array as degraded (because it did not recognize the wiped drive as part of the array), and I could re-add the wiped drive so a rebuild of the array could start.

6. Finally ran a "zpool scrub" on the tank and restored the few files that had checksum errors.

(a) I tried to run a SMART long selftest, but it failed. I then completely wiped the drive by writing zeroes to the entire surface, allowing the firmware to remap the bad sectors. After this procedure the long selftest succeeded. I finally used a diagnostic program from the drive vendor (Western Digital) to verify once more that the drive was working properly.
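In practice the wipe-and-retest in (a) boils down to something like this (a rough sketch only; it assumes smartmontools from ports and uses /dev/ad13 purely as a placeholder for the disk being reconditioned):

  # WARNING: destroys all data on the disk; double-check the device name.
  # Zero the whole surface so the firmware can remap any bad sectors.
  dd if=/dev/zero of=/dev/ad13 bs=1m

  # Then run a SMART extended (long) selftest; it runs in the background,
  # so check the selftest log once it has finished.
  smartctl -t long /dev/ad13
  smartctl -l selftest /dev/ad13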
(b) The SMART long selftest failed the first time, but after running a surface scan using the diagnostic program from Western Digital the selftest passed. I'm pretty sure the diagnostic program remapped the bad sector, replacing it with a blank one; at least the program warned me to back up all data before starting the surface scan. Alternatively, I could have used dd (with an offset) to write to just the failed sector (the LBA is available in the SMART selftest log).

If I were you I would run all three drives through a SMART long selftest. I'm sure you'll find that at least one of them will fail. Use something like SpinRite 6 to recover the drive, or use dd / dd_rescue to copy the data to a fresh drive (a rough sketch of the commands is included below). Once all three of the original drives pass a long selftest, the array should be able to finish a rebuild using a fourth (blank) drive.

By the way, don't try to use SpinRite 6 on 1 TB drives; it will fail halfway through with a division-by-zero error. I haven't tried it on any 500 GB drives yet.

/Daniel Eriksson
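A minimal sketch of the selftest / dd commands mentioned above (assuming smartmontools from ports, 512-byte sectors, and purely illustrative device names and LBA values; adjust everything before running it):

  # Start a long selftest on each of the original drives; the test runs in
  # the background, so check the log once it has finished.
  smartctl -t long /dev/ad13
  smartctl -l selftest /dev/ad13   # "LBA_of_first_error" shows the failed sector

  # Option 1: overwrite just the failed sector so the firmware remaps it
  # (destroys the 512 bytes at that LBA; 1234567 is only an example value).
  dd if=/dev/zero of=/dev/ad13 bs=512 count=1 seek=1234567

  # Option 2: copy whatever is still readable onto a fresh drive, padding
  # unreadable blocks with zeroes; a smaller bs loses less data around bad
  # spots, and dd_rescue from ports handles read errors more gracefully.
  dd if=/dev/ad13 of=/dev/ad18 bs=64k conv=noerror,sync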