Date: Thu, 19 Jun 2008 11:02:14 +0200
From: "Daniel Eriksson" <daniel_k_eriksson@telia.com>
To: <freebsd-questions@freebsd.org>
Cc: ryan.coleman@cwis.biz
Subject: RE: "Fixing" a RAID
Message-ID: <4F9C9299A10AE74E89EA580D14AA10A61A1947@royal64.emp.zapto.org>
In-Reply-To: <2812.71.63.150.244.1213842028.squirrel@www.pictureprints.net>
References: <2812.71.63.150.244.1213842028.squirrel@www.pictureprints.net>
Ryan Coleman wrote:

> Jun 4 23:02:28 testserver kernel: ar0: 715425MB <HighPoint v3 RocketRAID> RAID5 (stripe 64 KB) status: READY
> Jun 4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
> Jun 4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
> Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
> ...

My guess is that the rebuild failure is due to unreadable sectors on one (or more) of the original three drives.

I recently had this happen to me on an 8 x 1 TB RAID-5 array on a Highpoint RocketRAID 2340 controller. For some unknown reason two drives developed unreadable sectors within hours of each other. To make a long story short, the way I "fixed" this was to:

1. Use a tool I got from Highpoint tech support to re-init the array information (so the array was no longer marked as broken).

2. Unplug both drives and hook them up to another computer using a regular SATA controller.

3. Put one of the drives through a complete "recondition" cycle (a).

4. Put the other drive through a partial "recondition" cycle (b).

5. Hook both drives up to the 2340 controller again. The BIOS immediately marked the array as degraded (because it didn't recognize the wiped drive as part of the array), and I could re-add the wiped drive so a rebuild of the array could start.

6. Finally, run a "zpool scrub" on the tank and restore the few files that had checksum errors.

(a) I tried to run a SMART long selftest, but it failed. I then completely wiped the drive by writing zeroes to the entire surface, allowing the firmware to remap the bad sectors. After this the long selftest succeeded. I finally used a diagnostic program from the drive vendor (Western Digital) to verify once more that the drive was working properly.

(b) The SMART long selftest failed the first time, but after running a surface scan with the Western Digital diagnostic program the selftest passed. I'm pretty sure the diagnostic program remapped the bad sector, replacing it with a blank one; at least it warned me to back up all data before starting the surface scan. Alternatively I could have used dd (with an offset) to write to just the failed sector (its address is available in the SMART selftest log).

If I were you I would run all three original drives through a SMART long selftest. I'm sure you'll find that at least one of them fails. Use something like SpinRite 6 to recover the drive, or use dd / dd_rescue to copy the data to a fresh drive. Once all three original drives pass a long selftest, the array should be able to finish a rebuild using a fourth (blank) drive.

By the way, don't try to use SpinRite 6 on 1 TB drives; it will fail halfway through with a division-by-zero error. I haven't tried it on any 500 GB drives yet.

/Daniel Eriksson
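A rough sketch of the selftest step, assuming smartmontools is installed from ports and that the member disks show up as /dev/adN (ad13 below is only an example; repeat for each drive):

  # start the long selftest on one member disk
  smartctl -t long /dev/ad13

  # a few hours later, check the result and the LBA of the first failure (if any)
  smartctl -l selftest /dev/ad13

  # the attribute table is also worth a look
  # (Reallocated_Sector_Ct, Current_Pending_Sector)
  smartctl -a /dev/ad13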
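The full wipe and the single-sector rewrite mentioned in (a) and (b) would look something like this with plain dd. The device name, the LBA and the 512-byte sector size are assumptions here; double-check them against the selftest log before writing anything:

  # wipe the whole surface with zeroes so the firmware can remap bad sectors
  # (destroys all data on the drive)
  dd if=/dev/zero of=/dev/ad13 bs=1m

  # or overwrite only the failing sector reported by "smartctl -l selftest"
  # (123456789 is a placeholder LBA)
  dd if=/dev/zero of=/dev/ad13 bs=512 seek=123456789 count=1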
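The copy-to-a-fresh-drive and scrub steps, again with placeholder device names (and assuming the pool from the scrub step above is actually named "tank"):

  # copy a failing drive to a fresh one; dd_rescue keeps going past read
  # errors instead of aborting like plain dd
  dd_rescue /dev/ad13 /dev/ad20

  # after the rebuild, scrub the pool and list the files with checksum errors
  zpool scrub tank
  zpool status -v tank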