Date: Sat, 26 Jun 2010 16:57:48 +0100
From: Matthew Lear <matt@bubblegen.co.uk>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: Adam Vande More <amvandemore@gmail.com>, freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <1277567868.1870.21.camel@almscliff.bubblegen.co.uk>
In-Reply-To: <20100625071644.GA75910@icarus.home.lan>
References: <1276876031.7519.39.camel@almscliff.bubblegen.co.uk>
 <20100618174208.GA47470@icarus.home.lan>
 <1276889330.2210.44.camel@almscliff.bubblegen.co.uk>
 <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
 <20100622074541.GA71157@icarus.home.lan>
 <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk>
 <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
 <20100624181535.GA58443@icarus.home.lan>
 <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
 <AANLkTimo1Vb461DHw3ZXNwK5BxDcgzKSkdxc3Dnqizge@mail.gmail.com>
 <20100625071644.GA75910@icarus.home.lan>

On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
>
> All in all, replacing a drive is a completely reasonable action when
> there's evidence confirming the need for its replacement. I don't like
> replacing hardware when there's no indication replacing it will
> necessarily fix the problem; I'd rather understand the problem.
>
> Matthew, if you're able to take the system down for 2-3 hours, I would
> recommend downloading Western Digital's Data Lifeguard Diagnostics
> software (for DOS; you'll need a CD burner to burn the ISO) and running
> that on your drive. If that fails on a Long/Extended test, yep, replace
> the disk. Said utility tests a lot more than just SMART.

Ok. I've tried this but I think there are some BIOS settings that mean
that the WD DOS env can't find the license file (I've read several
postings about this). I'd rather not mess around with BIOS settings on
the machine I'm trying to restore so I'll remove the drive and plug it
into another machine and attempt to run the WD's diagnostics on it. I'll
post the results here if anything interesting crops up.

> If it passes the test, then we're back at square one, and you can try
> replacing the disk if you'd like (then boot from the 2nd disk in the
> RAID-1 array). My concern is that replacing it isn't going to fix
> anything (meaning you might have a SATA port that's going bad or the
> controller itself is broken).
>

Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
faulty drive (ad0), also removed the 160G drive that was a slave on ATA
channel 0 (just to simplify things since it wasn't part of the array),
replaced ad0 with a brand spanking new one (same make/model), switched
the BIOS to boot from the 2nd disk (ie ad2) and booted the machine.
Bootmgr started fine, booted the kernel and the machine booted normally.

atacontrol status on ar0 gives:

ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ---- MISSING
   1 ---- ONLINE

Importantly, atacontrol did detect that the RAID was degraded at boot
time:

ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
ar0: 305245MB <Intel MatrixRAID RAID1> status: DEGRADED
ar0: disk0 DOWN no device found for this subdisk
ar0: disk1 READY (mirror) using ad2 at ata1-master

Just to clarify, the array was created using atacontrol, so I have no
idea why it's reporting Intel MatrixRAID.

Trying to rebuild the array with atacontrol rebuild ar0 gives:

atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error

So I tried to detach channel ata0 and reattach it. This appeared to go
ok. Trying to rebuild the array again gave the same error as above.

I found a post on Nabble (can't find it now!) where a chap was having
the same problem rebuilding his RAID1 array using atacontrol rebuild.
According to that post, atacontrol rebuild won't work because it's a
software RAID array, and the only recommended way to get the array back
on track was to dd the contents of the healthy drive onto the new drive.

I tried this just to see what would happen:

dd if=/dev/ad2 of=/dev/ad0 bs=1024k

Seemed to work just fine as expected. I was hoping that after another
reboot, atacontrol would have seen ad0 as the missing array device on
channel 0, done anything required and hey presto, I'd have a healthy
RAID 1 array again.

Sadly not. atacontrol still insists that the array is DEGRADED despite
my having manually mirrored the contents of ad2 to ad0.

Is this a case of RTFM some more or have I missed something? It should
surely be possible to restore the array?!

-- Matt
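
For anyone wanting to script the status check described in this message, a minimal sketch follows. It assumes the `atacontrol status` text layout quoted above, and it assumes that a healthy array reports READY in the same position. `array_state` is a hypothetical helper name, not something atacontrol provides. The demo runs against the quoted text rather than a live array.

```shell
#!/bin/sh
# Sketch only: array_state is a hypothetical helper, not part of atacontrol(8).
# It reads text in the `atacontrol status` layout quoted in this message and
# prints the word that follows "status:" on the first line containing it.
array_state() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "status:") { print $(i + 1); exit }
        }
    }'
}

# Status text as quoted in the message, used here in place of a live array.
sample='ar0: ATA RAID1 status: DEGRADED
 subdisks:
   0 ---- MISSING
   1 ---- ONLINE'

state=$(printf '%s\n' "$sample" | array_state)

# Assumption: a healthy mirror reports READY; anything else needs a look.
if [ "$state" = "READY" ]; then
    echo "ar0 looks healthy"
else
    echo "ar0 needs attention: $state"
fi
```

On a live system the same function could be fed from the real command, for example `atacontrol status ar0 | array_state`. The check only reports the state; it does not attempt a rebuild.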