From owner-freebsd-stable@FreeBSD.ORG Sat Jun 26 17:12:53 2010
Date: Sat, 26 Jun 2010 10:12:51 -0700
From: Jeremy Chadwick
To: Matthew Lear
Cc: Adam Vande More, freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <20100626171251.GA26022@icarus.home.lan>
In-Reply-To: <1277567868.1870.21.camel@almscliff.bubblegen.co.uk>
References: <1276889330.2210.44.camel@almscliff.bubblegen.co.uk>
 <1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
 <20100622074541.GA71157@icarus.home.lan>
 <82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk>
 <1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
 <20100624181535.GA58443@icarus.home.lan>
 <1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
 <20100625071644.GA75910@icarus.home.lan>
 <1277567868.1870.21.camel@almscliff.bubblegen.co.uk>
User-Agent: Mutt/1.5.20 (2009-06-14)
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Sat, 26 Jun 2010 17:12:54 -0000

On Sat, Jun 26, 2010 at 04:57:48PM +0100, Matthew Lear wrote:
> On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
> >
> > All in all, replacing a drive is a completely reasonable action when
> > there's evidence confirming the need for its replacement. I don't like
> > replacing hardware when there's no indication replacing it will
> > necessarily fix the problem; I'd rather understand the problem.
> >
> > Matthew, if you're able to take the system down for 2-3 hours, I would
> > recommend downloading Western Digital's Data Lifeguard Diagnostics
> > software (for DOS; you'll need a CD burner to burn the ISO) and running
> > that on your drive. If that fails on a Long/Extended test, yep, replace
> > the disk. Said utility tests a lot more than just SMART.
>
> Ok. I've tried this but I think there are some BIOS settings that mean
> that the WD DOS env can't find the license file (I've read several
> postings about this). I'd rather not mess around with BIOS settings on
> the machine I'm trying to restore so I'll remove the drive and plug it
> into another machine and attempt to run the WD's diagnostics on it. I'll
> post the results here if anything interesting crops up.
>
> > If it passes the test, then we're back at square one, and you can try
> > replacing the disk if you'd like (then boot from the 2nd disk in the
> > RAID-1 array). My concern is that replacing it isn't going to fix
> > anything (meaning you might have a SATA port that's going bad or the
> > controller itself is broken).
> >
>
> Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
> faulty drive (ad0), also removed the 160G drive that was a slave on ATA
> channel 0 (just to simplify things since it wasn't part of the array),
> replaced ad0 with a brand spanking new one (same make/model), switched
> the BIOS to boot from the 2nd disk (ie ad2) and booted the machine.
> Bootmgr started fine, booted the kernel and the machine booted normally.
> atacontrol status on ar0 gives:
>
> ar0: ATA RAID1 status: DEGRADED
>  subdisks:
>    0 ---- MISSING
>    1 ---- ONLINE
>
> Importantly, atacontrol did detect that the RAID was degraded at boot
> time:
>
> ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> ar0: 305245MB status: DEGRADED
> ar0: disk0 DOWN no device found for this subdisk
> ar0: disk1 READY (mirror) using ad2 at ata1-master

Does "atacontrol list" show the existence of disks ad0 and ad2?  If so,
then the message probably indicates "ad0 exists but there's missing
metadata, so I'm ignoring it".  If not, then I have no real explanation
other than it sounds like the SATA controller is broken.

> Just to clarify, the array was created using atacontrol so why it's
> reporting Intel MatrixRAID I have no idea.

Are you absolutely 100% positively certain that your system/motherboard
does not have "SATA RAID" enabled in the system BIOS?  The ar0 "Intel
MatrixRAID" line really has me concerned.  If MatrixRAID is indeed
enabled in the BIOS, then almost all these problems can be explained.

> Trying to rebuild the array with atacontrol rebuild ar0 gives:
>
> atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error
>
> So I tried to detach channel ata0 and reattach it. This appeared to go
> ok. Trying to rebuild the array again gave the same error as above.

More on this later.

> I found a post on nabble (can't find it now!) where a chap was having
> the same problem rebuilding his RAID1 array using atacontrol rebuild.
> Turns out that because it's a software RAID array, atacontrol rebuild
> won't work. The only recommended way to get the array back on track was
> to dd the contents of the healthy drive onto the new drive. I tried this
> just to see what would happen:
>
> dd if=/dev/ad2 of=/dev/ad0 bs=1024k
>
> Seemed to work just fine as expected. I was hoping that after another
> reboot, atacontrol would have seen ad0 as the missing array device on
> channel 0, done anything required and hey presto, I'd have a healthy RAID
> 1 array again.
>
> Sadly, not. atacontrol still insists that the array is DEGRADED despite
> having manually mirrored the contents of ad2 to ad0.

This probably has to do with corrupt/missing/incorrect metadata.  The
dd method (to copy disk X to disk Y) isn't sufficient.

The atacontrol man page states the following for your situation:

     If the system has a pure software array and is not using a "real" ATA
     RAID controller, then shut the system down, make sure that the disk
     that was still working is moved to the bootable position (channel 0 or
     whatever the BIOS allows the system to boot from) and the blank disk is
     placed in the secondary position, then boot the system into single-user
     mode and issue the command:

           atacontrol addspare ar0 ad6
           atacontrol rebuild ar0

So I believe what the man page is telling you to do is:

1) Power down the system
2) Physically connect the ad2 (working/has-data) disk to SATA channel 0
3) Physically connect the ad0 (brand-new) disk to SATA channel 1
4) Make mental note that the disk names will now be swapped: ad0 will
   now be the working/has-data disk, and ad2 will be the brand-new disk
5) Power up the system and make sure you're booting from SATA channel 0
6) Go into single-user
7) Execute:

   atacontrol addspare ar0 ad2
   atacontrol rebuild ar0

I have no idea if this will work or not.
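
For what it's worth, the command sequence could be wrapped in a small
dry-run script, roughly like the sketch below.  This is untested against
a real array.  The run and rebuild_plan helpers and the ARRAY, SPARE and
DRYRUN variables are names I made up for illustration, and the defaults
(ar0, ad2) assume the disks have already been swapped as described.  By
default it only prints what it would run.

```shell
#!/bin/sh
# Dry-run sketch of the rebuild sequence. Nothing here is an atacontrol
# feature: run/rebuild_plan and ARRAY/SPARE/DRYRUN are hypothetical names.
ARRAY="${ARRAY:-ar0}"     # assumed array name
SPARE="${SPARE:-ad2}"     # assumed name of the brand-new disk after the swap
DRYRUN="${DRYRUN:-1}"     # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it as given.
run() {
    if [ "$DRYRUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop at the first command that fails.
rebuild_plan() {
    run atacontrol list &&                        # are both disks visible?
    run atacontrol status "$ARRAY" &&             # array state before
    run atacontrol addspare "$ARRAY" "$SPARE" &&  # add the new disk as a spare
    run atacontrol rebuild "$ARRAY" &&            # start the rebuild
    run atacontrol status "$ARRAY"                # array state after
}

rebuild_plan
```

Setting DRYRUN=0 would execute the commands for real, which only makes
sense from single-user mode once "atacontrol list" shows both disks.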
If this doesn't work, I'm out of ideas other than restoring from backups
or running in degraded mode to back up your data, then afterward,
rebuild the system using something like gmirror.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |