From: Matthew Lear
To: Jeremy Chadwick
Cc: Adam Vande More, freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Date: Sun, 27 Jun 2010 00:04:48 +0100
Message-ID: <1277593488.1884.107.camel@almscliff.bubblegen.co.uk>
In-Reply-To: <20100626171251.GA26022@icarus.home.lan>

On Sat, 2010-06-26 at 10:12 -0700, Jeremy Chadwick wrote:
> On Sat, Jun 26, 2010 at 04:57:48PM +0100, Matthew Lear wrote:
> > On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
> > >
> > > All in all, replacing a drive is a completely reasonable action when
> > > there's evidence confirming the need for its replacement. I don't like
> > > replacing hardware when there's no indication replacing it will
> > > necessarily fix the problem; I'd rather understand the problem.
> > >
> > > Matthew, if you're able to take the system down for 2-3 hours, I would
> > > recommend downloading Western Digital's Data Lifeguard Diagnostics
> > > software (for DOS; you'll need a CD burner to burn the ISO) and running
> > > that on your drive. If that fails on a Long/Extended test, yep, replace
> > > the disk. Said utility tests a lot more than just SMART.
> >
> > Ok. I've tried this but I think there are some BIOS settings that mean
> > that the WD DOS env can't find the license file (I've read several
> > postings about this). I'd rather not mess around with BIOS settings on
> > the machine I'm trying to restore so I'll remove the drive and plug it
> > into another machine and attempt to run the WD's diagnostics on it. I'll
> > post the results here if anything interesting crops up.
> >
> > > If it passes the test, then we're back at square one, and you can try
> > > replacing the disk if you'd like (then boot from the 2nd disk in the
> > > RAID-1 array). My concern is that replacing it isn't going to fix
> > > anything (meaning you might have a SATA port that's going bad or the
> > > controller itself is broken).
> > >
> >
> > Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
> > faulty drive (ad0), also removed the 160G drive that was a slave on ATA
> > channel 0 (just to simplify things since it wasn't part of the array),
> > replaced ad0 with a brand spanking new one (same make/model), switched
> > the BIOS to boot from the 2nd disk (ie ad2) and booted the machine.
> > Bootmgr started fine, booted the kernel and the machine booted normally.
> > atacontrol status on ar0 gives:
> >
> > ar0: ATA RAID1 status: DEGRADED
> >  subdisks:
> >    0 ---- MISSING
> >    1 ---- ONLINE
> >
> > Importantly, atacontrol did detect that the RAID was degraded at boot
> > time:
> >
> > ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> > ar0: 305245MB status: DEGRADED
> > ar0: disk0 DOWN no device found for this subdisk
> > ar0: disk1 READY (mirror) using ad2 at ata1-master
>
> Does "atacontrol list" show the existence of disks ad0 and ad2? If so,
> then the message probably indicates "ad0 exists but there's missing
> metadata, so I'm ignoring it". If not, then I have no real explanation
> other than it sounds like the SATA controller is broken.

Yes. I agree.

> > Just to clarify, the array was created using atacontrol so why it's
> > reporting Intel MatrixRAID I have no idea.
> Are you absolutely 100% positively certain that your system/motherboard
> does not have "SATA RAID" enabled in the system BIOS? The ar0 "Intel
> MatrixRAID" line really has me concerned. If MatrixRAID is indeed
> enabled in the BIOS, then almost all these problems can be explained.

Yep. Agreed! 100% positive. I've just double checked. SATA RAID Enable
is definitely set to Disabled in the BIOS.

> > Trying to rebuild the array with atacontrol rebuild ar0 gives:
> >
> > atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error
> >
> > So I tried to detach channel ata0 and reattach it. This appeared to go
> > ok. Trying to rebuild the array again gave the same error as above.
>
> More on this later.
>
> > I found a post on nabble (can't find it now!) where a chap was having
> > the same problem rebuilding his RAID1 array using atacontrol rebuild.
> > Turns out that because it's a software RAID array, atacontrol rebuild
> > won't work. The only recommended way to get the array back on track was
> > to dd the contents of the healthy drive onto the new drive. I tried this
> > just to see what would happen:
> >
> > dd if=/dev/ad2 of=/dev/ad0 bs=1024k
> >
> > Seemed to work just fine as expected. I was hoping that after another
> > reboot, atacontrol would have seen ad0 as the missing array device on
> > channel 0, done anything required and hey presto, I'd have a healthy
> > RAID 1 array again.
> >
> > Sadly, not. atacontrol still insists that the array is DEGRADED despite
> > having manually mirrored the contents of ad2 to ad0.
>
> This probably has to do with corrupt/missing/incorrect metadata. The dd
> method (to copy disk X to disk Y) isn't sufficient.

Yes I suspected as much :-( It felt like an extremely flimsy, optimistic
and pathetic long shot.

> The atacontrol man page states the following for your situation:
>
>   If the system has a pure software array and is not using a "real" ATA
>   RAID controller, then shut the system down, make sure that the disk that
>   was still working is moved to the bootable position (channel 0 or
>   whatever the BIOS allows the system to boot from) and the blank disk is
>   placed in the secondary position, then boot the system into single-user
>   mode and issue the command:
>
>     atacontrol addspare ar0 ad6
>     atacontrol rebuild ar0
>
> So I believe what the man page is telling you to do is:
>
> 1) Power down the system
> 2) Physically connect the ad2 (working/has-data) disk to SATA channel 0
> 3) Physically connect the ad0 (brand-new) disk to SATA channel 1
> 4) Make mental note that the disk names will now be swapped: ad0 will
>    now be the working/has-data disk, and ad2 will be the brand-new disk
> 5) Power up the system and make sure you're booting from SATA channel 0
> 6) Go into single-user
> 7) Execute:
>    atacontrol addspare ar0 ad2
>    atacontrol rebuild ar0
>
> I have no idea if this will work or not.

Worked a treat. I didn't swap the drives around but with ad2 running as
the 'good' bootable disk and with a new disk in the ad0 position:

# atacontrol addspare ar0 ad0
ad0: inserted into ar0 disk0 as spare

# atacontrol rebuild ar0
# atacontrol status ar0
ar0: ATA RAID1 status: REBUILDING 0% completed
 subdisks:
   0 ad0 SPARE
   1 ad2 ONLINE

..some time later..

# atacontrol status ar0
ar0: ATA RAID1 status: READY
 subdisks:
   0 ad0 ONLINE
   1 ad2 ONLINE

Immediately followed by:

ad0: WARNING - WRITE_DMA taskqueue timeout - completing request directly
ad0: WARNING - WRITE_DMA48 freeing taskqueue zombie request

> If this doesn't work, I'm out of ideas other than restoring from backups
> or running in degraded mode to back up your data, then afterward,
> rebuild the system using something like gmirror.
>

So it appears to be ok! :-) And upon reboot, everything also seems ok.
Phew!

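For anyone wanting to script the wait between "atacontrol rebuild" and
the array coming back, here is a minimal sketch. It assumes only the
"status:" line format shown in the atacontrol status output above. The
names raid_state and wait_for_rebuild are hypothetical helpers, not part
of atacontrol, and the 60 second default polling interval is an
arbitrary choice.

```shell
#!/bin/sh
# Sketch only: helpers for watching an ataraid rebuild.

# raid_state reads `atacontrol status` text on stdin and prints the word
# that follows "status:" (for example DEGRADED, REBUILDING or READY).
raid_state() {
    sed -n 's/^.*status: *\([A-Z]*\).*$/\1/p' | head -n 1
}

# wait_for_rebuild (hypothetical helper) polls the named array until it
# is no longer REBUILDING, then prints whatever state it ended up in.
# $1 = array name (e.g. ar0), $2 = seconds between polls (assumed: 60).
wait_for_rebuild() {
    array="$1"
    interval="${2:-60}"
    while :; do
        state=$(atacontrol status "$array" | raid_state)
        [ "$state" = "REBUILDING" ] || break
        sleep "$interval"
    done
    echo "$state"
}
```

Run after "atacontrol rebuild ar0", something like "wait_for_rebuild ar0
300" would return once the array leaves the REBUILDING state and print
the state it finds then, so a result other than READY is the cue to go
and look at the console messages.
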
The warnings above are somewhat concerning but I wonder if these
wouldn't be seen with newer kernels (given the talk of increasing ata
timeouts etc)...

Incidentally, is there a way to easily migrate from an atacontrol
created array to a gmirror created array? I'm running FreeBSD 8.0 on
another machine with a gmirror created RAID1 array with no problem
whatsoever (I chose gmirror as the choice for this machine over
atacontrol after reading various postings about software RAID under
recent releases of FreeBSD). I was planning on upgrading the 7.2 machine
to 8.0-RC1 anyway so if I could easily move to using gmirror then I
would. That said, atacontrol should (I assume) function correctly with
8.x, shouldn't it, or is support of it dwindling somewhat?

How easy is it to upgrade an array to use larger disks - atacontrol or
gmirror? Feel free to respond with RTFM :-) I suppose one possible
solution would be to use something like GParted (example Linux land
tool) to grow a certain partition on an array (eg the partition mounted
on /usr/local). That way both partitions on each of the separate array
subdisks would be grown transparently since the operation would be
performed on partition ar0s1 (ie, taken care of by atacontrol /
gmirror).

Thank you for taking the time to detail and describe things for me to
try, Jeremy. I very much appreciate it indeed.

Normal services have been resumed! :-)

Cheers,
-- Matt