From owner-freebsd-stable@FreeBSD.ORG  Sat Jun 26 23:04:52 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 63D59106566B
	for <freebsd-stable@freebsd.org>; Sat, 26 Jun 2010 23:04:52 +0000 (UTC)
	(envelope-from matt@bubblegen.co.uk)
Received: from relay.ptn-ipout02.plus.net (relay.ptn-ipout02.plus.net
	[212.159.7.36]) by mx1.freebsd.org (Postfix) with ESMTP id BA7948FC14
	for <freebsd-stable@freebsd.org>; Sat, 26 Jun 2010 23:04:51 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Av0EAPMkJkxUXeb6/2dsb2JhbACDHZwLca5CkB6BKYE5gVByBA
Received: from outmx05.plus.net ([84.93.230.250])
	by relay.ptn-ipout02.plus.net with ESMTP; 27 Jun 2010 00:04:50 +0100
Received: from bubblegen.plus.com ([80.229.236.194] helo=[192.136.1.18])
	by outmx05.plus.net with esmtp (Exim) id 1OSeQb-0004HL-VK;
	Sun, 27 Jun 2010 00:04:50 +0100
From: Matthew Lear <matt@bubblegen.co.uk>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
In-Reply-To: <20100626171251.GA26022@icarus.home.lan>
References: <1276889330.2210.44.camel@almscliff.bubblegen.co.uk>
	<1277155992.1860.3.camel@almscliff.bubblegen.co.uk>
	<20100622074541.GA71157@icarus.home.lan>
	<82A96ECD-676C-4A4D-A328-0CFAABD64D50@gid.co.uk>
	<1277401934.1874.12.camel@almscliff.bubblegen.co.uk>
	<20100624181535.GA58443@icarus.home.lan>
	<1277417182.1874.30.camel@almscliff.bubblegen.co.uk>
	<AANLkTimo1Vb461DHw3ZXNwK5BxDcgzKSkdxc3Dnqizge@mail.gmail.com>
	<20100625071644.GA75910@icarus.home.lan>
	<1277567868.1870.21.camel@almscliff.bubblegen.co.uk>
	<20100626171251.GA26022@icarus.home.lan>
Content-Type: text/plain; charset="UTF-8"
Date: Sun, 27 Jun 2010 00:04:48 +0100
Message-ID: <1277593488.1884.107.camel@almscliff.bubblegen.co.uk>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.3 
Content-Transfer-Encoding: 8bit
Cc: Adam Vande More <amvandemore@gmail.com>, freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 26 Jun 2010 23:04:52 -0000

On Sat, 2010-06-26 at 10:12 -0700, Jeremy Chadwick wrote:
> On Sat, Jun 26, 2010 at 04:57:48PM +0100, Matthew Lear wrote:
> > On Fri, 2010-06-25 at 00:16 -0700, Jeremy Chadwick wrote:
> > > 
> > > All in all, replacing a drive is a completely reasonable action when
> > > there's evidence confirming the need for its replacement.  I don't like
> > > replacing hardware when there's no indication replacing it will
> > > necessarily fix the problem; I'd rather understand the problem.
> > > 
> > > Matthew, if you're able to take the system down for 2-3 hours, I would
> > > recommend downloading Western Digital's Data Lifeguard Diagnostics
> > > software (for DOS; you'll need a CD burner to burn the ISO) and running
> > > that on your drive.  If that fails on a Long/Extended test, yep, replace
> > > the disk.  Said utility tests a lot more than just SMART.
> > 
> > Ok. I've tried this but I think there are some BIOS settings that mean
> > that the WD DOS env can't find the license file (I've read several
> > postings about this). I'd rather not mess around with BIOS settings on
> > the machine I'm trying to restore so I'll remove the drive and plug it
> > into another machine and attempt to run the WD's diagnostics on it. I'll
> > post the results here if anything interesting crops up.
> > 
> > > If it passes the test, then we're back at square one, and you can try
> > > replacing the disk if you'd like (then boot from the 2nd disk in the
> > > RAID-1 array).  My concern is that replacing it isn't going to fix
> > > anything (meaning you might have a SATA port that's going bad or the
> > > controller itself is broken).
> > > 
> > 
> > Meanwhile, I powered off the RAID 1 machine, removed the [apparently]
> > faulty drive (ad0), also removed the 160G drive that was a slave on ATA
> > channel 0 (just to simplify things since it wasn't part of the array),
> > replaced ad0 with a brand spanking new one (same make/model), switched
> > the BIOS to boot from the 2nd disk (ie ad2) and booted the machine.
> > Bootmgr started fine, booted the kernel and the machine booted normally.
> > atacontrol status on ar0 gives:
> > 
> > ar0: ATA RAID1 status: DEGRADED
> >  subdisks:
> >    0 ---- MISSING
> >    1 ---- ONLINE
> > 
> > Importantly, atacontrol did detect that the RAID was degraded at boot
> > time:
> > 
> > ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
> > ar0: 305245MB <Intel MatrixRAID RAID1> status: DEGRADED
> > ar0: disk0 DOWN no device found for this subdisk
> > ar0: disk1 READY (mirror) using ad2 at ata1-master
> 
> Does "atacontrol list" show the existence of disks ad0 and ad2?  If so,
> then the message probably indicates "ad0 exists but there's missing
> metadata, so I'm ignoring it".  If not, then I have no real explanation
> other than it sounds like the SATA controller is broken.

Yes. I agree.
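
For what it's worth, that check is easy to script. A minimal sketch,
assuming only that each attached disk shows up by name (ad0, ad2, ...)
as a whole word somewhere in the "atacontrol list" output; missing_disks
is a made-up helper name, not part of atacontrol:

```sh
# missing_disks NAME...
# Read `atacontrol list` output on stdin and print every NAME that does
# not appear in it as a whole word. Prints nothing if all are present.
missing_disks() {
    listing=$(cat)
    for disk in "$@"; do
        printf '%s\n' "$listing" | grep -qw -- "$disk" || echo "$disk"
    done
}
```

Usage would be something like "atacontrol list | missing_disks ad0 ad2";
empty output means both disks were detected.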

> > Just to clarify, the array was created using atacontrol so why it's
> > reporting Intel MatrixRAID I have no idea.

> Are you absolutely 100% positively certain that your system/motherboard
> does not have "SATA RAID" enabled in the system BIOS?  The ar0 "Intel
> MatrixRAID" line really has me concerned.  If MatrixRAID is indeed
> enabled in the BIOS, then almost all these problems can be explained.

Yep. Agreed! 100% positive. I've just double checked. SATA RAID Enable
is definitely set to Disabled in the BIOS.

> > Trying to rebuild the array with atacontrol rebuild ar0 gives:
> > 
> > atacontrol: ioctl(IOCATARAIDREBUILD): Input/output error
> >
> > So I tried to detach channel ata0 and reattach it. This appeared to go
> > ok. Trying to rebuild the array again gave the same error as above.
> 
> More on this later.
> 
> > I found a post on nabble (can't find it now!) where a chap was having
> > the same problem rebuilding his RAID1 array using atacontrol rebuild.
> > Turns out that because it's a software RAID array, atacontrol rebuild
> > won't work. The only recommended way to get the array back on track was
> > to dd the contents of the healthy drive onto the new drive. I tried this
> > just to see what would happen:
> > 
> > dd if=/dev/ad2 of=/dev/ad0 bs=1024k
> > 
> > Seemed to work just fine as expected. I was hoping that after another
> > reboot, atacontrol would have seen ad0 as the missing array device on
> > channel 0, done anything required and hey presto, I'd have a healthy
> > RAID 1 array again.
> > 
> > Sadly, not. atacontrol still insists that the array is DEGRADED despite
> > having manually mirrored the contents of ad2 to ad0.
> 
> This probably has to do with corrupt/missing/incorrect metadata.  The dd
> method (to copy disk X to disk Y) isn't sufficient.

Yes, I suspected as much :-( It felt like an extremely flimsy, optimistic
and pathetic long shot.

> The atacontrol man page states the following for your situation:
> 
>    If the system has a pure software array and is not using a "real" ATA
>    RAID controller, then shut the system down, make sure that the disk that
>    was still working is moved to the bootable position (channel 0 or
>    whatever the BIOS allows the system to boot from) and the blank disk is
>    placed in the secondary position, then boot the system into single-user
>    mode and issue the command:
> 
>            atacontrol addspare ar0 ad6
>            atacontrol rebuild ar0
> 
> So I believe what the man page is telling you to do is:
> 
> 1) Power down the system
> 2) Physically connect the ad2 (working/has-data) disk to SATA channel 0
> 3) Physically connect the ad0 (brand-new) disk to SATA channel 1
> 4) Make mental note that the disk names will now be swapped: ad0 will
>    now be the working/has-data disk, and ad2 will be the brand-new disk
> 5) Power up the system and make sure you're booting from SATA channel 0
> 6) Go into single-user mode
> 7) Execute:
>    atacontrol addspare ar0 ad2
>    atacontrol rebuild ar0
> 
> I have no idea if this will work or not.

Worked a treat. I didn't swap the drives around but with ad2 running as
the 'good' bootable disk and with a new disk in the ad0 position:

# atacontrol addspare ar0 ad0
ad0: inserted into ar0 disk0 as spare

# atacontrol rebuild ar0

# atacontrol status ar0
ar0: ATA RAID1 status: REBUILDING 0% completed
 subdisks:
  0 ad0 SPARE
  1 ad2 ONLINE

..some time later..

# atacontrol status ar0
ar0: ATA RAID1 status: READY
subdisks:
  0 ad0 ONLINE
  1 ad2 ONLINE

Immediately followed by:
ad0: WARNING - WRITE_DMA taskqueue timeout - completing request directly
ad0: WARNING - WRITE_DMA48 freeing taskqueue zombie request
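
Rather than re-running the status command by hand, the wait between the
two status checks could be scripted. A rough sketch, untested against a
live array, assuming the "status:" line keeps the format shown above;
raid_state and wait_for_ready are made-up names, not atacontrol features:

```sh
# Print the state word (READY, DEGRADED, REBUILDING, ...) taken from the
# first "status:" line of `atacontrol status` output read on stdin.
raid_state() {
    sed -n 's/^.*status: \([A-Z]*\).*$/\1/p' | head -n 1
}

# wait_for_ready ARRAY [POLL_SECONDS]
# Poll ARRAY until it reports READY. Give up with an error if it reports
# anything other than REBUILDING in the meantime.
wait_for_ready() {
    array=$1
    interval=${2:-60}
    while :; do
        state=$(atacontrol status "$array" | raid_state)
        case "$state" in
            READY)      return 0 ;;
            REBUILDING) sleep "$interval" ;;
            *)          echo "$array: unexpected state '$state'" >&2
                        return 1 ;;
        esac
    done
}
```

After "atacontrol rebuild ar0", running "wait_for_ready ar0 300" would
check every five minutes and return once the array is READY.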

> If this doesn't work, I'm out of ideas other than restoring from backups
> or running in degraded mode to back up your data, then afterward,
> rebuild the system using something like gmirror.
> 

So it appears to be ok! :-) And upon reboot, everything also seems ok.
Phew! The warnings above are somewhat concerning but I wonder if these
wouldn't be seen with newer kernels (given the talk of increasing ata
timeouts etc)...

<cheekily piggy-back two questions>
Incidentally, is there an easy way to migrate from an atacontrol-created
array to a gmirror-created array? I'm running FreeBSD 8.0 on another
machine with a gmirror-created RAID1 array with no problems whatsoever (I
chose gmirror over atacontrol for that machine after reading various
postings about software RAID under recent releases of
FreeBSD). I was planning on upgrading the 7.2 machine to 8.0-RC1 anyway
so if I could easily move to using gmirror then I would. That said,
atacontrol should (I assume) function correctly with 8.x, shouldn't it,
or is support of it dwindling somewhat?

How easy is it to upgrade an array to use larger disks - atacontrol or
gmirror? Feel free to respond with RTFM :-) I suppose one possible
solution would be to use something like GParted (a Linux-land example)
to grow a particular partition on an array (eg the partition mounted
on /usr/local). That way both partitions on each of the separate array
subdisks would be grown transparently since the operation would be
performed on partition ar0s1<n> (ie, taken care of by atacontrol /
gmirror).
</cheekily piggy-back two questions>

Thank you for taking the time to detail and describe things for me
to try, Jeremy. I very much appreciate it indeed. Normal services have
been resumed! :-)

Cheers,
--  Matt