Date: Fri, 18 Jun 2010 01:21:27 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Matthew Lear <matt@bubblegen.co.uk>
Cc: freebsd-stable@freebsd.org
Subject: Re: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <20100618082127.GA34578@icarus.home.lan>
In-Reply-To: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
References: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>

On Fri, Jun 18, 2010 at 08:08:24AM +0100, Matthew Lear wrote:
> Hi there,
>
> I'm running 7.2-RELEASE-p4 on an i386 HP server (ML G5) in RAID1
> configuration. Very recently, I've seen IO errors such as:
>
> ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527
>
> reported and the RAID mirror is now offline.
>
> ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
> ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR>
> error=10<NID_NOT_FOUND> LBA=395032335
> ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
>
> Strangely, I've run some SMART tests on the device and no error has been
> recorded. Health checks pass. Running a long test on the device doesn't
> show any problem. While SMART can be manufacturer specific I at least
> expected to see something which looked to be suspicious.

Could you please provide the full output from "smartctl -a /dev/ad0"
here? Your drive may be completely fine and you may not have to swap it
at all; hard to say.

> The drives in the RAID exist on two separate ATA channels:
> [root@meshuga /home/matt]# atacontrol list
> ATA channel 0:
>     Master:  ad0 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
>     Slave:   ad1 <FB160C4081/HPF0> SATA revision 1.x
> ATA channel 1:
>     Master:  ad2 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
>     Slave:   no device present
> ATA channel 2:
>     Master: acd0 <HL-DT-ST DVDRAM GH22NS40/NL01> SATA revision 1.x
>     Slave:   no device present
> ATA channel 3:
>     Master:  no device present
>     Slave:   no device present
>
> ad1 is a third 160G drive that I periodically back up to using cron.

So your RAID-1 array consists of ad0 and ad2? You didn't provide
"atacontrol status" output so I'm going to assume that's the case.

What's odd to me is that you somehow have two disks on a single ATA
channel -- look closely at channel 0. SATA has a 1:1 device-to-channel
mapping, so I'm a little surprised to see there are two devices on
channel 0.
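
As a rough illustration of that check (more than one device reported on
a single channel), the sketch below scans "atacontrol list" output and
prints any channel with two attached devices. The function name
shared_channels is made up for this example, and the sample text is just
the listing quoted above; it assumes the output layout shown there and
has not been checked against other FreeBSD releases.

```shell
#!/bin/sh
# Sketch only: flag ATA channels that report more than one attached device.
# "shared_channels" is a hypothetical helper name, not part of atacontrol.
# It reads "atacontrol list"-style text on stdin.
shared_channels() {
    awk '
        # "ATA channel 0:" -> remember the channel number without the colon
        $1 == "ATA" && $2 == "channel" { ch = $3; sub(/:$/, "", ch); next }
        # "Master: ad0 <...>" counts; "Slave: no device present" does not
        ($1 == "Master:" || $1 == "Slave:") && $2 != "no" {
            count[ch]++
            devs[ch] = devs[ch] " " $2
        }
        END {
            for (c in count)
                if (count[c] > 1)
                    print "channel " c ":" devs[c]
        }
    '
}

# Sample input copied from the listing quoted in this message.
sample='ATA channel 0:
    Master:  ad0 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:   ad1 <FB160C4081/HPF0> SATA revision 1.x
ATA channel 1:
    Master:  ad2 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:   no device present
ATA channel 2:
    Master: acd0 <HL-DT-ST DVDRAM GH22NS40/NL01> SATA revision 1.x
    Slave:   no device present
ATA channel 3:
    Master:  no device present
    Slave:   no device present'

printf '%s\n' "$sample" | shared_channels
```

On a live system the same function could be fed directly with
"atacontrol list | shared_channels"; any line it prints names a channel
where detaching the channel would affect more than one disk.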

To me, this indicates your system BIOS is configured to run in
"Emulation" mode -- where the ATA controller pretends to be a PATA/IDE
controller, thus SATA-0 and SATA-1 devices appear as primary master and
primary slave, respectively.

What motherboard is this? Can you change the setting to either
"Native", "Enhanced", or (even better) "AHCI"? I've seen some systems
where the Serial ATA option in the BIOS has an "Auto" option, which does
totally bizarre things at times.

But before changing the setting, I would recommend dealing with the disk
problem first. Changing the SATA controller operation mode will almost
certainly change all of your device names (you'll have to go into
single-user mode, mount filesystems by hand, fix /etc/fstab, etc.).

Also, can you please provide output from "dmesg | grep -i ata"?

> I've seen the thread below but I'm not using ZFS. This seems similar to
> what I'm experiencing.
> http://freebsd.monkey.org/freebsd-stable/200801/msg00617.html
>
> I'm using software RAID with atacontrol but the drives are not hot-swap.

When you say "software RAID", I'm assuming you're referring to ata(4)'s
native OS-level RAID (as in "atacontrol create RAID1 ad0 ad1"). Or are
you using something like Intel MatrixRAID?

> Therefore I expect that I need to detach ad0 from the RAID, power down
> the unit, replace the drive, power on the unit and rebuild the array in
> order to fix things. Trouble is, I'm struggling to find out if this can
> be done safely with atacontrol and the hw configuration I have, and if
> so, how best to do it?

The atacontrol man page covers your situation:

  It is NOT recommended to create such arrays on a primary/secondary
  pair on a SINGLE channel since the throughput of the mirror would be
  severely compromised, the ability to rebuild the array in the event
  of a disk failure would be greatly complicated, and if a disk
  controller electronics failed it could wedge the channel and take
  both disks in the mirror offline.
  (which would defeat the purpose of having a mirror in the first
  place)

I realise ad0 is on channel 0 and ad2 is on channel 1, but you have a
"mystery device" as a Slave on channel 0, which is going to be impacted.

You really need AHCI to be able to hot-swap effectively. The procedure
I've followed for years -- without ZFS in the picture (that should just
add a few extra commands to the picture) -- relies on AHCI and a proper
hot-swap bay/backplane. Hot-swapping disks without such a backplane, in
my experience, results in the system powering off suddenly.

Anyway, this is the procedure:

- atacontrol detach ataX (where ataX = channel disk is attached to)
- Physically remove the bad disk
- Physically insert a new disk
- Wait 15 seconds for drive to settle
- atacontrol attach ataX

The new disk should appear automatically, and should appear as the same
device name (adX) that it did before. At least that's my experience
when using AHCI with ataahci.ko (I haven't tried when using ahci.ko,
which uses CAM). We can discuss the details/differences later.

If the disk doesn't reappear ("atacontrol list" shows no device
attached) then do "atacontrol reinit ataX", which should make it appear.
I've had to do this once or twice, and it worked fine. I've also seen
this command lock the system up or panic the kernel.

But as stated, you won't be able to do this because you have two SATA
devices appearing under one channel. Given that, I would recommend you
follow this procedure instead:

- Power down system cleanly ("shutdown -p now")
- Remove power cable from PSU
- Physically disconnect + remove the bad disk
- Physically add + connect the new disk
- Power up system
- Go into system BIOS and make sure the new disk appears. (FreeBSD
  doesn't care what the BIOS thinks, so this step is done solely to make
  sure that the PC sees the disk at all)
- Let FreeBSD boot/etc.
  -- I believe ata(4) will automatically begin rebuilding the array
  when it tastes the new/replacement disk and sees it has no metadata.
  "atacontrol status" should show the state.

> It may well be a case of RTFM (again) but I just wanted to run this by
> the community to get some feedback. Losing data is not an option here
> so hopefully I can get the machine back up on its feet soon.

Don't take this as a pot-shot, but you should have tested this whole
ordeal before putting the machine into a mission-critical role. It's
important to do this rather than just blindly assume there won't be any
complications; better to be safe than sorry. :-)

Testing disk failures of this specific nature is pretty simple,
especially if there's a hot-swap backplane involved.

-- 
| Jeremy Chadwick                                   jdc@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |