Date: Fri, 18 Jun 2010 08:08:24 +0100
From: Matthew Lear <matt@bubblegen.co.uk>
To: freebsd-stable@freebsd.org
Subject: 7.2-RELEASE-p4, IO errors & RAID1 failure
Message-ID: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
Hi there,

I'm running 7.2-RELEASE-p4 on an i386 HP server (ML G5) in a RAID1 configuration. Very recently, I've seen IO errors such as the following reported, and the RAID mirror is now offline:

  ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527
  ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
  ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
  ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode

Strangely, I've run some SMART tests on the device and no errors have been recorded. Health checks pass. Running a long test on the device doesn't show any problems. While SMART can be manufacturer specific, I at least expected to see something that looked suspicious.

The drives in the RAID are on two separate ATA channels:

  [root@meshuga /home/matt]# atacontrol list
  ATA channel 0:
      Master:  ad0 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
      Slave:   ad1 <FB160C4081/HPF0> SATA revision 1.x
  ATA channel 1:
      Master:  ad2 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
      Slave:       no device present
  ATA channel 2:
      Master: acd0 <HL-DT-ST DVDRAM GH22NS40/NL01> SATA revision 1.x
      Slave:       no device present
  ATA channel 3:
      Master:      no device present
      Slave:       no device present

ad1 is a third 160G drive that I periodically back up to using cron.

I've seen the thread below, which seems similar to what I'm experiencing, but I'm not using ZFS.

http://freebsd.monkey.org/freebsd-stable/200801/msg00617.html

I'm using software RAID with atacontrol but the drives are not hot-swap. Therefore I expect that I need to detach ad0 from the RAID, power down the unit, replace the drive, power on the unit and rebuild the array in order to fix things.

Trouble is, I'm struggling to find out whether this can be done safely with atacontrol and the hardware configuration I have and, if so, how best to do it. It may well be a case of RTFM (again), but I just wanted to run this by the community to get some feedback.
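For concreteness, the sequence I have in mind is sketched below. It is an untested dry run that only prints the commands instead of executing them. The subcommands (status, list, addspare, rebuild) are the ones documented in atacontrol(8), but the assumption that the replacement disk shows up as ad0 again, and that addspare followed by rebuild is the right approach for this array, is exactly what I'm unsure about. The function names and the DRY_RUN variable are just names I made up for the sketch.

```sh
#!/bin/sh
# Dry-run sketch of the planned ad0 replacement. Nothing is executed
# unless DRY_RUN=0 is set; by default each command is only printed.

ARRAY="ar0"        # the degraded RAID1 array from the log above
FAILED_DISK="ad0"  # assumption: the replacement disk attaches as ad0 again

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: record the array state before powering down.
# (atacontrol detach works on a whole channel, and ad1 shares channel 0
# with ad0 here, so no detach is issued; the box is powered off instead.)
before_poweroff() {
    run atacontrol status "$ARRAY"
}

# Step 2, after the drive has been swapped and the machine is back up:
# confirm the new disk is visible, add it as a spare, start the rebuild.
after_replacement() {
    run atacontrol list
    run atacontrol addspare "$ARRAY" "$FAILED_DISK"
    run atacontrol rebuild "$ARRAY"
    run atacontrol status "$ARRAY"
}

before_poweroff
after_replacement
```

Run as-is, this just lists the commands in order so they can be checked against the man page before anything touches the array.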

Losing data is not an option here, so hopefully I can get the machine back on its feet soon. Any help or feedback much appreciated.

Thanks,
-- Matt