From owner-freebsd-stable@FreeBSD.ORG  Fri Jun 18 07:37:19 2010
From: Matthew Lear <matt@bubblegen.co.uk>
To: freebsd-stable@freebsd.org
Content-Type: text/plain; charset="UTF-8"
Date: Fri, 18 Jun 2010 08:08:24 +0100
Message-ID: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
Mime-Version: 1.0
X-Mailer: Evolution 2.28.1 
Content-Transfer-Encoding: 7bit
Subject: 7.2-RELEASE-p4, IO errors & RAID1 failure

Hi there,

I'm running 7.2-RELEASE-p4 on an i386 HP server (ML G5) in RAID1
configuration. Very recently, I've seen IO errors such as:

ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527

being reported, and the RAID mirror is now offline:

ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
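
For anyone wanting to see how widely the errors are spread, the affected
LBAs can be pulled out of the kernel log. This is only a minimal sketch: it
assumes the log is plain text with one kernel message per line (for
example /var/log/messages), and the function name ata_error_lbas is made
up for illustration.

```shell
# List the distinct LBAs that appear in TIMEOUT/FAILURE messages for one
# ATA device. Log text is read from stdin; the device name is argument 1.
ata_error_lbas() {
    dev="$1"
    # Keep only TIMEOUT/FAILURE lines for the device, pull out the LBA
    # value, then sort numerically and drop duplicates.
    grep -E "${dev}: (TIMEOUT|FAILURE)" |
        sed -n 's/.*LBA=\([0-9][0-9]*\).*/\1/p' |
        sort -nu
}

# Example using the ad0 messages quoted in this mail.
ata_error_lbas ad0 <<'EOF'
ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527
ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
ad0: FAILURE - WRITE_DMA48 status=51<READY,DSC,ERROR> error=10<NID_NOT_FOUND> LBA=395032335
ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode
EOF
```

On a live system the same function could be fed from the log file, e.g.
"ata_error_lbas ad0 < /var/log/messages".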

Strangely, I've run some SMART tests on the device and no errors have
been recorded. Health checks pass, and a long test on the device doesn't
show any problem. While SMART can be manufacturer-specific, I at least
expected to see something that looked suspicious.

The drives in the RAID exist on two separate ATA channels:
[root@meshuga /home/matt]# atacontrol list
ATA channel 0:
    Master:  ad0 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:   ad1 <FB160C4081/HPF0> SATA revision 1.x
ATA channel 1:
    Master:  ad2 <WDC WD3200AAKS-00VYA0/12.01B02> SATA revision 2.x
    Slave:       no device present
ATA channel 2:
    Master: acd0 <HL-DT-ST DVDRAM GH22NS40/NL01> SATA revision 1.x
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:       no device present

ad1 is a third 160G drive that I periodically back up to using cron.

I've seen the thread below but I'm not using ZFS. This seems similar to
what I'm experiencing.
http://freebsd.monkey.org/freebsd-stable/200801/msg00617.html

I'm using software RAID with atacontrol but the drives are not hot-swap.
Therefore I expect that I need to detach ad0 from the RAID, power down
the unit, replace the drive, power on the unit and rebuild the array in
order to fix things. Trouble is, I'm struggling to find out if this can
be done safely with atacontrol and the hw configuration I have, and if
so, how best to do it?
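
The sequence described above might look roughly like the following. This
is a sketch under assumptions, not a verified procedure: it assumes the
array is still ar0 and that the replacement drive appears as ad0 again
after the swap, and the function name rebuild_plan is made up. It only
prints the commands so they can be reviewed first; nothing is run against
the array. As far as I can tell, "atacontrol detach" works on a whole
channel (e.g. ata0), which here would also take ad1 with it, so the sketch
leaves that step out and relies on the power-down instead.

```shell
# Print (do not run) a candidate command sequence for replacing a failed
# member of an ataraid mirror.
# Arguments: array name (e.g. ar0), replacement disk name (e.g. ad0).
rebuild_plan() {
    array="$1"
    disk="$2"
    cat <<EOF
# 1. Confirm which member is marked as failed before touching anything.
atacontrol status ${array}
# 2. Power down, swap the failed drive, then boot again.
shutdown -p now
# 3. Check that the new drive is detected and note its device name.
atacontrol list
# 4. Add the new drive to the array as a spare and start the rebuild.
atacontrol addspare ${array} ${disk}
atacontrol rebuild ${array}
# 5. Repeat until the array no longer reports DEGRADED.
atacontrol status ${array}
EOF
}

rebuild_plan ar0 ad0
```

Taking a fresh backup to ad1 before step 2 would be a sensible precaution,
since the mirror has no redundancy while it is degraded.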

It may well be a case of RTFM (again) but I just wanted to run this by
the community to get some feedback. Losing data is not an option here
so hopefully I can get the machine back up on its feet soon.

Any help or feedback much appreciated.
Thanks,
--  Matt