From: Matthew Lear
To: freebsd-stable@freebsd.org
Date: Fri, 18 Jun 2010 08:08:24 +0100
Message-ID: <1276844904.7519.19.camel@almscliff.bubblegen.co.uk>
Subject: 7.2-RELEASE-p4, IO errors & RAID1 failure

Hi there,

I'm running 7.2-RELEASE-p4 on an i386 HP server (ML G5) in RAID1
configuration. Very recently, I've seen IO errors such as:

ad0: TIMEOUT - READ_DMA retrying (1 retry left) LBA=20472527

reported and the RAID mirror is now offline.

ad0: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=395032335
ad0: FAILURE - WRITE_DMA48 status=51 error=10 LBA=395032335
ar0: WARNING - mirror protection lost. RAID1 array in DEGRADED mode

Strangely, I've run some SMART tests on the device and no errors were
recorded. Health checks pass. Running a long test on the device doesn't
show any problem. While SMART can be manufacturer-specific, I at least
expected to see something that looked suspicious.

The drives in the RAID are on two separate ATA channels:

[root@meshuga /home/matt]# atacontrol list
ATA channel 0:
    Master:  ad0 SATA revision 2.x
    Slave:   ad1 SATA revision 1.x
ATA channel 1:
    Master:  ad2 SATA revision 2.x
    Slave:       no device present
ATA channel 2:
    Master: acd0 SATA revision 1.x
    Slave:       no device present
ATA channel 3:
    Master:      no device present
    Slave:       no device present

ad1 is a third 160G drive that I periodically back up to using cron.

I've seen the thread below, which seems similar to what I'm
experiencing, but I'm not using ZFS.

http://freebsd.monkey.org/freebsd-stable/200801/msg00617.html

I'm using software RAID with atacontrol but the drives are not hot-swap.
Therefore I expect that I need to detach ad0 from the RAID, power down
the unit, replace the drive, power on the unit and rebuild the array in
order to fix things.

Trouble is, I'm struggling to find out whether this can be done safely
with atacontrol and the hardware configuration I have, and if so, how
best to do it.

It may well be a case of RTFM (again) but I just wanted to run this by
the community to get some feedback. Losing data is not an option here,
so hopefully I can get the machine back on its feet soon.

Any help or feedback much appreciated.

Thanks,
-- 
Matt
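
The replace-and-rebuild sequence described above could be sketched roughly as follows. This is a dry-run outline, not a tested procedure: it only prints the commands. It assumes the array is ar0 and that the replacement disk shows up as ad0 again after the power cycle, neither of which is confirmed for this machine. The helper name plan_rebuild is hypothetical. The status, addspare and rebuild subcommands are taken from atacontrol(8). Note that atacontrol detach operates on a whole channel (e.g. ata0) rather than a single disk, and channel 0 here also carries ad1, so the sketch leaves it out and relies on the power-down instead.

```shell
#!/bin/sh
# Dry-run sketch: prints the atacontrol commands instead of running them.
# ARRAY and DISK are assumptions; override them from the environment.
ARRAY="${ARRAY:-ar0}"
DISK="${DISK:-ad0}"

# plan_rebuild ARRAY DISK  (hypothetical helper)
# Lists the steps to run after the failed disk has been physically
# replaced with the machine powered off and then booted again.
plan_rebuild() {
    echo "atacontrol status $1"        # confirm the array is still DEGRADED
    echo "atacontrol addspare $1 $2"   # offer the new disk to the array
    echo "atacontrol rebuild $1"       # start copying from the good disk
    echo "atacontrol status $1"        # repeat to watch rebuild progress
}

plan_rebuild "$ARRAY" "$DISK"
```

Once the printed plan has been checked against atacontrol list on the rebooted machine, the commands can be run by hand one at a time, with a current backup on ad1 taken beforehand.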