From: Tom Judge
Date: Sat, 25 Aug 2007 12:05:05 +0100
To: Tom Samplonius
Cc: Artem Kuchin, freebsd-stable
Subject: Re: A little story of failed raid5 (3ware 8000 series)
Message-ID: <46D00CE1.9@tomjudge.com>
In-Reply-To: <9812134.411188026402612.JavaMail.root@ly.sdf.com>
List-Id: Production branch of FreeBSD source code

Tom Samplonius wrote:
> ----- "Artem Kuchin" wrote:
...
>> But i don't understand how and why it happened. ONly 6 hours ago (a
>> night before) all those files were backed up fine w/o any read
>> error. And now, right after replacing the driver and starting
>> rebuild it said that there are bad sectors all over those file. How
>> come?
>
> What happened to you was an extremely common occurrence.
> You had a disk develop a media failure sometime ago, but the
> controller never detected it, because that particular bad area was
> not read. Your backups worked because they never touched this
> portion of the disk (ex. empty space, meta data, etc). And then
> another drive developed a electronics failure, which is instantly
> detected, putting the array into a degraded mode. When you did a
> rebuild onto a replace drive, the controller discovered that there
> was a second failed disk, and this is unrecoverable.

3ware controllers can recover from this situation; all you need to do
is tell the controller not to verify the source data. This is a little
dangerous, but it has saved me in the past, when 1 drive died in a
RAID 10 array and 2 of the 3 remaining drives had surface defects. The
trick was to replace the drives one at a time and rebuild without data
verification. After 10 painful hours the array was rebuilt without any
noticeable data corruption.

>
> RAID, of any level, isn't magic. It is important to understand how
> it works, an realize that drives can passive fail. BTW, if you were
> using RAID1 or RAID10, you would likely have had the same problem
> (well, RAID10 can survive _some_ double-disk failures). RAID6 is the
> only RAID level that can survive failure of any two disks.

This is not entirely true. RAID 1 can survive multiple disk failures:
it has the storage capacity of 1 spindle and can tolerate the failure
of N-1 spindles, where N is the number of spindles in the mirror set.
The same is partly true of RAID 10: the more spindles in each mirror
set, the better your chance of surviving multiple failures in the
array (say, 6 disks arranged as two 3-disk mirror sets striped
together).

>
> The real solution is RAID scrubbing: a low level background process
> that reads every sector of every disk. All of the real RAID systems
> do this (usually scheduled weekly, or every other week). Most 3ware
> RAID card don't have this feature.
>
> So rather than not using RAID5 or RAID6 again, you should just not
> use 3ware anymore.

If you use the 3dm2 management interface you can schedule verify and
rebuild tasks to run on a regular basis. I think that 7500 series
controllers can do this; 9500s and 9550s definitely can.

We have 50+ systems using 3ware cards (7500-9550, 4 and 8 channel
models) with 200+ spindles in use (no hot spares, unfortunately), and
drives in that pool fail on average around once a month. We have only
ever had trouble recovering from failed drives on 7500 series
controllers that had been in production for a reasonably long time.

I don't think that you are justified in slagging off 3ware
controllers.

Tom
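
As a rough illustration of the RAID 1 / RAID 10 point made in the reply above, here is a minimal Python sketch that enumerates every way k disks can fail and counts how many of those combinations leave the array readable. It is not from the original thread: the disk numbering, the function names, and the assumption that every disk is equally likely to fail are all made up for this example, and it ignores rebuild time and latent media errors.

```python
from itertools import combinations

def survives(failed, mirror_sets):
    """A striped set of mirrors stays readable as long as every
    mirror set still has at least one working disk."""
    failed = set(failed)
    return all(not set(ms) <= failed for ms in mirror_sets)

def survival_count(mirror_sets, k):
    """Return (surviving, total) over all ways k disks can fail.
    mirror_sets is a list of tuples of hypothetical disk ids."""
    disks = [d for ms in mirror_sets for d in ms]
    combos = list(combinations(disks, k))
    ok = sum(survives(c, mirror_sets) for c in combos)
    return ok, len(combos)

# Hypothetical layouts for 6 disks.
two_by_three = [(0, 1, 2), (3, 4, 5)]    # two 3-disk mirror sets, striped
three_by_two = [(0, 1), (2, 3), (4, 5)]  # three 2-disk mirror sets, striped
raid1_three = [(0, 1, 2)]                # a single 3-way mirror (RAID 1)

if __name__ == "__main__":
    for name, layout in [("2 x 3-disk mirrors", two_by_three),
                         ("3 x 2-disk mirrors", three_by_two),
                         ("3-way RAID 1", raid1_three)]:
        n = sum(len(ms) for ms in layout)
        for k in range(1, n + 1):
            ok, total = survival_count(layout, k)
            print(f"{name}: {k} failed -> {ok}/{total} combinations survive")
```

Under these assumptions, the two 3-disk mirror sets survive all 15 possible double failures and 18 of the 20 triple failures, while three 2-disk mirror sets survive 12 of 15 double failures. A single N-way mirror survives any N-1 failures, which matches the claim above.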