From: Doug Ambrisko
To: Mike Tancsa
Cc: freebsd-stable@freebsd.org
Date: Fri, 13 Jan 2006 10:26:23 -0800 (PST)
Subject: Re: 6.0 on Dell 1850 with PERC4e/DC RAID?
Message-Id: <200601131826.k0DIQNeY088341@ambrisko.com>
In-Reply-To: <6.2.3.4.0.20060113125258.045378d8@64.7.153.2>

Mike Tancsa writes:
| At 11:59 AM 13/01/2006, Doug Ambrisko wrote:
| >|
| >| That's lame.  Under what condition does it happen, do you know?
| >
| >Running RAID 10, a drive was swapped and the rebuild started on the
| >replacement drive.
| >The rebuild complained about the source drive
| >for the mirror rebuild having read errors that couldn't be recovered.
| >It continued on and finished re-creating the mirror.  Then the RAID
| >proceeded onto a background init, which they normally do, and started
| >failing that and restarting the background init over and over again.
| >The box changed the RAID from degraded to optimal when the rebuild
| >completed (with errors).  Doing a dd of the entire RAID logical device
| >returned an error at the bad sector since it couldn't recover that.
| >The RAID controller reported an I/O error and still left the RAID as
| >optimal.
| >
| >We reported this and were told that's the way it is designed :-(
|
| Interesting timing as I ran into this sort of situation on the
| weekend on a 3ware drive in RAID1.  The card had complained for a week
| about read errors on drive 1.  We thought we would wait until the
| weekend maintenance window to swap it out.  Sadly, before that
| window, drive zero totally died a horrible death.  We popped in a new
| drive on port zero, started the rebuild, and it crapped out saying
| there was a read error on drive 1.  However, there is a check box
| that says continue the build, even with errors on the source drive.

With Adaptec we used to do a verify of each disk before a swap to
increase our chances of a successful disk swap.  Adaptec was a little
heavy-handed: if you are running on the last disk of the mirror and
it has a read error, it will fail the drive.  If you have a RAID 10 then
you lose 1/2 the file system :-(  I'd rather just get the read error
back to the OS than lose the entire drive.

| This setup seems to give you the best of both worlds.  We did a quick
| check of the resultant files compared to backups and only a couple
| were toasted.  (The box is going to be retired in a month, so if there
| is other hidden fs corruption, if it holds out for another 3 weeks we
| don't care too much).
| The correct approach would be to do a total
| restore of course, but this was good enough for us in this
| situation.  I guess the question is, is this RAID1 in a proper mirror
| given that there are hard errors on the drive on port 1?

That sounds like a good controller, assuming it says the RAID is still
degraded and not optimal.  I assume "optimal" means everything is
fine and it is safe to read the entire volume.

Doug A.
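
The full-volume read mentioned in the thread (a dd of the entire RAID
logical device) can be approximated with a short script that reports
where the unreadable blocks are instead of stopping at the first one.
This is only a rough sketch and not something from the thread: the
function name scan_for_read_errors, the 64 KiB block size, and the
injectable pread parameter are assumptions made for illustration.  It
also assumes a Unix-like system and that the volume size can be found
by seeking to the end of the device, which may not hold for every
device node.

```python
import os


def scan_for_read_errors(path, block_size=64 * 1024, pread=os.pread):
    """Read `path` from start to end and return (bytes_read, bad_offsets).

    bad_offsets lists the starting byte offset of every block whose read
    raised OSError (for example EIO from an unrecoverable sector).
    `pread` is injectable only so the error path can be exercised
    without a failing disk; it is a hypothetical convenience.
    """
    bad_offsets = []
    bytes_read = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        # Assumption: seeking to the end reports the volume size.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                chunk = pread(fd, want, offset)
            except OSError:
                # Record the failing block and skip past it, so one bad
                # sector does not hide later ones.
                bad_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                # Unexpected early end of data; stop rather than spin.
                break
            bytes_read += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return bytes_read, bad_offsets
```

Calling scan_for_read_errors() on the logical device node (run as root,
with whatever path your system uses for the volume) and getting a
non-empty bad_offsets list would mean the volume is not actually safe
to read end to end, whatever state the controller reports.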