Date: Mon, 13 Dec 2004 11:21:20 -0800
From: Joe Rhett
To: Doug White
Cc: freebsd-stable@freebsd.org, Søren Schmidt
Subject: Re: drive failure during rebuild causes page fault
Message-ID: <20041213192119.GB4781@meer.net>
In-Reply-To: <20041213102333.V92964@carver.gumbysoft.com>
Organization: Meer.net LLC

> > On Sun, Dec 12, 2004 at 09:59:16PM -0800, Doug White wrote:
> > > That's a nice shotgun you have there.
>
> On Sun, 12 Dec 2004, Joe Rhett wrote:
> > Yessir. And that's what testing is designed to uncover. The question is
> > why this works, and how do we prevent it?

On Mon, Dec 13, 2004 at 10:28:53AM -0800, Doug White wrote:
> I'm sure Soren appreciates you donating your feet to the cause :)

That's what sandbox feet are for ;-)

> Why it works: the system assumes the administrator is competent enough to
> not yank a disk that is being rebuilt to.

Yes, I and most others are. But that's a bad assumption. The issue is
fairly simple -- what occurs if the disk goes offline because of a hardware
failure? For example, that SATA interface starts having problems. We
replace the drive, assuming it is the drive. The rebuild starts, and the
interface dies again. Bam! There goes the system.

Not good. Or perhaps it's a DOA drive and it fails during the rebuild?

> > Is there a proper way to handle these sorts of events? If so, where is it
> > documented?
> >
> > And FYI, just pulling the drives causes the same failure, so that means
> > that RAID1 buys you nothing because your system will also crash.
>
> This is why I don't trust ATA RAID for fault tolerance -- it'll save your
> data, but the system will tank. Since the disk state is maintained by
> the OS and not abstracted by a separate processor, if a disk dies in a
> particularly bad way the system may not be able to cope.

Yes, but SATA isn't limited by this problem. It does have a processor per
disk. (This is all SATA, if I didn't make that clear.)

-- 
Joe Rhett
Senior Geek
Meer.net