From owner-freebsd-geom@FreeBSD.ORG Tue Apr 10 18:14:27 2007
Date: Tue, 10 Apr 2007 13:14:26 -0500
From: "Rick C. Petty" <rick@kiwi-computer.com>
To: freebsd-geom@freebsd.org
Subject: Re: volume management
Reply-To: rick-freebsd@kiwi-computer.com
In-Reply-To: <461BCF8A.3030307@freebsd.org>

On Tue, Apr 10, 2007 at 12:55:22PM -0500, Eric Anderson wrote:
>
> Personally, what I would want to prevent, is having a server go down due
> to one file system having an issue, when it is serving (or using) many
> more file systems.

Exactly my point.  The whole machine isn't hosed, just the N file systems
that use that GEOM provider.

> What I want is a blast to my logs, the erroneous file system to be
> evicted from further damage (mount read-only and marked as dirty) and
> trickle an i/o error to any processes trying to write to it.  Even
> unmounting it would be ok, but that gets nasty with NFS servers and
> other things.

This is why I suggested propagating the failure down to the GEOM consumers
of the bad provider, either disallowing writes (which I don't think GEOM
offers as an option) or removing the device completely, so that the
affected file systems get unmounted, etc.  I pointed out that this already
seems to be the case with gvinum when a disk is dropped: gvinum noticed
the device failure and marked all of its dependencies stale, and the only
problem I had was with a mounted stripe.  Ninety percent of the time I was
able to kill all user processes that were reading from or writing to the
bad stripe, bring the disk back up, and force a remount of the filesystem.
Both the UFS and GEOM code are buggy here: sometimes I would get a panic
and have to fsck (and resync) terabytes of disks, but often, if I waited
long enough after killing the user processes, everything else timed out
and I was able to remount the filesystem and continue.

I was never claiming that the UFS subsystem is robust enough to handle all
of these failures, only that the GEOM layer should do what it can to keep
the box up, and that we should teach UFS and the other filesystems to
handle these failures better.  But a panic is just not a pretty option.
One GEOM provider should not be arrogant enough to declare that the box is
no longer usable at all.
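To make that concrete, here is a rough userland sketch in plain C.  It has
nothing to do with the actual GEOM KPI; the structures and names are made
up purely for illustration.  The behavior I'm arguing for: when one
provider fails, only the filesystems consuming it get forced read-only and
see EIO on further writes, and everything else keeps running.

/*
 * Toy userland sketch, NOT the GEOM KPI: when one provider fails,
 * push the failure to its consumers only (force their filesystems
 * read-only and fail further writes with EIO) instead of taking
 * the whole box down.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>

struct toy_fs {
        const char      *mountpoint;
        int              forced_ro;     /* forced read-only after a failure */
};

struct toy_consumer {
        struct toy_fs   *fs;            /* filesystem sitting on top */
};

struct toy_provider {
        const char              *name;
        struct toy_consumer     *consumers[8];
        int                      nconsumers;
};

/* Propagate a provider failure to its own consumers, and nothing else. */
static void
provider_fail(struct toy_provider *pp)
{
        int i;

        for (i = 0; i < pp->nconsumers; i++) {
                struct toy_fs *fs = pp->consumers[i]->fs;

                fs->forced_ro = 1;
                printf("%s: provider %s failed, forcing read-only\n",
                    fs->mountpoint, pp->name);
        }
}

/* A write attempt; filesystems on healthy providers are unaffected. */
static int
fs_write(struct toy_fs *fs)
{
        if (fs->forced_ro) {
                errno = EIO;
                return (-1);
        }
        return (0);
}

int
main(void)
{
        struct toy_fs fs_a = { "/vol/a", 0 };
        struct toy_fs fs_b = { "/vol/b", 0 };
        struct toy_consumer ca = { &fs_a };
        struct toy_provider bad = { "da1", { &ca }, 1 };

        provider_fail(&bad);            /* only /vol/a is affected */

        if (fs_write(&fs_a) == -1)
                printf("%s: write error: %s\n", fs_a.mountpoint,
                    strerror(errno));
        if (fs_write(&fs_b) == 0)
                printf("%s: still writable\n", fs_b.mountpoint);
        return (0);
}

In the real stack this obviously has to happen in the kernel, through
GEOM's own error handling and the filesystems' mount code; the sketch is
only meant to show the damage staying scoped to one provider's consumers.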
Panicking the whole box is like engineering an automobile that locks up
the steering wheel and locks the doors and windows the moment it notices
one tire has gone flat, preventing the driver from attempting to pull over
safely and preventing any passengers from exiting the vehicle.

-- 
Rick C. Petty