Date: Tue, 10 Apr 2007 13:14:26 -0500
From: "Rick C. Petty" <rick-freebsd@kiwi-computer.com>
To: freebsd-geom@freebsd.org
Subject: Re: volume management
Message-ID: <20070410181426.GB21036@keira.kiwi-computer.com>
In-Reply-To: <461BCF8A.3030307@freebsd.org>
References: <evfqtt$n23$1@sea.gmane.org> <20070410111957.GA85578@garage.freebsd.pl> <461B75B2.40201@fer.hr> <20070410114115.GB85578@garage.freebsd.pl> <20070410161445.GA18858@keira.kiwi-computer.com> <20070410162129.GI85578@garage.freebsd.pl> <20070410172604.GA21036@keira.kiwi-computer.com> <461BCC85.2080900@freebsd.org> <20070410174607.GA26432@harmless.hu> <461BCF8A.3030307@freebsd.org>
On Tue, Apr 10, 2007 at 12:55:22PM -0500, Eric Anderson wrote:

> Personally, what I would want to prevent, is having a server go down due
> to one file system having an issue, when it is serving (or using) many
> more file systems.

Exactly my point.  The whole machine isn't hosed, just the N file systems
that use that GEOM provider.

> What I want is a blast to my logs, the
> erroneous file system to be evicted from further damage (mount read-only
> and marked as dirty) and trickle an i/o error to any processes trying to
> write to it.  Even unmounting it would be ok, but that gets nasty with
> NFS servers and other things.

This is why I suggested propagating the failure down to the GEOM consumers
of the bad provider: either disallow writes (which I don't think is a GEOM
option) or remove the device completely, so that the affected file systems
get unmounted, and so on.

I pointed out that this already seems to be the case in gvinum when a disk
is dropped: gvinum noticed the device failure and marked everything that
depended on it as stale, and the only problem I had was a mounted stripe.
90% of the time I was able to kill all user processes that were reading
from or writing to the bad stripe, bring the disk back up, and force a
remount of the file system (roughly the sequence sketched in the P.S.
below).  Both the UFS and the GEOM code are buggy here: sometimes I would
get a panic and have to fsck (and resync) terabytes of disks, but often,
if I waited long enough after killing the user processes, everything else
timed out and I was able to remount the file system and continue.

I was never claiming that the UFS subsystem is robust enough to handle all
of these failures, only that the GEOM layer should do what it can to keep
the box up, and that we should teach UFS and the other file systems to
handle these failures better.  A panic is simply not a pretty option.  One
GEOM provider should not be arrogant enough to declare that the box is no
longer usable at all.  That's like engineering an automobile that locks
the steering wheel, the doors, and the windows the moment it notices a
flat tire, preventing the driver from pulling over safely and preventing
the passengers from getting out.

--
Rick C. Petty
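P.S. For the curious, here is a rough sketch of the recovery sequence I
described above.  The volume name (stripe0) and mount point (/mnt) are
made up, so substitute your own; and whether the remount succeeds without
a panic is exactly the buggy part I mentioned, so no promises.

    # Find the processes still holding files open on the dead stripe,
    # then kill them (substitute the PIDs that fstat reports).
    fstat -f /mnt
    kill -9 <pids>

    # With the disk physically back, check which gvinum objects went
    # stale, then restart the volume to revive the stale subdisks.
    gvinum list
    gvinum start stripe0

    # Force the remount: unmount, check, and mount again.
    umount -f /mnt
    fsck -t ufs -y /dev/gvinum/stripe0
    mount /dev/gvinum/stripe0 /mnt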