Date:      Tue, 10 Apr 2007 13:14:26 -0500
From:      "Rick C. Petty" <rick-freebsd@kiwi-computer.com>
To:        freebsd-geom@freebsd.org
Subject:   Re: volume management
Message-ID:  <20070410181426.GB21036@keira.kiwi-computer.com>
In-Reply-To: <461BCF8A.3030307@freebsd.org>
References:  <evfqtt$n23$1@sea.gmane.org> <20070410111957.GA85578@garage.freebsd.pl> <461B75B2.40201@fer.hr> <20070410114115.GB85578@garage.freebsd.pl> <20070410161445.GA18858@keira.kiwi-computer.com> <20070410162129.GI85578@garage.freebsd.pl> <20070410172604.GA21036@keira.kiwi-computer.com> <461BCC85.2080900@freebsd.org> <20070410174607.GA26432@harmless.hu> <461BCF8A.3030307@freebsd.org>

On Tue, Apr 10, 2007 at 12:55:22PM -0500, Eric Anderson wrote:
> 
> Personally, what I would want to prevent is having a server go down
> due to one file system having an issue when it is serving (or using)
> many more file systems.

Exactly my point.  The whole machine isn't hosed, just the N file
systems that use that GEOM provider.

> What I want is a blast to my logs, the erroneous file system evicted
> from further damage (mounted read-only and marked as dirty), and an
> I/O error trickled to any processes trying to write to it.  Even
> unmounting it would be OK, but that gets nasty with NFS servers and
> other things.

This is why I suggested propagating the error down to the GEOM
consumers of the bad provider, either by disallowing writes (which I
don't think is a GEOM option) or by removing the device completely...
the file systems should be unmounted, etc.
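
For what it's worth, here's a rough sketch of what I mean against the
current GEOM API.  g_error_provider() and g_orphan_provider() are real
calls; the helper itself and its "fatal" policy knob are made up for
illustration:

	/*
	 * Illustrative sketch only.  A class that detects an
	 * unrecoverable failure on one of its providers could fail
	 * further I/O, or detach its consumers, instead of panicking.
	 */
	#include <sys/param.h>
	#include <sys/errno.h>
	#include <geom/geom.h>

	/* Hypothetical helper; not part of GEOM. */
	static void
	example_give_up_on_provider(struct g_provider *pp, int fatal)
	{
		g_topology_assert();
		if (!fatal) {
			/*
			 * Fail all further I/O on this provider with
			 * EIO.  Note this rejects reads as well as
			 * writes; I don't know of a write-only knob.
			 */
			g_error_provider(pp, EIO);
		} else {
			/*
			 * Declare the provider gone: each consumer's
			 * orphan() method runs, and the classes (and
			 * eventually the file systems) above can
			 * detach or unmount.  Only the affected file
			 * systems are lost, not the whole box.
			 */
			g_orphan_provider(pp, ENXIO);
		}
	}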

I pointed out that this already seems to be the case with gvinum when
a disk is dropped...  gvinum noticed the device failure and marked all
dependencies as stale; the only problem I had was a mounted stripe.
90% of the time I was able to kill all user processes that were
reading from or writing to the bad stripe, bring the disk back up, and
force a remount of the filesystem.  Both the UFS and GEOM code are
buggy here-- sometimes I would get a panic and have to fsck (and
resync) terabytes of disks, but often, if I waited long enough after
killing the user processes, everything else timed out and I was able
to remount the filesystem and continue.

I never claimed that the UFS subsystem is robust enough to handle all
of these failures, only that the GEOM layer should do what it can to
keep the box up, and that we should teach UFS and the other
filesystems to handle these failures better.
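
To make that concrete, the behavior Eric describes could live in a
filesystem's write-completion path.  This is purely hypothetical--
none of these hooks exist in UFS today-- but it shows the policy: on a
failed write, log it, downgrade the mount to read-only, and let EIO
trickle up to the writer instead of panicking:

	/*
	 * Hypothetical sketch, not current UFS code: evict a file
	 * system to read-only on a write error instead of taking the
	 * box down.  Leaving the buffer's error intact trickles EIO
	 * up to the writing process.
	 */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/bio.h>
	#include <sys/buf.h>
	#include <sys/mount.h>
	#include <sys/syslog.h>

	static void
	example_handle_write_error(struct mount *mp, struct buf *bp)
	{
		if ((bp->b_ioflags & BIO_ERROR) == 0 ||
		    bp->b_iocmd != BIO_WRITE)
			return;
		log(LOG_CRIT, "%s: write error %d, going read-only\n",
		    mp->mnt_stat.f_mntonname, bp->b_error);
		/*
		 * Hand-waving: a real downgrade must drain pending
		 * writes, clear the superblock's clean flag so fsck
		 * runs later, and cope with softupdates.  The point
		 * is only that the damage stays confined to this one
		 * mount.
		 */
		mp->mnt_flag |= MNT_RDONLY;
	}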

But a panic is just not a pretty option.  One GEOM provider should not
be arrogant enough to declare that the box is no longer usable at all.
That's like engineering an automobile that locks up the steering wheel
and locks the doors and windows after noticing that one tire has gone
flat-- preventing the driver from attempting to pull over safely and
preventing any passengers from exiting the vehicle.

-- Rick C. Petty


