From owner-freebsd-current@FreeBSD.ORG  Mon Jun 29 19:06:09 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4CDB710656D8;
	Mon, 29 Jun 2009 19:06:09 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx08.syd.optusnet.com.au
	(fallbackmx08.syd.optusnet.com.au [211.29.132.10])
	by mx1.freebsd.org (Postfix) with ESMTP id 9A49C8FC38;
	Mon, 29 Jun 2009 19:06:08 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au
	[211.29.132.186])
	by fallbackmx08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	n5THrCej027527; Tue, 30 Jun 2009 03:53:12 +1000
Received: from c122-107-127-32.carlnfd1.nsw.optusnet.com.au
	(c122-107-127-32.carlnfd1.nsw.optusnet.com.au [122.107.127.32])
	by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	n5THr8k7025295
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 30 Jun 2009 03:53:09 +1000
Date: Tue, 30 Jun 2009 03:53:09 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Rick Macklem <rmacklem@uoguelph.ca>
In-Reply-To: <Pine.GSO.4.63.0906291018050.3493@muncher.cs.uoguelph.ca>
Message-ID: <20090630010035.E37426@delplex.bde.org>
References: <Pine.GSO.4.63.0906281955160.5084@muncher.cs.uoguelph.ca>
	<3bbf2fe10906290256x4bfbe263jccef017a557f9410@mail.gmail.com>
	<Pine.GSO.4.63.0906291018050.3493@muncher.cs.uoguelph.ca>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Mailman-Approved-At: Mon, 29 Jun 2009 19:21:47 +0000
Cc: Attilio Rao <attilio@freebsd.org>, freebsd-fs@freebsd.org,
	freebsd-current@freebsd.org
Subject: Re: umount -f implementation
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Jun 2009 19:06:09 -0000

On Mon, 29 Jun 2009, Rick Macklem wrote:

> On Mon, 29 Jun 2009, Attilio Rao wrote:
>> 
>> While that should be real in principle (immediate shutdown of the fs
>> operation and unmounting of the partition) it is totally impossible to
>> have it completely unsleeping, so it can happen that also umount -f
>> sleeps / delays for some times (example: vflush).
>> Currently, umount -f is one of the most complicated thing to handle in
>> our VFS because it puts as requirement that vnodes can be reclaimed in
>> any moment, adding complexity and possibility for races.
>> 
> Yes, agreed. And I like to leave that stuff to more clever chaps than I:-)
>
>> What's the fix for your problem?
>> 
> Well, when I tested it I found that it got stuck in two places, both
> calls to VFS_SYNC(). The first was a
> 	sync();
> right at the beginning of umount.c.
> - All I did for that one is move it to after the code that handles
>  option processing and change it to
> 	if ((fflag & MNT_FORCE) == 0)
> 		sync();
>  so that it isn't done for the "-f" case. (I believe the sync(); call
>  at the beginning of umount is only a performance optimization, so I
>  don't think not doing it for "-f" should break anything.)

OK.  This sync() is probably actually a performance pessimization, since
it syncs all file systems while the internal sync in umount(2) only
syncs the one being unmounted.

> - the second happened just before the VFS_UNMOUNT() call in the
>  umount(2) system call. The code looks like:
> 	if (((mp->mnt_flag & MNT_RDONLY) ||
> 	     (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != 
> 0)
>  - Although it was tempting to reverse the order of VFS_SYNC() and the
>    test for MNT_FORCE, I thought that might have a negative impact on
>    other file systems, since it avoided doing the VFS_SYNC(), so...
>
>  - Instead, I just put a check for MNTK_UNMOUNTF at the beginning of
>    nfs_sync(), so that it returns EBUSY for this case instead of getting
>    stuck trying to flush().

OK.  This sync is probably an optimization for correctness, since it
arranges to do as much as possible without forcing.

I checked ffs_mount() and found 2 large bugs, one related:
- in the only case that tends to cause problems, namely the non-readonly
   case, ffs_unmount() does a suspend which calls VOP_SYNC(..., MNT_SUSPEND),
   but after errors from this sync it checks neither MNT_FORCE nor
   error == ENXIO.  I think the usual effect is the same as if the top-level
   unmount() didn't check MNT_FORCE after suspend failure: in problematic
   cases, we have an unrecoverable write, due to the device going away or
   just an i/o error, and this error has probably already occured (only
   in rare cases will it be triggered by unmount).  Then MNT_FORCE is
   essentially unused, and the ENXIO hack is not reached either, and
   unmount usually fails.
- the UFS_EXTATTR case destroys infrastructure before committing to
   succeeding.  It used to be just broken on failure.  Now it uses a
   hack to recover (call a constructor) on failure, but the recovery
   code is not reached in the usual case of failure -- when the
   suspension fails.

ffs_unmount() still seems to have no support for handling unrecoverable
write errors (short of you converting them to ENXIO by removing the
media).  MNT_FORCE only meant FORCECLOSE for it.  I see that old nfs
was similar, and you are now making MNT_FORCE stronger.  I thought
that umount(8)'s man page documented -f being strongly forceful, but
checking it shows that it only documents a weak force like that of
FORCECLOSE (but not precisely enough).  Perhaps a different flag should
be used for strong forcefulness.  Weak forcefulness is still useful and
used for mount -f -u -- for remount we would never want errors in the
file system itself ignored.  This use also shows that the generic
FORCECLOSE code must not ignore errors.

> Assuming that I'm right w.r.t. the "sync();" at the beginning of umount.c,
> it simply ensures that the umount command thread makes it as far as
> VFS_UNMOUNT()->nfs_unmount(), so that the forced dismount proceeds. It
> kills RPCs in progress before doing the vflush() and, since no new RPCs
> can be done once MNTK_UNMOUNTF is set (it is checked at the beginning of
> a request), the vflush() won't actually flush anything to the server.
>
> As such, "umount -f" is pretty well guaranteed to throw away the dirty
> buffers. I believe this is correct behaviour,

This is how I think ffs_mount() should work too -- It should be
responsible for throwing away the dirty buffers, while nothing else
should discard them.  Now the discarding seems to be done by falling
through to g_vfs_done(), except g_vfs_done() is not reached in most
cases (see above).  I don't like this -- at best we lose the opportunity
to print ffs-specific details about what was discarded.  Falling through
only works for ENXIO anyway -- on other errors we should discard the
unwritable buffers in an fs-specific manner so as to write as many of
the writable buffers as possible.

> but it would mean that a
> user/sysadmin that uses "umount -f" for cases where the server is still
> functioning, but slow, will lose data when they probably don't expect to.

A new flag would help for this.

Bruce