From owner-freebsd-current@FreeBSD.ORG Mon Jun 29 19:06:09 2009 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4CDB710656D8; Mon, 29 Jun 2009 19:06:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx08.syd.optusnet.com.au (fallbackmx08.syd.optusnet.com.au [211.29.132.10]) by mx1.freebsd.org (Postfix) with ESMTP id 9A49C8FC38; Mon, 29 Jun 2009 19:06:08 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by fallbackmx08.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id n5THrCej027527; Tue, 30 Jun 2009 03:53:12 +1000 Received: from c122-107-127-32.carlnfd1.nsw.optusnet.com.au (c122-107-127-32.carlnfd1.nsw.optusnet.com.au [122.107.127.32]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id n5THr8k7025295 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 30 Jun 2009 03:53:09 +1000 Date: Tue, 30 Jun 2009 03:53:09 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Rick Macklem In-Reply-To: Message-ID: <20090630010035.E37426@delplex.bde.org> References: <3bbf2fe10906290256x4bfbe263jccef017a557f9410@mail.gmail.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Mailman-Approved-At: Mon, 29 Jun 2009 19:21:47 +0000 Cc: Attilio Rao , freebsd-fs@freebsd.org, freebsd-current@freebsd.org Subject: Re: umount -f implementation X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Jun 2009 19:06:09 -0000 On Mon, 29 Jun 2009, Rick Macklem wrote: > On Mon, 29 Jun 2009, Attilio Rao wrote: >> >> While that should be real in principle (immediate shutdown of the fs >> operation and unmounting of the partition) it is totally impossible to >> have it completely unsleeping, so it can happen that also umount -f >> sleeps / delays for some times (example: vflush). >> Currently, umount -f is one of the most complicated thing to handle in >> our VFS because it puts as requirement that vnodes can be reclaimed in >> any moment, adding complexity and possibility for races. >> > Yes, agreed. And I like to leave that stuff to more clever chaps than I:-) > >> What's the fix for your problem? >> > Well, when I tested it I found that it got stuck in two places, both > calls to VFS_SYNC(). The first was a > sync(); > right at the beginning of umount.c. > - All I did for that one is move it to after the code that handles > option processing and change it to > if ((fflag & MNT_FORCE) == 0) > sync(); > so that it isn't done for the "-f" case. (I believe the sync(); call > at the beginning of umount is only a performance optimization, so I > don't think not doing it for "-f" should break anything.) OK. This sync() is probably actually a performance pessimization, since it syncs all file systems while the internal sync in umount(2) only syncs the one being unmounted. > - the second happened just before the VFS_UNMOUNT() call in the > umount(2) system call. The code looks like: > if (((mp->mnt_flag & MNT_RDONLY) || > (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != > 0) > - Although it was tempting to reverse the order of VFS_SYNC() and the > test for MNT_FORCE, I thought that might have a negative impact on > other file systems, since it avoided doing the VFS_SYNC(), so... > > - Instead, I just put a check for MNTK_UNMOUNTF at the beginning of > nfs_sync(), so that it returns EBUSY for this case instead of getting > stuck trying to flush(). OK. This sync is probably an optimization for correctness, since it arranges to do as much as possible without forcing. I checked ffs_mount() and found 2 large bugs, one related: - in the only case that tends to cause problems, namely the non-readonly case, ffs_unmount() does a suspend which calls VOP_SYNC(..., MNT_SUSPEND), but after errors from this sync it checks neither MNT_FORCE nor error == ENXIO. I think the usual effect is the same as if the top-level unmount() didn't check MNT_FORCE after suspend failure: in problematic cases, we have an unrecoverable write, due to the device going away or just an i/o error, and this error has probably already occured (only in rare cases will it be triggered by unmount). Then MNT_FORCE is essentially unused, and the ENXIO hack is not reached either, and unmount usually fails. - the UFS_EXTATTR case destroys infrastructure before committing to succeeding. It used to be just broken on failure. Now it uses a hack to recover (call a constructor) on failure, but the recovery code is not reached in the usual case of failure -- when the suspension fails. ffs_unmount() still seems to have no support for handling unrecoverable write errors (short of you converting them to ENXIO by removing the media). MNT_FORCE only meant FORCECLOSE for it. I see that old nfs was similar, and you are now making MNT_FORCE stronger. I thought that umount(8)'s man page documented -f being strongly forceful, but checking it shows that it only documents a weak force like that of FORCECLOSE (but not precisely enough). Perhaps a different flag should be used for strong forcefulness. Weak forcefulness is still useful and used for mount -f -u -- for remount we would never want errors in the file system itself ignored. This use also shows that the generic FORCECLOSE code must not ignore errors. > Assuming that I'm right w.r.t. the "sync();" at the beginning of umount.c, > it simply ensures that the umount command thread makes it as far as > VFS_UNMOUNT()->nfs_unmount(), so that the forced dismount proceeds. It > kills RPCs in progress before doing the vflush() and, since no new RPCs > can be done once MNTK_UNMOUNTF is set (it is checked at the beginning of > a request), the vflush() won't actually flush anything to the server. > > As such, "umount -f" is pretty well guaranteed to throw away the dirty > buffers. I believe this is correct behaviour, This is how I think ffs_mount() should work too -- It should be responsible for throwing away the dirty buffers, while nothing else should discard them. Now the discarding seems to be done by falling through to g_vfs_done(), except g_vfs_done() is not reached in most cases (see above). I don't like this -- at best we lose the opportunity to print ffs-specific details about what was discarded. Falling through only works for ENXIO anyway -- on other errors we should discard the unwritable buffers in an fs-specific manner so as to write as many of the writable buffers as possible. > but it would mean that a > user/sysadmin that uses "umount -f" for cases where the server is still > functioning, but slow, will lose data when they probably don't expect to. A new flag would help for this. Bruce