From owner-svn-src-head@FreeBSD.ORG Wed Jun 1 12:14:09 2011 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2E4881065670; Wed, 1 Jun 2011 12:14:09 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au [211.29.132.182]) by mx1.freebsd.org (Postfix) with ESMTP id 9E91A8FC21; Wed, 1 Jun 2011 12:14:08 +0000 (UTC) Received: from c122-106-165-191.carlnfd1.nsw.optusnet.com.au (c122-106-165-191.carlnfd1.nsw.optusnet.com.au [122.106.165.191]) by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id p51CE4vr031529 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 1 Jun 2011 22:14:05 +1000 Date: Wed, 1 Jun 2011 22:14:04 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Rick Macklem In-Reply-To: <1326411567.44367.1306872929534.JavaMail.root@erie.cs.uoguelph.ca> Message-ID: <20110601210748.R1599@besplex.bde.org> References: <1326411567.44367.1306872929534.JavaMail.root@erie.cs.uoguelph.ca> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, svn-src-head@FreeBSD.org, Robert Watson , Bruce Evans , Kostik Belousov , Rick Macklem Subject: Re: svn commit: r222466 - head/sbin/umount X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Jun 2011 12:14:09 -0000 On Tue, 31 May 2011, Rick Macklem wrote: >> It is only an optimization. Any number of syncs are useless for >> actually syncing the system, since sync(2) only does an async sync (it >> returns without waiting for most writes to complete). As you pointed >> out later in this thread, unmount(2) does a sync that works -- a sync >> sync -- before doing the unmount proper. It does this irrespective of >> the force flag: >> >> % if (((mp->mnt_flag & MNT_RDONLY) || >> % (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != 0) >> % error = VFS_UNMOUNT(mp, flags); >> >> The force flag doesn't mean that i/o's are forced to complete. It >> only means that open files are forced to be closed (and a few related >> things). This can be seen here. We wait (potentially forever) for >> any existing writes to complete. Only then do we look at the force >> flag and bale out if the sync failed and the force flag is _not_ set. >> Most i/o problems will cause hangs in this sync, and the force flag >> won't help. But if we somehow get past this sync, and have i/o The sync should actually return an error (after syncing everything that doesn't hang). Perhaps it actually does. Then we have got past the sync. But for nfs, it will tend to hang just detecting the i/o errors (need large timeouts which are not very different from hangs). >> problems, then honoring the force flag and continuing gives even more >> fragility, since we have almost passed the point of being able to give >> up after an error (this point is typically very early in >> VFS_UNMOUNT()). >> > Well, in vgonel(), which is called from vflush() and seems to be what > gets rid of the vnode even when open for forced dismounts, there is: > > /* > * Clear out any buffers associated with the vnode. > * If the flush fails, just toss the buffers. > */ > - followed shortly by > if (vinvalbuf(vp, V_SAVE, 0, 0) != 0) > vinvalbuf(vp, 0, 0, 0); > > It seems to me that this code is willing to give up when it can't write > dirty buffers, but??? That looks like a hack to me (collateral from vfs_bio.c 1.196). In 4.4BSD-Lite, the corresponding code is in vclean() and it is just: % /* % * Clean out any buffers associated with the vnode. % */ % if (flags & DOCLOSE) % vinvalbuf(vp, V_SAVE, NOCRED, p, 0, 0); In 4.4BSD, the buffers would have been tossed at the bwrite() level before we got here. Now bwrite() is more careful, and we are a bit more careful here -- we actually check if vinvalbuf() succeeded and clean up if it didn't. There must be a place that tosses buffers with unrecoverable write errors, and forced unmounts is a good place. But vgonel() is a bad place, since it has a much wider scope than forced unmounts. I think vgonel() is normally called when a vnode is recycled. So we can have a vnode unhappily holding buffers with unrecoverable write errors for several minutes, with syncs trying to flush them every ~30 seconds, but then when the vnode is recycled for some reason the above forcibly tosses its buffers with no further error messages. Similarly for recoverable write errors (maybe an nfs server that is down for longer than the recycling time). > To me, a forced dismount of an NFS mount that just wedges on an > unresponsive server isn't very useful. The only case I can think I agree. My point is that no one tried hard to make forced unmounts very forceful, so making it work is even harder than might first appear. The problem is related to making removed removable devices work properly. > of where a forced dismount that isn't willing to throw away dirty > buffers might be useful would be when an app. (userland process) > is for some reason unkillable and is sitting with an open NFS file. > Is this a problem in practice? (I do know that a mount stuck on an > unresponsive server is a real problem. There have been several PRs, > the most recent being kern/157365.) The server can easily crash while the client has dirty buffers. > I brought up the concept of a "hard forced dismount" at the developer > summit a year ago, but no one seemed to see a difference between that > and what they believed a forced dismount already did. After that, I > thought about it a little and couldn't see much use for a "soft forced > dismount that fails instead of throwing away dirty buffers, but does > close down open files". It's useful to be able to forcibly close all open files, even ones open for writing or with dirty buffers (there can also be dirty buffers for closed files). I think umount -f flag gives this, and I wouldn't want to to also discard dirty buffers on devices where the buffers are hopefully-transiently unwritable. Discarding of unwritable buffers really has little to do with unmount -- if there is 1 with an i/o error that I know is unrecoverable, then I would like to discard it immediately instead of letting the system sit there and try to flush it quite often. > - For example, suppose a userland process is writing a file through > the stdio library: > A - fopen()s a file for writing > B - does fwrite()s of small amounts of data > C - does an fflush() > > If the open is closed by a forced dismount before C, the written data > will be lost, since it isn't even in the kernel yet. If the file is on > a responsive NFS server, it will be written to the server shortly after C. > So, there seems to be a narrow window around C where the data will be lost > for a forced dismount that throws away dirty buffers vs one that doesn't > "but only if the server is responsive". > > In other words, "Is there much difference between ripping the open files > out from under a app. before it does a write syscall vs throwing the data > away when it's in the kernel buffer but not yet written to an NFS server?" Not much, sinc a careful application should do an fsync() as well as the above and not depend on the data reaching the file system until fsync() returns successfully, which it must not do if the kernel discarded the data. [But the tossing in vgonel() seems to break this: most write()s are async so they can't return an error. Errors live in the buffers. But when buffers are tossed, don't we lose all trace of the errors? Then a subsequent fsync won't see any errors and will return success.] > Anyhow, the way I now have the NFS clients, they do a "hard forced dismount" > where they can throw away dirty buffers not yet written to the server. > Recent patches to nfs_sync() make it return without attempting writes for > the forced dismount case and the vgonel() { see above code snippet } throws > away the data when the first vinvalbuf(.., V_SAVE,.) fails. Did you have to change VFS_SYNC() to pass the forced-dismount flag? I don't like that, but see the problem: VFS_SYNC(XXX_WAIT) should normally just try to sync everything, but return an error if this is impossible, but for an nfs server it would take too long to determine that it is impossible. Bruce