From owner-freebsd-fs Tue Aug  6 10:32:24 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02217 for fs-outgoing; Tue, 6 Aug 1996 10:32:24 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02212 for ; Tue, 6 Aug 1996 10:32:22 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA13564; Tue, 6 Aug 1996 10:28:47 -0700
From: Terry Lambert
Message-Id: <199608061728.KAA13564@phaeton.artisoft.com>
Subject: Re: NFS Diskless Dispare...
To: dfr@render.com (Doug Rabson)
Date: Tue, 6 Aug 1996 10:28:47 -0700 (MST)
Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org
In-Reply-To: from "Doug Rabson" at Aug 6, 96 04:50:33 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> [moved to freebsd-fs]
>
> On Mon, 5 Aug 1996, Terry Lambert wrote:
>
> > What I'm suggesting is that there needs to be both a VFS_VGET and
> > a VFS_VPUT (or VFS_VRELE).  With the additional per-fs release
> > mechanism, each FS instance can allocate an inode pool at its
> > instantiation (or do it on a per instance basis, the current
> > method, which makes inode allocation so slow...).
>
> Not really sure how this would work for filesystems without a flat
> namespace?  VFS_VGET is not supported for msdosfs, cd9660, nfs and
> probably others.

Conceptually, it's pretty trivial to support; it's not supported
because the stacking is not correctly implemented for these FS's.
Look at the /sys/miscfs/nullfs use of VOP_VGET.

> Wait a minute.  The VOP_LOCK is not there just for vclean to work.
> If you took it out, a lot of the VOPs in ufs would break due to
> unexpected reentry.
> The VOP_LOCK is there to ensure that operations which modify the
> vnode are properly sequenced even if the process has to sleep during
> the operation.  That's why the vn_lock would be called.

The VOP_LOCK is a transparent veto/allow interface in that case, but
that doesn't mean a counting reference isn't held by PID (like it had
to be).

The actual Lite2 routine for the "actual lock" is called lockmgr()
and lives in kern_lock.c in the Lite2 sources.  Lite2 already moves
in this direction -- it just hasn't gone far enough.

> > The vnode locking could then be done in common code:
> >
> >	vn_lock( vp, flags, p)
> >	struct vnode	*vp;
> >	int		flags;
> >	struct proc	*p;
> >	{
> >		int	st;
> >
> >		/* actual lock*/
> >		if( ( st = ...) == SUCCESS) {
> >			if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) {
> >				/* lock was vetoed, undo actual lock*/
> >				...
> >			}
> >		}
> >		return( st);
> >	}
> >
> > The point here is that the lock contention (if any) can be resolved
> > without ever hitting the FS itself in the failure case.

> You can't do this for NFS.  If you use exclusive locks in NFS and a
> server dies, you can easily end up holding onto a lock for the root
> vnode until the server reboots.  To make it work for NFS, you would
> have to make the lock interruptable, which forces you to fix code
> which does not check the error return from VOP_LOCK all over the
> place.

This is one of the "flags" fields, and it only applies to the NFS
client code.

Actually, since the NFS node is not transiently destroyed as a result
of a server reboot (statelessness *is* a win, no matter what the RFS
advocates would have you believe), there isn't a problem with holding
the reference.

One of the things Sun recommends is not making the mounts on mount
points in the root directory, to avoid exactly this scenario (it
really doesn't matter in the diskless/dataless case, since you will
hang on swap or page-in from image-file-as-swap-store anyway).
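As a userland sketch of the veto scheme (everything here -- the
flag-based "actual lock", the veto_lock member, the EVETO value -- is
illustrative, not the actual 4.4BSD interfaces): the common code
takes the actual lock first, then gives the per-FS veto entry a
chance to refuse, undoing the actual lock when it is vetoed.

```c
#include <assert.h>
#include <stddef.h>

#define SUCCESS	0
#define EVETO	1

/* Toy model of a vnode: the "actual lock" is just a flag here. */
struct vnode {
	int	locked;				/* actual lock state */
	int	(*veto_lock)(struct vnode *);	/* per-FS veto; NULL = allow */
};

/*
 * Common-code vn_lock(): take the actual lock, then let the FS veto.
 * On veto, the actual lock is undone before the error is returned,
 * so contention never reaches the FS in the failure case.
 */
int
vn_lock(struct vnode *vp)
{
	int st;

	vp->locked = 1;				/* actual lock */
	if (vp->veto_lock != NULL &&
	    (st = vp->veto_lock(vp)) != SUCCESS) {
		vp->locked = 0;			/* lock was vetoed; undo */
		return (st);
	}
	return (SUCCESS);
}

/* An FS whose veto entry always refuses the lock. */
static int
always_veto(struct vnode *vp)
{
	(void)vp;
	return (EVETO);
}
```

The NULL veto entry is the common case: the common code never calls
into the FS at all, which is the point of the interface.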
The root does not need to be locked for the node lookup for the root
for a covering node in any case; this is an error in the "node x
covers node y" handling in the lookup code.  You can see that the
lookup code documents a race where it frees and relocks the parent
node to avoid exactly this scenario, actually.

A lock on the parent does not need to be held in the NFS lookup case
for the mount point traversal.  I believe this is an error in the
current code.

The issue is more interesting in the client case; a reference is not
a lock, per se, it's an increment of the reference count.  The server
holds the lock mid path traversal.  This is resolved by setting the
"interruptable" flag on the vn_lock into the underlying FS on the
server.

The easiest way to think of this is in terms of provider interfaces
and consumer interfaces.  There are many FS provider interfaces.  The
FS consumer interfaces are the syscall layer (the vfs_subr.c) and the
NFS client.  This goes hand in hand with the discussion we had about
the VOP_READDIR interface needing to be split into "get
buffer/reference buffer element" (remember the conversation about
killing off the cookie interface about a year ago?!?!).

> I hope we are not talking at cross purposes.  We are talking about
> the vnode lock, not the advisory record locking, aren't we?

Yes.  The VOP_ADVLOCK is also (ideally) a veto interface.  This
allows lock contention from several processes on the same client to
be resolved locally without hitting the wire, and gives a one-client
pseudo-flock that works without fully implementing the NFS locking
code.  This is really irrelevant to the VOP_LOCK code, which deals
with asserting the lock only in the exception cases.  In the NFS
client case, the VOP_LOCK and VOP_ADVLOCK are non-null.

I didn't show the sleep interface in the vn_lock in the failure case.
The sleep puts a loop around the "actual lock" code, so the sleep
occurs above, at the higher code level.
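To make the local-resolution point for VOP_ADVLOCK concrete, here is
a toy one-client advisory lock table (the table layout, names, and
return conventions are all invented for illustration): a conflicting
request from another local process is vetoed out of the local table
without ever counting as a wire call.

```c
#include <assert.h>

#define MAXLOCKS	8

/* Toy advisory lock record: a byte range held by a local process. */
struct advlock {
	long	start, end;
	int	pid;
};

static struct advlock	table[MAXLOCKS];
static int		nlocks;
static int		wire_calls;	/* times we would have hit the wire */

/*
 * Returns 0 on success, -1 when vetoed by a conflicting local lock.
 * Only a successful local claim would go on to talk to the server
 * (modeled here by bumping wire_calls, not by any real I/O).
 */
int
advlock_acquire(int pid, long start, long end)
{
	int i;

	for (i = 0; i < nlocks; i++) {
		if (table[i].pid != pid &&
		    start <= table[i].end && table[i].start <= end)
			return (-1);	/* local veto: never hits the wire */
	}
	table[nlocks].start = start;
	table[nlocks].end = end;
	table[nlocks].pid = pid;
	nlocks++;
	wire_calls++;			/* only now would we ask the server */
	return (0);
}
```

Two local processes fighting over a range cost one wire interaction,
not two; that's the whole benefit of resolving contention locally.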
Intermediate locks on per-layer vnodes (if any are truly needed; see
below) are automatically wound and unwound for retry in the blocking
case.

In the NFS case, the lock is asserted to the underlying FS, and the
sleep target is returned to the top of the loop by the FS layer where
the contention occurred (basically, a vnodep is returned in the
!= SUCCESS case (SUCCESS == 0); this is used as the sleep target).
If a lock in the NFS server code fails, and it fails for the UFS lock
case for the underlying FS, then it should sleep on the UFS vnode
being unlocked.

The veto interface actually implies a couple of semantic changes; the
real implementation would probably be as a NULL lock entry to allow
the routine to not be called at all, saving the vnode_if parameter
list deconstruction/reconstruction.  This allows the substitution of
a chaining interface for a file system stacking layer.

Now you are probably asking "but how can this work when an
intermediate non-NULL layer fans out or in from multiple vnodes?".

The union FS case is one of the most interesting cases for this,
since what you want to do is conditionally assert a lock on two or
more underlying FS's, either of which could have NULL or non-NULL
veto code.

The reason it is interesting is stack operand collapse in a stacking
instance.  I could have the following simple case:

	(syscalls or NFS or AFP or SMB or NetWare kernel server)

	consumer vn_lock
	 | ^				| ^
	 v |				v |
	quota layer			quota VOP_LOCK (NULL)
	 | ^				| ^
	 v |				v |
	uid mapping layer		uid VOP_LOCK (NULL)
	 | ^				| ^
	 v |				v |
	FFS				FFS VOP_LOCK (NULL)

Really, you want to collapse NULL layer entries.  But since the stack
could be reentered from the top, how can you do this without
endangering the locking of terminal nodes based on intermediate
nodes?

It turns out that the function collapse for the VOP_LOCK's in this
case is NULL; but say we replace FFS with the NFS client, where the
last layer is non-NULL?
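The retry loop could be modeled in userland like this (the contended
counter stands in for real contention, and the tsleep() on the
returned target appears only as a comment; none of these names are
the actual kernel code): a failed lower-layer lock hands back the
vnode to sleep on, and the caller retries from the top of the loop,
so wind/unwind happens above the FS.

```c
#include <assert.h>
#include <stddef.h>

#define SUCCESS	0

/* Toy vnode: "contended" is how many attempts will fail first. */
struct vnode {
	int	contended;
	int	attempts;
};

/*
 * Lower-layer lock attempt: sets *st to SUCCESS and returns NULL on
 * success; otherwise sets *st != SUCCESS and returns the vnode to
 * sleep on (the sleep target from the layer where contention hit).
 */
static struct vnode *
layer_lock(struct vnode *vp, int *st)
{
	vp->attempts++;
	if (vp->contended > 0) {
		vp->contended--;
		*st = 1;		/* != SUCCESS */
		return (vp);		/* sleep target */
	}
	*st = SUCCESS;
	return (NULL);
}

int
vn_lock_retry(struct vnode *vp)
{
	struct vnode *target;
	int st;

	for (;;) {
		target = layer_lock(vp, &st);
		if (st == SUCCESS)
			return (SUCCESS);
		/* would tsleep(target, ...) here, then retry from the top */
		(void)target;
	}
}
```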
We would want to collapse to the NFS VOP_LOCK call, since the
intermediate chainings are NULL, but the terminal chaining is not.

Similar collapse could remove the uid mapping layer's VOP_LOOKUP,
leaving the quota VOP_LOOKUP (which has to be there to hide the quota
file and protect it) followed by the FFS VOP_LOOKUP.  The call-down
chain is abbreviated.

This is a general win in the veto interface cases.  The only place
you are required to propagate is the non-NULL cases, and the non-NULL
case will only occur when a fan-out or fan-in of vnodes occurs
between layers.

Currently collapse is not implemented.  Part of the support for
collapse without full kernel recompilation on VOP addition was the
0->1 FS instance count changes to the vfs_init.c code and the
addition of the structure sizing field in the vnode_if.c generation
in my big patch set (where the generated vnode_if.c had the size of
the vfs_op_descs structure computed in the vnode_if.c file).

The change did not simply allow the transition from 0->N loadable
FS's (part of the necessary work for discardable fallback drivers for
the FS, assuming kernel paging at some point in the future), and it
did not just allow you to add VFS OPS to the vnode_if without having
to recompile all FS modules and LKM's (its stated intent).  The
change also allows (with the inclusion of a structure sort, since the
init causes a structure copy anyway to get it into a stack
instantiation) the simplification of the vnode_if call to eliminate
the intermediate function call stub: a necessary step towards call
graph collapse.

You want this so that if you have 10 FS layers in a stack, you only
have to call one or two veto functions out of the 10... and if they
are all NULL, the one is synthetic anyway.

This is a big win in reducing the current code duplication, which you
want to do not only to reduce code size, but to make FS's more
robust.  The common behaviours of FS's *should* be implemented in
common code.
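A minimal model of that collapse (hypothetical names; the real work
would happen in the vnode_if glue when the stack is instantiated):
keep only the non-NULL veto entries, so a 10-layer stack with one
real veto routine costs one call, not ten.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLAYERS	16

/* Toy veto entry point for one layer; NULL means "allow, don't call". */
typedef int (*vetop_t)(void);

/* Stands in for a non-NULL terminal layer, e.g. the NFS client. */
static int
nfs_veto(void)
{
	return (0);
}

/*
 * Build the collapsed call chain for a stack of n layers: copy only
 * the non-NULL entries into out[].  Returns the number of live
 * entries, i.e. how many calls a VOP_LOCK actually costs.
 */
int
collapse(vetop_t *stack, int n, vetop_t *out)
{
	int i, m = 0;

	for (i = 0; i < n; i++)
		if (stack[i] != NULL)
			out[m++] = stack[i];
	return (m);
}
```

Since calling the collapsed chain is equivalent to calling the full
chain (the NULL entries contribute nothing), the collapse is safe to
do once at stack instantiation time.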
The Lite2 code recognizes this at the VOP_LOCK level in a primitive
fashion by introducing the lockmgr() call, but the model is not
uniformly applied, and deadly-embrace or two-caller starvation
deadlocks can still occur in the Lite2 model.

Going to the next step, a veto model, both increases the code
robustness considerably and resolves the state wind/unwind problems
inherent in fan out.  The fan out problem is *the* problem with the
unionfs at this point.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.