Date: Thu, 8 Aug 1996 18:47:44 +0100 (BST)
From: Doug Rabson <dfr@render.com>
To: Terry Lambert <terry@lambert.org>
Cc: michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org
Subject: Re: NFS Diskless Dispare...
Message-ID: <Pine.BSI.3.95.960808173810.10082U-100000@minnow.render.com>
In-Reply-To: <199608061728.KAA13564@phaeton.artisoft.com>

On Tue, 6 Aug 1996, Terry Lambert wrote:
> > [moved to freebsd-fs]
> >
> > On Mon, 5 Aug 1996, Terry Lambert wrote:
> >
> > > What I'm suggesting is that there needs to be both a VFS_VGET and
> > > a VFS_VPUT (or VFS_VRELE). With the additional per fs release
> > > mechanism, each FS instance can allocate an inode pool at its
> > > instantiation (or do it on a per instance basis, the current
> > > method which makes inode allocation so slow...).
> >
> > Not really sure how this would work for filesystems without a flat
> > namespace? VFS_VGET is not supported for msdosfs, cd9660, nfs and
> > probably others.
>
>
> Conceptually, it's pretty trivial to support; it's not supported
> because the stacking is not correctly implemented for these FS's.
> Look at the /sys/miscfs/nullfs use of VOP_VGET.
VFS_VGET is not implemented in NFS because the concept just doesn't apply.
VFS_VGET is only relevant for local filesystems. NFS does have a flat
namespace in terms of filehandles but not one which you could squeeze into
the VFS_VGET interface.
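To make the mismatch concrete, here is roughly the shape of the two
interfaces (the VFS_VGET prototype is from memory of the 4.4BSD-style
vfs ops; the filehandle structure is purely illustrative and not the
real nfs headers):

	/*
	 * VFS_VGET keys the lookup on a small per-fs scalar (an inode
	 * number), which presumes a flat, locally indexed namespace
	 * such as UFS inode numbers.
	 */
	int VFS_VGET(struct mount *mp, ino_t ino, struct vnode **vpp);

	/*
	 * An NFS client names a file by an opaque handle issued by the
	 * server; there is no ino_t-sized key to feed through the
	 * interface above.  (Illustrative only.)
	 */
	struct nfs_fh_sketch {
		u_char	fh_bytes[32];		/* opaque to the client */
	};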
>
> > Wait a minute. The VOP_LOCK is not there just for vclean to work. If you
> > took it out, a lot of the VOPs in ufs would break due to unexpected
> > reentry. The VOP_LOCK is there to ensure that operations which modify the
> > vnode are properly sequenced even if the process has to sleep during the
> > operation.
>
> That's why the vn_lock would be called. The VOP_LOCK is a transparent
> veto/allow interface in that case, but that doesn't mean a counting
> reference isn't held by PID (like it had to be). The actual Lite2
> routine for "actual lock" is called lockmgr() and lives in kern_lock.c
> in the Lite2 sources. Lite2 already moves in this direction -- it just
> hasn't gone far enough.
>
>
> > > The vnode locking could then be done in common code:
> > >
> > >
> > >	vn_lock( vp, flags, p)
> > >	struct vnode *vp;
> > >	int flags;
> > >	struct proc *p;
> > >	{
> > >		int st;
> > >
> > >		/* actual lock*/
> > >		if( ( st = ...) == SUCCESS) {
> > >			if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) {
> > >				/* lock was vetoed, undo actual lock*/
> > >				...
> > >			}
> > >		}
> > >		return( st);
> > >	}
> > >
> > >
> > > The point here is that the lock contention (if any) can be resolved
> > > without ever hitting the FS itself in the failure case.
> > >
> >
> > You can't do this for NFS. If you use exclusive locks in NFS and a
> > server dies, you easily can end up holding onto a lock for the root vnode
> > until the server reboots. To make it work for NFS, you would have to make
> > the lock interruptable which forces you to fix code which does not check
> > the error return from VOP_LOCK all over the place.
>
> This is one of the "flags" fields, and it only applies to the NFS client
> code. Actually, since the NFSnode is not transiently destroyed as a
> result of server reboot (statelessness *is* a win, no matter what the
> RFS advocates would have you believe), there isn't a problem with holding
> the reference.
So the NFS code would degrade the exclusive lock back to a shared lock?
Hmm. I don't think that would work since you can't get the exclusive lock
until all the shared lockers release their locks.
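As a rough sketch of what "interruptible" would force onto callers
(LK_INTR_SKETCH is a made-up flag name and the argument shapes follow
the Lite2-style interface under discussion, so treat this as
illustration only):

	int
	example_caller(vp, p)
		struct vnode *vp;
		struct proc *p;
	{
		int error;

		/*
		 * With an interruptible lock, a signal (or an 'intr'
		 * NFS mount giving up on a dead server) can fail the
		 * lock, so the caller can no longer assume success.
		 */
		if ((error = vn_lock(vp, LK_EXCLUSIVE | LK_INTR_SKETCH, p)) != 0)
			return (error);		/* must unwind, not fall through */

		/* ... operate on the locked vnode ... */

		VOP_UNLOCK(vp, 0, p);
		return (0);
	}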
>
> One of the things Sun recommends is not making the mounts on mount
> points in the root directory; to avoid exactly this scenario (it really
> doesn't matter in the diskless/dataless case, since you will hang on
> swap or page-in from image-file-as-swap-store anyway).
It doesn't matter if they are on mount points in root. If a lock is stuck
in a sub-filesystem, then the 'sticking' can propagate across the mount
point.
>
> The root does not need to be locked for the node lookup for the root
> for a covering node in any case; this is an error in the "node x covers
> node y" case in the lookup case. You can see that the lookup code
> documents a race where it frees and relocks the parent node to avoid
> exactly this scenario, actually. A lock does not need to be held
> in the lookup for the parent in the NFS lookup case for the mount
> point traversal. I believe this is an error in the current code.
Have to think about this some more. Are you saying that when lookup is
crossing a mountpoint, it does not need any locks in the parent
filesystem?
>
>
> The issue is more interesting in the client case; a reference is not
> a lock, per se, it's an increment of the reference count. The server
> holds the lock mid path traversal.
>
> This is resolved by setting the "interruptable" flag on the vn_lock
> into the underlying FS on the server.
>
>
> The easiest way to think of this is in terms of provider interfaces
> and consumer interfaces. There are many FS provider interfaces. The
> FS consumer interfaces are the syscall layer (the vfs_subr.c) and the
> NFS client. This goes hand in hand with the discussion we had about
^^^^^^
Do you mean NFS server here?
> the VOP_READDIR interface needing to be split into "get buffer/reference
> buffer element" (remember the conversation about killing off the cookie
> interface about a year ago?!?!).
I remember that. I think I ended up agreeing with you about it. The
details are a bit vague...
> [advlock digression ...]
>
> In the NFS client case, the VOP_LOCK and VOP_ADVLOCK are non-null. I
> didn't show the sleep interface in the vn_lock in the case of the
> failure. The sleep puts a loop around the "actual lock" code so a
> sleep occurs above, at the higher code level. Intermediate locks
> on per layer vnodes (if any are truly needed; see below) are
> automatically wound and unwound for retry in the blocking case.
>
>
> In the NFS case, the lock is asserted to the underlying FS, and the sleep
> target is returned to the top of the loop by the FS layer where the
> contention occurred (basically, a vnodep is returned in the != SUCCESS
> case (SUCCESS == 0); this is used as the sleep target).
>
> If a lock in the NFS server code fails, and it fails for the UFS lock
> case for the underlying FS, then it should sleep on the UFS vnode
> being unlocked.
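Reading that as code, the retry loop might look something like the
following (entirely a sketch of the proposed veto interface;
fs_veto_lock() and the sleep-target return do not exist anywhere
today):

	int
	vn_lock_sketch(vp, flags, p)
		struct vnode *vp;
		int flags;
		struct proc *p;
	{
		struct vnode *sleep_vp;
		int error;

		for (;;) {
			/* per-fs veto: grants the lock or names the contended vnode */
			error = fs_veto_lock(vp, flags, p, &sleep_vp);
			if (error == 0)
				return (0);		/* lock granted */
			if (sleep_vp == NULL)
				return (error);		/* hard failure, give up */
			/* sleep on the contended vnode, then retry from the top */
			(void) tsleep((caddr_t)sleep_vp, PINOD, "vetolk", 0);
		}
	}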
>
> The veto interface actually implies a couple of semantic changes; the
> real implementation would probably be as a NULL lock entry to allow
> the routine to not be called at all, saving the vnode_if parameter
> list deconstruction/reconstruction.
>
> This allows the substitution of a chaining interface for a file system
> stacking layer.
>
> Now you are probably asking "but how can this work when an intermediate
> non-NULL layer fans out or in from multiple vnodes?".
>
>
> The union FS case is one of the most interesting cases for this, since
> what you want to do is conditionally assert a lock on two or more
> underlying FS's, either of which could have NULL or non-NULL veto code.
> The reason it is interesting is stack operand collapse in a stacking
> instance.
>
> I could have the following simple case:
>
>
> (syscalls or NFS or AFP or SMB or NetWare kernel server)
>
>   consumer               vn_lock
>      |  ^                   |  ^
>      v  |                   v  |
>   quota layer             quota VOP_LOCK (NULL)
>      |  ^                   |  ^
>      v  |                   v  |
>   uid mapping layer       uid VOP_LOCK (NULL)
>      |  ^                   |  ^
>      v  |                   v  |
>   FFS                     FFS VOP_LOCK (NULL)
>
> Really, you want to collapse NULL layer entries. But since the stack
> could be reentered from the top, how can you do this without endangering
> the locking of terminal nodes based on intermediate nodes?
>
> It turns out that the function collapse for the VOP_LOCK's in this
> case is NULL; but say we replace FFS with the NFS client, where the
> last layer is non-NULL?
>
> We would want to collapse to the NFS VOP_LOCK call, since the
> intermediate chainings are NULL, but the terminal chaining is not.
> Similar collapse could remove the uid mapping layer's VOP_LOOKUP,
> leaving the quota VOP_LOOKUP (which has to be there to hide the
> quota file and protect it) followed by the FFS VOP_LOOKUP. The
> call-down chain is abbreviated. This is a general win in the veto
> interface cases. The only place you are required to propagate is
> the non-NULL cases, and the non-NULL case will only occur when a
> fan-out or fan-in of vnodes occurs between layers.
>
> Currently collapse is not implemented. Part of the support for
> collapse without full kernel recompilation on VOP addition was the
> 0->1 FS instance count changes to the vfs_init.c code and the
> addition of the structure sizing field in the vnode_if.c generation
> in my big patch set (where the vnode_if.c generated had the structure
> vfs_op_descs size computed in the vnode_if.c file). The change did
> not simply allow the transition from 0->N loadable FS's (part of
> the necessary work for discardable fallback drivers for the FS,
> assuming kernel paging at some point in the future), and it did not
> just allow you to add VFS OPS to the vnode_if without having to
> recompile all FS modules and LKM's (its stated intent). The change
> also allows (with the inclusion of a structure sort, since the init
> causes a structure copy anyway to get it into a stack instantiation)
> the simplification of the vnode_if call to eliminate the intermediate
> function call stub: a necessary step towards call graph collapse. You
> want this so that if you have 10 FS layers in a stack, you only have
> to call one or two veto functions out of the 10... and if they are
> all NULL, the one is synthetic anyway.
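One mechanical reading of "collapse" (hypothetical names throughout;
nothing like this exists in the tree) is that stack assembly records
only the non-NULL veto entries, so the common vn_lock makes at most one
or two indirect calls no matter how deep the stack is:

	typedef int (*veto_fn_sketch)();	/* hypothetical veto entry type */

	/*
	 * Keep only the non-NULL lock-veto entries, in top-down order.
	 * For the quota/uid-map/FFS picture above the result is empty;
	 * swap FFS for the NFS client and it holds a single entry.
	 */
	int
	collapse_lock_chain(vetoes, nlayers, chain)
		veto_fn_sketch *vetoes;		/* one entry per layer, may be NULL */
		int nlayers;
		veto_fn_sketch *chain;		/* out: collapsed call list */
	{
		int i, n = 0;

		for (i = 0; i < nlayers; i++)
			if (vetoes[i] != NULL)
				chain[n++] = vetoes[i];
		return (n);
	}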
This is interesting. It is similar to the internal driver architecture we
use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D).
The driver is split up into different modules depending on functionality.
The consumer (Direct3D) has a stack which it pushes driver modules onto
for all the required functionality. This used to be useful for
reconfiguring the stack at runtime to select different rendering
algorithms etc. Direct3D broke that unfortunately but that is another
story.
It communicates with the drivers by sending service calls to the top
driver in the stack. Each service call has a well defined number. If
that module understands the service, it implements it and returns a
result. Otherwise, it passes the service call down to the next driver in
the stack. Some modules override service calls in lower layers and they
typically do their own work and then pass the service onto the next layer
in the stack.
To optimise the system, we added a service call table in the stack head.
When a module is pushed onto the stack, it is called to 'bid' some of its
services into the service call table. Each module in turn going up the
stack puts a function pointer into the table for each of the services it
wants to implement. If it is overriding a lower module, it just
overwrites the pointer.
If you add service calls, nothing needs to be recompiled (as long as the
service call table is large enough) because the new services just go after
the existing ones.
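In C, the bidding scheme is roughly this shape (names invented for the
illustration; the real driver interface obviously differs):

	#define MAX_SERVICES	256	/* table just has to be big enough for growth */

	typedef int (*service_fn)(void *ctx, void *args);

	struct stack_head {
		service_fn	services[MAX_SERVICES];	/* NULL = not implemented */
	};

	/*
	 * Called as each module is pushed: the module bids a handler for
	 * every service it implements, overwriting whatever a lower
	 * module installed for that slot.
	 */
	void
	bid_service(struct stack_head *head, int service, service_fn fn)
	{
		if (service >= 0 && service < MAX_SERVICES)
			head->services[service] = fn;
	}

	/*
	 * Dispatch is one indirect call; new service numbers are simply
	 * appended, so existing modules never need recompiling while the
	 * table has room.
	 */
	int
	call_service(struct stack_head *head, int service, void *ctx, void *args)
	{
		if (service < 0 || service >= MAX_SERVICES ||
		    head->services[service] == NULL)
			return (-1);	/* no module in the stack handles it */
		return ((*head->services[service])(ctx, args));
	}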
>
>
> This is a big win in reducing the current code duplication, which you
> want to do not only to reduce code size, but to make FS's more robust.
> The common behaviours of FS's *should* be implemented in common code.
Agreed. The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and
the general locking/releasing protocol, IMHO.
>
> The Lite2 code recognizes this at the VOP_LOCK level in a primitive
> fashion by introducing the lockmgr() call, but since the model is not
> uniformly applied, deadly-embrace or two-caller starvation deadlocks
> can still occur in the Lite2 model. Going to the next step, a veto
> model, both increases the code robustness considerably and resolves
> the state wind/unwind problems inherent in fan out. The
> fan out problem is *the* problem with the unionfs, at this point.
Well at the moment, I think we have to just grit our teeth and merge in
the lite2 code as it stands. We have to at least try to converge with the
other strains of 4.4, if only to try and share the load of maintaining the
filesystem code. I strongly believe that there should be a consensus
between the different 4.4 groups over FS development or we just end up
with chaos.
--
Doug Rabson, Microsoft RenderMorphics Ltd. Mail: dfr@render.com
Phone: +44 171 734 3761
FAX: +44 171 734 6426
