From owner-freebsd-fs Tue Aug  6 10:32:24 1996
Return-Path: owner-fs
Received: (from root@localhost) by freefall.freebsd.org (8.7.5/8.7.3) id KAA02217 for fs-outgoing; Tue, 6 Aug 1996 10:32:24 -0700 (PDT)
Received: from phaeton.artisoft.com (phaeton.Artisoft.COM [198.17.250.211]) by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id KAA02212 for ; Tue, 6 Aug 1996 10:32:22 -0700 (PDT)
Received: (from terry@localhost) by phaeton.artisoft.com (8.6.11/8.6.9) id KAA13564; Tue, 6 Aug 1996 10:28:47 -0700
From: Terry Lambert
Message-Id: <199608061728.KAA13564@phaeton.artisoft.com>
Subject: Re: NFS Diskless Dispare...
To: dfr@render.com (Doug Rabson)
Date: Tue, 6 Aug 1996 10:28:47 -0700 (MST)
Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org
In-Reply-To: from "Doug Rabson" at Aug 6, 96 04:50:33 pm
X-Mailer: ELM [version 2.4 PL24]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-fs@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

> [moved to freebsd-fs]
>
> On Mon, 5 Aug 1996, Terry Lambert wrote:
>
> > What I'm suggesting is that there needs to be both a VFS_VGET and
> > a VFS_VPUT (or VFS_VRELE).  With the additional per-fs release
> > mechanism, each FS instance can allocate an inode pool at its
> > instantiation (or do it on a per instance basis, the current
> > method, which makes inode allocation so slow...).
>
> Not really sure how this would work for filesystems without a flat
> namespace?  VFS_VGET is not supported for msdosfs, cd9660, nfs and
> probably others.

Conceptually, it's pretty trivial to support; it's not supported
because the stacking is not correctly implemented for these FS's.
Look at the /sys/miscfs/nullfs use of VOP_VGET.

> Wait a minute.  The VOP_LOCK is not there just for vclean to work.
> If you took it out, a lot of the VOPs in ufs would break due to
> unexpected reentry.
> The VOP_LOCK is there to ensure that operations which modify the
> vnode are properly sequenced even if the process has to sleep during
> the operation.  That's why the vn_lock would be called.

The VOP_LOCK is a transparent veto/allow interface in that case, but
that doesn't mean a counting reference isn't held by PID (like it had
to be).

The actual Lite2 routine for the "actual lock" is called lockmgr()
and lives in kern_lock.c in the Lite2 sources.  Lite2 already moves
in this direction -- it just hasn't gone far enough.

> > The vnode locking could then be done in common code:
> >
> >	vn_lock( vp, flags, p)
> >	struct vnode	*vp;
> >	int		flags;
> >	struct proc	*p;
> >	{
> >		int	st;
> >
> >		/* actual lock*/
> >		if( ( st = ...) == SUCCESS) {
> >			if( ( st = VOP_LOCK( vp, flags, p)) != SUCCESS) {
> >				/* lock was vetoed, undo actual lock*/
> >				...
> >			}
> >		}
> >		return( st);
> >	}
> >
> > The point here is that the lock contention (if any) can be resolved
> > without ever hitting the FS itself in the failure case.

> You can't do this for NFS.  If you use exclusive locks in NFS and a
> server dies, you can easily end up holding onto a lock for the root
> vnode until the server reboots.  To make it work for NFS, you would
> have to make the lock interruptable, which forces you to fix code
> which does not check the error return from VOP_LOCK all over the
> place.

This is one of the "flags" fields, and it only applies to the NFS
client code.

Actually, since the NFS node is not transiently destroyed as a result
of a server reboot (statelessness *is* a win, no matter what the RFS
advocates would have you believe), there isn't a problem with holding
the reference.

One of the things Sun recommends is not making the mounts on mount
points in the root directory, to avoid exactly this scenario (it
really doesn't matter in the diskless/dataless case, since you will
hang on swap or page-in from image-file-as-swap-store anyway).
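As a userland sketch of the veto scheme (everything here -- the
flag-based "actual lock", the veto_lock member, the EVETO value -- is
illustrative, not the actual 4.4BSD interfaces): the common code
takes the actual lock first, then gives the per-FS veto entry a
chance to refuse, undoing the actual lock when it is vetoed.

```c
#include <assert.h>
#include <stddef.h>

#define SUCCESS	0
#define EVETO	1

/* Toy model of a vnode: the "actual lock" is just a flag here. */
struct vnode {
	int	locked;				/* actual lock state */
	int	(*veto_lock)(struct vnode *);	/* per-FS veto; NULL = allow */
};

/*
 * Common-code vn_lock(): take the actual lock, then let the FS veto.
 * On veto, the actual lock is undone before the error is returned,
 * so contention never reaches the FS in the failure case.
 */
int
vn_lock(struct vnode *vp)
{
	int st;

	vp->locked = 1;				/* actual lock */
	if (vp->veto_lock != NULL &&
	    (st = vp->veto_lock(vp)) != SUCCESS) {
		vp->locked = 0;			/* lock was vetoed; undo */
		return (st);
	}
	return (SUCCESS);
}

/* An FS whose veto entry always refuses the lock. */
static int
always_veto(struct vnode *vp)
{
	(void)vp;
	return (EVETO);
}
```

The NULL veto entry is the common case: the common code never calls
into the FS at all, which is the point of the interface.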
The root does not need to be locked for the node lookup for the root
for a covering node in any case; this is an error in the "node x
covers node y" handling in the lookup code.  You can see that the
lookup code documents a race where it frees and relocks the parent
node to avoid exactly this scenario, actually.

A lock on the parent does not need to be held in the NFS lookup case
for the mount point traversal.  I believe this is an error in the
current code.

The issue is more interesting in the client case; a reference is not
a lock, per se, it's an increment of the reference count.  The server
holds the lock mid path traversal.  This is resolved by setting the
"interruptable" flag on the vn_lock into the underlying FS on the
server.

The easiest way to think of this is in terms of provider interfaces
and consumer interfaces.  There are many FS provider interfaces.  The
FS consumer interfaces are the syscall layer (the vfs_subr.c) and the
NFS client.  This goes hand in hand with the discussion we had about
the VOP_READDIR interface needing to be split into "get
buffer/reference buffer element" (remember the conversation about
killing off the cookie interface about a year ago?!?!).

> I hope we are not talking at cross purposes.  We are talking about
> the vnode lock, not the advisory record locking, aren't we?

Yes.  The VOP_ADVLOCK is also (ideally) a veto interface.  This
allows lock contention from several processes on the same client to
be resolved locally without hitting the wire, and gives a one-client
pseudo-flock that works without fully implementing the NFS locking
code.  This is really irrelevant to the VOP_LOCK code, which deals
with asserting the lock only in the exception cases.  In the NFS
client case, the VOP_LOCK and VOP_ADVLOCK are non-null.

I didn't show the sleep interface in the vn_lock in the failure case.
The sleep puts a loop around the "actual lock" code, so the sleep
occurs above, at the higher code level.
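To make the local-resolution point for VOP_ADVLOCK concrete, here is
a toy one-client advisory lock table (the table layout, names, and
return conventions are all invented for illustration): a conflicting
request from another local process is vetoed out of the local table
without ever counting as a wire call.

```c
#include <assert.h>

#define MAXLOCKS	8

/* Toy advisory lock record: a byte range held by a local process. */
struct advlock {
	long	start, end;
	int	pid;
};

static struct advlock	table[MAXLOCKS];
static int		nlocks;
static int		wire_calls;	/* times we would have hit the wire */

/*
 * Returns 0 on success, -1 when vetoed by a conflicting local lock.
 * Only a successful local claim would go on to talk to the server
 * (modeled here by bumping wire_calls, not by any real I/O).
 */
int
advlock_acquire(int pid, long start, long end)
{
	int i;

	for (i = 0; i < nlocks; i++) {
		if (table[i].pid != pid &&
		    start <= table[i].end && table[i].start <= end)
			return (-1);	/* local veto: never hits the wire */
	}
	table[nlocks].start = start;
	table[nlocks].end = end;
	table[nlocks].pid = pid;
	nlocks++;
	wire_calls++;			/* only now would we ask the server */
	return (0);
}
```

Two local processes fighting over a range cost one wire interaction,
not two; that's the whole benefit of resolving contention locally.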
Intermediate locks on per-layer vnodes (if any are truly needed; see
below) are automatically wound and unwound for retry in the blocking
case.

In the NFS case, the lock is asserted to the underlying FS, and the
sleep target is returned to the top of the loop by the FS layer where
the contention occurred (basically, a vnodep is returned in the
!= SUCCESS case (SUCCESS == 0); this is used as the sleep target).
If a lock in the NFS server code fails, and it fails for the UFS lock
case for the underlying FS, then it should sleep on the UFS vnode
being unlocked.

The veto interface actually implies a couple of semantic changes; the
real implementation would probably be as a NULL lock entry to allow
the routine to not be called at all, saving the vnode_if parameter
list deconstruction/reconstruction.  This allows the substitution of
a chaining interface for a file system stacking layer.

Now you are probably asking "but how can this work when an
intermediate non-NULL layer fans out or in from multiple vnodes?".

The union FS case is one of the most interesting cases for this,
since what you want to do is conditionally assert a lock on two or
more underlying FS's, either of which could have NULL or non-NULL
veto code.

The reason it is interesting is stack operand collapse in a stacking
instance.  I could have the following simple case:

	(syscalls or NFS or AFP or SMB or NetWare kernel server)

	consumer vn_lock
	 | ^				| ^
	 v |				v |
	quota layer			quota VOP_LOCK (NULL)
	 | ^				| ^
	 v |				v |
	uid mapping layer		uid VOP_LOCK (NULL)
	 | ^				| ^
	 v |				v |
	FFS				FFS VOP_LOCK (NULL)

Really, you want to collapse NULL layer entries.  But since the stack
could be reentered from the top, how can you do this without
endangering the locking of terminal nodes based on intermediate
nodes?

It turns out that the function collapse for the VOP_LOCK's in this
case is NULL; but say we replace FFS with the NFS client, where the
last layer is non-NULL?
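The retry loop could be modeled in userland like this (the contended
counter stands in for real contention, and the tsleep() on the
returned target appears only as a comment; none of these names are
the actual kernel code): a failed lower-layer lock hands back the
vnode to sleep on, and the caller retries from the top of the loop,
so wind/unwind happens above the FS.

```c
#include <assert.h>
#include <stddef.h>

#define SUCCESS	0

/* Toy vnode: "contended" is how many attempts will fail first. */
struct vnode {
	int	contended;
	int	attempts;
};

/*
 * Lower-layer lock attempt: sets *st to SUCCESS and returns NULL on
 * success; otherwise sets *st != SUCCESS and returns the vnode to
 * sleep on (the sleep target from the layer where contention hit).
 */
static struct vnode *
layer_lock(struct vnode *vp, int *st)
{
	vp->attempts++;
	if (vp->contended > 0) {
		vp->contended--;
		*st = 1;		/* != SUCCESS */
		return (vp);		/* sleep target */
	}
	*st = SUCCESS;
	return (NULL);
}

int
vn_lock_retry(struct vnode *vp)
{
	struct vnode *target;
	int st;

	for (;;) {
		target = layer_lock(vp, &st);
		if (st == SUCCESS)
			return (SUCCESS);
		/* would tsleep(target, ...) here, then retry from the top */
		(void)target;
	}
}
```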
We would want to collapse to the NFS VOP_LOCK call, since the
intermediate chainings are NULL, but the terminal chaining is not.

Similar collapse could remove the uid mapping layer's VOP_LOOKUP,
leaving the quota VOP_LOOKUP (which has to be there to hide the quota
file and protect it) followed by the FFS VOP_LOOKUP.  The call-down
chain is abbreviated.

This is a general win in the veto interface cases.  The only place
you are required to propagate is the non-NULL cases, and the non-NULL
case will only occur when a fan-out or fan-in of vnodes occurs
between layers.

Currently collapse is not implemented.  Part of the support for
collapse without full kernel recompilation on VOP addition was the
0->1 FS instance count changes to the vfs_init.c code and the
addition of the structure sizing field in the vnode_if.c generation
in my big patch set (where the generated vnode_if.c had the size of
the vfs_op_descs structure computed in the vnode_if.c file).

The change did not simply allow the transition from 0->N loadable
FS's (part of the necessary work for discardable fallback drivers for
the FS, assuming kernel paging at some point in the future), and it
did not just allow you to add VFS OPS to the vnode_if without having
to recompile all FS modules and LKM's (its stated intent).  The
change also allows (with the inclusion of a structure sort, since the
init causes a structure copy anyway to get it into a stack
instantiation) the simplification of the vnode_if call to eliminate
the intermediate function call stub: a necessary step towards call
graph collapse.

You want this so that if you have 10 FS layers in a stack, you only
have to call one or two veto functions out of the 10... and if they
are all NULL, the one is synthetic anyway.

This is a big win in reducing the current code duplication, which you
want to do not only to reduce code size, but to make FS's more
robust.  The common behaviours of FS's *should* be implemented in
common code.
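A minimal model of that collapse (hypothetical names; the real work
would happen in the vnode_if glue when the stack is instantiated):
keep only the non-NULL veto entries, so a 10-layer stack with one
real veto routine costs one call, not ten.

```c
#include <assert.h>
#include <stddef.h>

#define MAXLAYERS	16

/* Toy veto entry point for one layer; NULL means "allow, don't call". */
typedef int (*vetop_t)(void);

/* Stands in for a non-NULL terminal layer, e.g. the NFS client. */
static int
nfs_veto(void)
{
	return (0);
}

/*
 * Build the collapsed call chain for a stack of n layers: copy only
 * the non-NULL entries into out[].  Returns the number of live
 * entries, i.e. how many calls a VOP_LOCK actually costs.
 */
int
collapse(vetop_t *stack, int n, vetop_t *out)
{
	int i, m = 0;

	for (i = 0; i < n; i++)
		if (stack[i] != NULL)
			out[m++] = stack[i];
	return (m);
}
```

Since calling the collapsed chain is equivalent to calling the full
chain (the NULL entries contribute nothing), the collapse is safe to
do once at stack instantiation time.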
The Lite2 code recognizes this at the VOP_LOCK level in a primitive
fashion by introducing the lockmgr() call, but the model is not
uniformly applied, and deadly-embrace or two-caller starvation
deadlocks can still occur in the Lite2 model.

Going to the next step, a veto model, both increases the code
robustness considerably and resolves the state wind/unwind problems
inherent in fan out.  The fan out problem is *the* problem with the
unionfs at this point.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.