From owner-freebsd-fs Thu Nov 18 15:20:50 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
	by hub.freebsd.org (Postfix) with ESMTP id 8BCD81508E;
	Thu, 18 Nov 1999 15:20:45 -0800 (PST)
	(envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA29976;
	Thu, 18 Nov 1999 18:20:44 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA15756;
	Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Date: Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Message-Id: <199911182320.SAA15756@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok
To: Eivind Eklund
Cc: Erez Zadok, fs@FreeBSD.ORG
Subject: Re: namei() and freeing componentnames
In-reply-to: Your message of "Thu, 18 Nov 1999 15:32:20 +0100." <19991118153220.E45524@bitbox.follo.net>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <19991118153220.E45524@bitbox.follo.net>, Eivind Eklund writes:

> [Note to impatient readers - a forward view is included at the bottom
> of this mail]
>
> On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote:
> > In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes:
[...]
> The problem I'm finding with VOP_RELEASEND() is that namei() can
> return two different vps - the dvp (directory vp) and the actual vp
> (inside the directory dvp points at), and that neither of these is
> always available.
>
> As I am writing the code right now, I am using either of these, with a
> preference for the dvp.  I am considering splitting VOP_RELEASEND()
> into VOP_RELEASEND() and VOP_DRELEASEND(), which take the different
> VPs as parameters - this will at least give something that is easy to
> search for if we need to change the behaviour somehow.
I found similarly "annoying" functionality in Solaris's open() routine.
Sometimes it can return a new dvp, sometimes NULL, and sometimes a copy
of or reference to another vnode (I think due to dup() stuff).

From my POV, after having ported stackable templates to several OSs, I
have found that vnode/vfs functions that try to do too much make the
life of a stackable f/s developer harder.  Functions that behave
differently under different (input) conditions are also hard to work
with.  The reason is that stackable file systems have to be
layer-independent: they have to treat the file system on which they are
stacked as if they were the VFS calling that layer, and at the same
time they must appear to the VFS as a low-level f/s.  IOW, a stackable
f/s is both a VFS and a lower-level f/s, and thus has to simulate and
act as both.  So whatever behavior your VFS has before it calls a VOP_*
must be simulated accurately inside the stackable f/s before it calls
the lower one.  That is easier to achieve when vnode/vfs functions are
small, simple, and always behave the same.

So I would say that if you think splitting VOP_RELEASEND in two would
make things simpler, go for it here and everywhere else.  The lesson
learned from the Linux vfs's (rapid :-) evolution is a good one: after
adding more and more inode/file/dentry/super_block functions, and
making them relatively small and simple, they found ways to push some
of that functionality up to the VFS.

[...]
> Actually, I am reasonably confident that we can do the fixes without
> impacting performance noticeably.

That's great!

[...]
> Forward view: I'm undecided on the next step.  Possibilities:
> (1) Change the way locking is specified to make it feasible to test
>     locking patches properly, and change the assertion generation to
>     generate better assertions.  This will probably require changing

I'm not sure I understand what you mean by assertion generation.
>     VOP_ISLOCKED() to be able to take a process parameter, and return
>     different values based on whether an exclusive lock is held by
>     that process or by another process.  The present behaviour will be
>     available by passing NULL for this parameter.
>
>     Presently, running multiple processes does not work properly, as
>     the assertions do not really assert the right things.
>
>     These changes are necessary to properly debug the use of locks,
>     which I again believe is necessary for stacking layers (which I
>     would like to work in 4.0, but I don't know if I will be able to
>     have ready).

Locks are probably one of the most frustrating things I've had to deal
with, because you're rarely told whether the objects passed to you are
already locked or allocated, whether their reference counts have been
updated, and what, if anything, you are expected to do about any of
these.  FreeBSD is very good about documenting most of these
conventions in the vnode_if.src file, but Solaris and Linux are not.
I've had to implement a strict un/locking order in my wrapfs templates
to avoid deadlocks.  Some of that code is so hairy that I dread each
time the (Linux) vfs changes and I've got to touch my locking code;
that's a sure way to waste several days debugging.

Deciding on proper locking is difficult.  In Linux, for example, they
had most locking done in the VFS; that sounds great at first, because
f/s code doesn't have to worry about locking objects.  But they found
out that to get better SMP performance, each f/s would have to do its
own locking, and so they pushed some of the locking down to be the
f/s's responsibility.  Locking seems to be stuff that happens all over:
part in the VFS, part in the VM/buffer cache, and part inside file
systems.  Is there a way to make locking an explicit part of the vnode
interface?  Is there a way to keep locking in the VFS by default (for
simplicity), but allow those f/s that want to manage their own locks to
do so?  How messy and maintainable would such code be?
I guess what I'm arguing for is interface flexibility, so we don't have
to revise the interface again any time soon.

Eivind, if you haven't recently, I suggest you look at some of the
stacking papers (Rosenthal's UI paper, Heidemann, Popek, Skinner/Wong,
etc.).  Rosenthal's "requirements" paper succinctly described several
important issues, including atomicity of multi-vnode operations.
Rosenthal suggested that kernels should have a full transaction engine,
which I think is eventually necessary, but it's very complex to
implement.  The next best thing is to do some form of safe locking.

Normally each vnode/inode has its own lock.  Imagine a replicated
stackable f/s (replicfs) with a fan-out of 3, so a vnode (V0) at the
replicfs level would have access to three lower vnodes (V1, V2, V3).
If you want to make a change (say, create a file) in V0, you have to
lock V0-V3 at once.  Without vfs support for this, replicfs would have
to enforce ordered locking (such as I've done in wrapfs) and hope for
the best.  If the vfs is smarter, it can help replicfs lock all 4
vnodes at once; or the vfs can allow replicfs to control the locks
below it, and all the vfs has to do is ensure that no one else can lock
V1-V3.

I don't have a good answer to this locking issue.  The papers I've
cited describe changes to the vnode interface that simplify locking.
One way they do that is to have only one lock per chain (or stack, or
DAG) of stacked file systems.  So, for example, a DAG of stackable f/s
is represented by one data structure that contains locks and other
things that are true about the whole DAG, and then smaller data
structures for each node/leaf of the DAG, containing stuff that's true
about that vnode (e.g., its operations vector).

> (2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can,
>     and return how much that was" rather than "Eat a single path
>     component; we have already decided what this is."
>     This allows different types of namespaces, and it allows
>     optimizations in VOP_LOOKUP() when several steps in the traversal
>     are inside a single filesystem (and hey - who mounts a
>     new filesystem on every directory they see, anyway?)
>
>     This change is rather small, and it would be nice to have in 4.0
>     (I want the VFS differences from 4.0 to 5.0 to be as small as
>     possible).
>     It is pretty orthogonal to stacking layers; stacking layers gain
>     the same capabilities as other file systems from it.

Multi-component lookup has always been desirable.  There's a paper by
Duchamp (USENIX '94) on multi-component lookup in NFS.  I think we
should allow for multi-component lookup as well as the old-style "one
component at a time" lookup; I would argue that the default should
still be the old style.  Someone might want to write a stackable f/s
that does special things as it traverses each component of the
pathname.  For example, a general-purpose unionfs (one which uses
fan-out, unlike the single-stack design in 4.4BSD) might follow into
different underlying directories as it looks up single components;
unionfs has all kinds of interesting semantic issues that would require
more flexibility at lookup time.

Lookup is fairly complex as it is.  If you're going to add
multi-component lookup, then maybe it should be a new vop?  If not a
new vop, then make sure it's added to the current vop_lookup such that
a f/s has enough flexibility to control the type of lookup it wants.
Also, it would be nice if the type of lookup used could be controlled
dynamically by the f/s itself (as opposed to, say, a mount() flag that
sets the lookup type for the duration of the mount).

> Eivind.

Cheers,
Erez.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message