Date:      Thu, 8 Aug 1996 14:48:28 -0700 (MST)
From:      Terry Lambert <terry@lambert.org>
To:        dfr@render.com (Doug Rabson)
Cc:        terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org
Subject:   Re: NFS Diskless Despair...
Message-ID:  <199608082148.OAA17616@phaeton.artisoft.com>
In-Reply-To: <Pine.BSI.3.95.960808173810.10082U-100000@minnow.render.com> from "Doug Rabson" at Aug 8, 96 06:47:44 pm

> > Conceptually, it's pretty trivial to support; it's not supported
> > because the stacking is not correctly implemented for these FS's.
> > Look at the /sys/miscfs/nullfs use of VOP_VGET.
> 
> VFS_VGET is not implemented in NFS because the concept just doesn't apply.
> VFS_VGET is only relevant for local filesystems.  NFS does have a flat
> namespace in terms of filehandles but not one which you could squeeze into
> the VFS_VGET interface.

The flat name space is the nfsnodes, not the file handles.  In the NFS
case, you would simply *not* implement recovery without reallocation.
The allocation time is small compared to the wire time, and the actions
could be interleaved by assuming a success response, with an additional
dealloc overhead for the failure case.
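Roughly, as a sketch (the helper names here are made up for illustration;
they are not the real NFS client entry points):

	/*
	 * Sketch only: hypothetical helpers, not the real nfsnode code.
	 * The point is the ordering: allocate before the wire round trip,
	 * and pay the extra dealloc only in the failure case.
	 */
	struct nfsnode;                                     /* opaque here */

	struct nfsnode *nfs_node_alloc(void);               /* hypothetical */
	void            nfs_node_free(struct nfsnode *);    /* hypothetical */
	int             nfs_rpc_lookup_fh(const void *, struct nfsnode *);
	                                                    /* hypothetical */

	int
	nfs_vget_sketch(const void *fh, struct nfsnode **npp)
	{
		struct nfsnode *np;
		int error;

		np = nfs_node_alloc();             /* cheap vs. wire time */
		error = nfs_rpc_lookup_fh(fh, np); /* the slow part: the RPC */
		if (error) {
			nfs_node_free(np);  /* dealloc overhead, failure only */
			return (error);
		}
		*npp = np;
		return (0);
	}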

> > > You can't do this for NFS.  If you use exclusive locks in NFS and a
> > > server dies, you easily can end up holding onto a lock for the root vnode
> > > until the server reboots.  To make it work for NFS, you would have to make
> > > the lock interruptable which forces you to fix code which does not check
> > > the error return from VOP_LOCK all over the place.
> > 
> > This is one of the "flags" fields, and it only applies to the NFS client
> > code.  Actually, since the NFSnode is not transiently destroyed as a
> > result of server reboot (statelessness *is* a win, no matter what the
> > RFS advocates would have you believe), there isn't a problem with holding
> > the reference.
> 
> So the NFS code would degrade the exclusive lock back to a shared lock?
> Hmm.  I don't think that would work since you can't get the exclusive lock
> until all the shared lockers release their locks.

You would release the lock and mark it for reassertion pending
availability, signalled by the rpc.mount negotiation succeeding.  Do
this by setting up a fake sleep address.
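
Something like this, as a rough sketch assuming the 4.4BSD-era
tsleep()/wakeup() interface (the nfsmount fields and the nfs_server_back()
hook are invented for illustration):

	#include <sys/param.h>
	#include <sys/systm.h>    /* tsleep()/wakeup() in 4.4BSD-era kernels */

	/* Invented for illustration: one token per NFS mount to sleep on. */
	struct nfsmount_sketch {
		int	nm_wantlock;		/* someone waiting to reassert */
		char	nm_reconnect_token;	/* fake sleep address, never read */
	};

	/* Caller that would otherwise hold the vnode lock across the outage. */
	void
	nfs_wait_for_server(struct nfsmount_sketch *nmp)
	{
		/*
		 * Drop the vnode lock first (not shown), then block
		 * interruptibly on the fake address until the mount comes back.
		 */
		nmp->nm_wantlock = 1;
		(void) tsleep(&nmp->nm_reconnect_token, PSOCK | PCATCH,
		    "nfsrea", 0);
		/* On wakeup, reassert the lock and retry the operation. */
	}

	/* Called when the rpc.mount renegotiation succeeds. */
	void
	nfs_server_back(struct nfsmount_sketch *nmp)
	{
		if (nmp->nm_wantlock) {
			nmp->nm_wantlock = 0;
			wakeup(&nmp->nm_reconnect_token);
		}
	}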

The trade off is between blocking a process (which you will have to do
anyway) and hanging the kernel.

The locks are local.  The only possible race condition is local stacking
on top of the NFS on the client side.  You can either not allow it, or
you can accept the fact that someone might win the thundering herd race
(in which case you just get delayed a bit), or you can FIFO the request
list with an array and a request entrancy limit when the array is full,
degrading to a thundering herd to get into the FIFO list (see the sketch
below).  It's unlikely that someone will be running hundreds of processes
from an NFS server that crashes, and care who gets their page requests
satisfied first.  The delay from misordering is going to be *nothing*
compared with the delay for a network resource which is unavailable long
enough to have the request list fill up.  I think it's a non-problem to
unwind the state, and the collision avoidance is well worth the worst
case being slightly degraded.  Currently in BSD and SunOS, if the server
can't satisfy a page request from one local process, it blocks and the
whole system goes to hell.  This way, only the processes which are
relying on the unreliable resource go to hell.  Even so, I still vote
for flagging the NFS mount to force a copy to swap of any file being
used as swap store from an unreliable server.  It's a better long-term
solution anyway.
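
A sketch of the FIFO-with-entrancy-limit idea (all names invented, and it
assumes the usual spl-protected, non-preemptive kernel context of the era):

	#include <sys/param.h>
	#include <sys/systm.h>	/* tsleep()/wakeup(); kernel-only sketch */

	#define NFSQ_SLOTS	64	/* request entrancy limit (arbitrary) */

	/* Invented for illustration only. */
	struct nfs_req_fifo {
		void	*rq_slot[NFSQ_SLOTS];	/* per-request sleep addresses */
		int	rq_head, rq_tail, rq_count;
		char	rq_overflow;		/* shared address for the herd */
	};

	/* Block until it is this request's turn, in arrival order. */
	void
	nfsq_wait(struct nfs_req_fifo *q, void *token)
	{
		while (q->rq_count == NFSQ_SLOTS) {
			/* Array full: degrade to thundering herd for a slot. */
			(void) tsleep(&q->rq_overflow, PVFS, "nfsqfull", 0);
		}
		q->rq_slot[q->rq_tail] = token;
		q->rq_tail = (q->rq_tail + 1) % NFSQ_SLOTS;
		q->rq_count++;
		(void) tsleep(token, PVFS, "nfsqwait", 0);  /* FIFO wakeup */
	}

	/* Server is back: wake exactly the oldest waiter, preserving order. */
	void
	nfsq_wake_next(struct nfs_req_fifo *q)
	{
		if (q->rq_count > 0) {
			wakeup(q->rq_slot[q->rq_head]);
			q->rq_head = (q->rq_head + 1) % NFSQ_SLOTS;
			q->rq_count--;
			wakeup(&q->rq_overflow); /* herd races for the slot */
		}
	}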


> > One of the things Sun recommends is not making the mounts on mount
> > points in the root directory; to avoid exactly this scenario (it really
> > doesn't matter in the diskless/dataless case, since you will hang on
> > swap or page-in from image-file-as-swap-store anyway).
> 
> It doesn't matter if they are on mount points in root.  If a lock is stuck
> in a sub-filesystem, then the 'sticking' can propagate across the mount
> point.

Well, yes, I suppose.  There are better ways to fix that; specifically,
lock the node that is covered before you lock the covering node in a
mount point traversal.  The issue is then resolved locally at the second
process waiting for the node, without propagating up past the mount point.
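
In rough terms (the lock/unlock wrappers here are stand-ins, not the real
lookup() code):

	/*
	 * Sketch only.  The ordering is the point: take the covered
	 * vnode's lock first, then ask for the root of the mounted
	 * filesystem.  A second lookup then queues on the covered node
	 * at the mount point, and the stall does not propagate back up
	 * toward /.
	 */
	struct vnode;
	struct mount;

	int  vn_lock_sketch(struct vnode *vp);                      /* hypothetical */
	void vn_unlock_sketch(struct vnode *vp);                    /* hypothetical */
	int  vfs_root_sketch(struct mount *mp, struct vnode **vpp); /* hypothetical,
	                                                             * returns locked root */

	int
	cross_mountpoint(struct vnode *covered, struct mount *mp,
	    struct vnode **rootvpp)
	{
		int error;

		if ((error = vn_lock_sketch(covered)) != 0)
			return (error);
		error = vfs_root_sketch(mp, rootvpp); /* may block if server is down */
		vn_unlock_sketch(covered);            /* release once root is resolved */
		return (error);
	}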

I'm more concerned with interaction between multiple mounts of CDROM's on
a changer device.  It's more likely, if you ask me.

Nevertheless, if you are running something off an NFS server, and it can't
run, then it can't run.  FreeBSD is no less graceful about that than any
commercial OS.

> > The root does not need to be locked for the node lookup for the root
> > for a covering node in any case; this is an error in the "node x covers
> > node y" case in the lookup case.  You can see that the lookup code
> > documents a race where it frees and relocks the parent node to avoid
> > exactly this scenario, actually.  A lock does not need to be held
> > in the lookup for the parent in the NFS lookup case for the mount
> > point traversal.  I believe this is an error in the current code.
> 
> Have to think about this some more.  Are you saying that when lookup is
> crossing a mountpoint, it does not need any locks in the parent
> filesystem?

It needs locks on the covered node, but it does not need to propagate
the collision to root.  The only case this fails is when / is NFS mounted
and the server goes down.  You have worse problems at that point, and
hanging for the server to come back up is most likely the right thing to
do in that case anyway.


> > The easiest way to think of this is in terms of provider interfaces
> > and consumer interfaces.  There are many FS provider interfaces.  The
> > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the
> > NFS client.  This goes hand in hand with the discussion we had about
>       ^^^^^^
> Do you mean NFS server here?

Yes, thanks; sorry about that.


> > the VOP_READDIR interface needing to be split into "get buffer/reference
> > buffer element" (remember the conversation about killing off the cookie
> > interface about a year ago?!?!).
> 
> I remember that.  I think I ended up agreeing with you about it.  The
> details are a bit vague..

I saved them; I can forward them if need be.  The details were vague
because I wanted an interface that let me tell it what I wanted back,
but a struct direct-only return would be acceptable for an interim
implementation.  That's one that could be broken up without too much
trouble.
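
The shape of it was something like this (the names and argument lists below
are invented to show the split, not a proposal for the actual vnode_if
entries):

	/*
	 * Sketch of a split readdir interface: one op fills an opaque,
	 * FS-native buffer of entries, a second steps through it
	 * returning whatever representation the consumer asked for.
	 * Hypothetical names; not the current VOP_READDIR.
	 */
	struct vnode;
	struct ucred;
	struct dirent;

	/* Fill an FS-native buffer of directory entries starting at *offp. */
	int	vop_getdirbuf_sketch(struct vnode *vp, void *buf, int buflen,
		    long *offp, struct ucred *cred);

	/*
	 * Return one entry per call from that buffer, in the caller's
	 * chosen form: a struct dirent for the syscall layer, or a
	 * wire-format entry plus cookie for the NFS server, so the
	 * cookie hack stays out of the individual filesystems.
	 */
	int	vop_getdirentry_sketch(struct vnode *vp, void *buf, int buflen,
		    int *residp, struct dirent *dep, long *cookiep);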


[ ... ]

> > Currently collapse is not implemented.  Part of the support for
> > collapse without full kernel recompilation on VOP addition was the
> > 0->1 FS instance count changes to the vfs_init.c code and the
> > addition of the structure sizing field in the vnode_if.c generation
> > in my big patch set (where the vnode_if.c generated had the structure
> > vfs_op_descs size computed in the vnode_if.c file).  The change did
> > not simply allow the transition from 0->N loadable FS's (part of
> > the necessary work for discardable fallback drivers for the FS,
> > assuming kernel paging at some point in the future), and it did not
> > just allow you to add VFS OPS to the vnode_if without having to
> > recompile all FS modules and LKM's (its stated intent).  The change
> > also allows (with the inclusion of a structure sort, since the init
> > causes a structure copy anyway to get it into a stack instantiation)
> > the simplification of the vnode_if call to eliminate the intermediate
> > functioncall stub: a necessary step towards call graph collapse.  You
> > want this so that if you have 10 FS layers in a stack, you only have
> > to call one or two veto functions out of the 10... and if they are
> > all NULL, the one is synthetic anyway.
> 
> This is interesting.  It is similar to the internal driver architecture we
> use in our 3D graphics system (was Reality Lab, now Microsoft's Direct3D).
> The driver is split up into different modules depending on functionality.
> The consumer (Direct3D) has a stack which it pushes driver modules onto
> for all the required functionality.  This used to be useful for
> reconfiguring the stack at runtime to select different rendering
> algorithms etc.  Direct3D broke that unfortunately but that is another
> story.

I cheated for this; the two "competing" vnode stacking architectures
are Heidemann's (the one we are using) and Rosenthal's (which lost out).
Rosenthal alludes to stack collapse in his Usenix paper on "A file
system stacking architecture".

The general problem with Rosenthal's support is the same problem
Novell was having in their "Advanced File System Design": personal
views.

A personal view allows the FS to have a canonical form, and each user
can choose his view on the FS.  The problem with this is the same
problem Windows95 has now with desktop themes: support is impossible.
Imagine the user who is told to "drag that icon to the wastebasket to
fix the problem"... he may have a beaker of acid, or a black hole or a
trash compactor, or whatever... there are no Schelling points in
common that the user and the technical support person agree on so
that they can communicate effectively.

You can steal the personal view idea of a canonical form for a directory
structure by specifying a canonicalization name space for files regardless
of their name in the real name space.  This was the basis of some of
my internationalization work about two years ago (the numeric name
space suggestion that allowed you to rename system critical files like
/etc/passwd to Japanese equivalents and have NIS and login keep working).

Rosenthal needed stack collapse to reduce the memory requirements per
view instance so he could have views at all.


> It communicates with the drivers by sending service calls to the top
> driver in the stack.  Each service call has a well defined number.  If
> that module understands the service, it implements it and returns a
> result.  Otherwise, it passes the service call down to the next driver in
> the stack.  Some modules override service calls in lower layers and they
> typically do their own work and then pass the service onto the next layer
> in the stack.

Yes.  This is exactly how the Heidemann thesis wants the VFS stacking
to work.  It fails because of the way the integration into the Lite
code occurred in a rush as a result of the USL lawsuit and settlement.
Specifically, there's no concept of adding a new VFS OP without rebuilding
FFS (which is used to get the max number of VFS OPs allowed, in the
current FreeBSD/NetBSD code).


> To optimise the system, we added a service call table in the stack head. 
> When a module is pushed onto the stack, it is called to 'bid' some of its
> services into the service call table.  Each module in turn going up the
> stack puts a function pointer into the table for each of the services it
> wants to implement.  If it is overriding a lower module, it just
> overwrites the pointer.

This is not quite the same inheritance model.  Basically, you still need
to be able to call each inferior layer.  Consider the unionfs that unions
two NFS mounts.  Any fan in/fan out layer must be non-null.
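
Schematically (the types and the two-lower-vnode layout below are made up;
this is not the real unionfs code):

	/*
	 * Schematic only: a union-style node over two lower vnodes,
	 * e.g. two NFS mounts.  A null/bypass layer can only forward an
	 * operation to a single underlying vnode; a fan in/fan out op
	 * has to dispatch to *both* lower layers and combine the
	 * results, so it cannot be collapsed away or overwritten by
	 * either lower layer's table entry.
	 */
	struct vnode;

	int vop_access_lower_sketch(struct vnode *lvp, int mode); /* hypothetical */

	struct union_node_sketch {
		struct vnode *un_upper;		/* e.g. first NFS mount */
		struct vnode *un_lower;		/* e.g. second NFS mount */
	};

	int
	union_access_sketch(struct union_node_sketch *un, int mode)
	{
		int error;

		/* Must consult both lower layers; neither alone decides. */
		if (un->un_upper != NULL &&
		    (error = vop_access_lower_sketch(un->un_upper, mode)) != 0)
			return (error);
		if (un->un_lower != NULL &&
		    (error = vop_access_lower_sketch(un->un_lower, mode)) != 0)
			return (error);
		return (0);
	}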


> If you add service calls, nothing needs to recompile (as long as the
> service call table is large enough) because the new services just go after
> the existing ones.

Yes.  The vnode_if.c structure sizing is how I ensured the table was large
enough: that's where the table is defined, so it should be where the
size is defined.  The use of the FFS table to do the sizing in the init
was what broke the ability to add service calls dynamically in the
Heidemann code as integrated into 4.4Lite.
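
In compressed form, the sizing change amounts to this (paraphrased; it is
not the literal generated vnode_if.c, and the desc struct is trimmed to one
field):

	/*
	 * Paraphrased sketch.  The idea: the size of the
	 * operation-description table is computed in the same generated
	 * file that emits the table, so vfs_init() no longer has to
	 * measure FFS's vector to find out how many VOPs exist.
	 */
	struct vnodeop_desc_sketch {
		const char *vdesc_name;
	};

	static struct vnodeop_desc_sketch vop_lookup_desc_s = { "vop_lookup" };
	static struct vnodeop_desc_sketch vop_create_desc_s = { "vop_create" };
	static struct vnodeop_desc_sketch vop_open_desc_s   = { "vop_open" };
	/* ...one entry per generated operation... */

	static struct vnodeop_desc_sketch *vfs_op_descs_sketch[] = {
		&vop_lookup_desc_s,
		&vop_create_desc_s,
		&vop_open_desc_s,
	};

	/* Emitted alongside the table it measures, in the generated file. */
	static int vfs_opv_numops_sketch =
	    sizeof(vfs_op_descs_sketch) / sizeof(vfs_op_descs_sketch[0]);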


> > This is a big win in reducing the current code duplication, which you
> > want to do not only to reduce code size, but to make FS's more robust.
> > The common behaviours of FS's *should* be implemented in common code.
> 
> Agreed.  The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME and
> the general locking/releasing protocol, IMHO.

The lookup is more difficult because of the way directory management
is dependent on file management.  It's not possible to remove the FFS
directory management code and replace it with an override with the
current code arrangement.  Specifically, I can't replace the FFS
directory structure code with a btree (for instance).

The lookup path buffer deallocation patches, pushing the deallocation
up into the consumer interface where the allocation took place, were
a move toward severability of the directory interface.  They had the
side effect of moving toward the ability to support the multiple name
spaces of FS's that require it (VFAT/NTFS/UMSDOS/NETWARE/HFS), and of
abstracting the component representation type (for Unicode support
and more internationalization).
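
The ownership change is easiest to see in miniature (everything below is an
invented, user-level analogy, not the actual namei()/lookup() code):

	/*
	 * Sketch of consumer-side buffer ownership: the component name
	 * buffer is allocated and freed in the consumer (namei-level)
	 * code, never by the individual filesystems.
	 */
	#include <stdlib.h>
	#include <string.h>

	struct componentname_sketch {
		char   *cn_buf;		/* pathname buffer, consumer-owned */
		size_t	cn_buflen;
	};

	int fs_lookup_sketch(struct componentname_sketch *cnp);  /* hypothetical */

	int
	consumer_lookup(const char *path)
	{
		struct componentname_sketch cn;
		int error;

		cn.cn_buflen = strlen(path) + 1;
		cn.cn_buf = malloc(cn.cn_buflen);	/* allocate here ... */
		if (cn.cn_buf == NULL)
			return (-1);
		memcpy(cn.cn_buf, path, cn.cn_buflen);

		error = fs_lookup_sketch(&cn);	/* FS layers never free it */

		free(cn.cn_buf);		/* ... free it here, win or lose */
		return (error);
	}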

This doesn't resolve the separability problem of the directory code,
but it goes a long way toward freeing up the dependencies to allow
incremental changes.  I seriously dislike the relookup for the rename
code, and think that it needs to be rethought.  But separability
was a necessary first step.


> > The Lite2 code recognizes this at the VOP_LOCK level in a primitive
> > fashion by introducing the lockmgr() call, but since the model is not
> > uniformly applied, deadly-embrace or two-caller starvation deadlocks
> > can still occur in the Lite2 model.  Going to the next step, a veto
> > model, both increases the code robustness considerably, as well as
> > resolving the state wind/unwind problems inherent in fan out.  The
> > fan out problem is *the* problem with the unionfs, at this point.
> 
> Well at the moment, I think we have to just grit our teeth and merge in
> the lite2 code as it stands.  We have to at least try to converge with the
> other strains of 4.4, if only to try and share the load of maintaining the
> filesystem code.  I strongly believe that there should be a consensus
> between the different 4.4 groups over FS development or we just end up
> with chaos.

The Lite 2 code merge *needs* to take place.  I need to spend more time
on it now that I'm good for work + 1 hour or so a day of sitting.  I'll
subscribe to that list pretty soon now.

As to maintenance and design... well, I think we have a problem no
matter what we do.  The Heidemann thesis and the other FICUS documents
are *the* design documents, IMO.  The problem is that the current code
in the 4.4 camps does not conform to the design documents.  I think
that no matter what, that needs to be corrected.  Then there are issues
of kludges for the interface design, or for missing technology pieces
that simply have not been considered in the 4.4 code.  The biggest
kludge is that there is no documented bottom-end interface.  We already
have an unresolvable discrepancy because of VM differences.  The second
biggest kludge is the workaround for the directory structure size
differences... the origin of the "cookie" crap in the VOP_READDIR
interface.  NetBSD and FreeBSD solved this problem in a bad way, and
are in fact not interoperable at this point because of that.  Finally,
there's the fact that 4.4 as shipped didn't support kernel module loading
of any kind, and so there was no effort to limit the recompilation
necessary for adding VOP's in the default vfs_init, the vfs_fs_init,
or in the vnode_if method generation code.

Short of starting up CSRG again, I don't see a common source for
solving the Lite/Lite2 VFS problems.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


