Date: Sat, 23 Jun 2001 13:27:50 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: j mckitrick <jcm@freebsd-uk.eu.org>
Cc: Terry Lambert <tlambert@primenet.com>, freebsd-chat@freebsd.org
Subject: Re: most complex code in BSD?
Message-ID: <3B34FBC6.535C799B@mindspring.com>
References: <20010622221453.B64495@dogma.freebsd-uk.eu.org> <200106222143.OAA28673@usr06.primenet.com> <20010623133638.A84446@dogma.freebsd-uk.eu.org>
j mckitrick wrote:
> This sounds pretty much to me like a vector table, or even
> thunking. How is it different? Also, if fsfunc49 (Joe
> Walsh reference ;-) exists but the fs doesn't know about
> it, how will it still work? Do you simply leave an empty
> stub that returns without doing anything, or do you direct
> it to a corresponding set of older calls that accomplish
> the same thing?
I didn't tell you the entire story, since this was about
descriptor-based call interfaces, not about the VFS code
in particular, except as an aside about how to understand
what people view as the most complicated part of it.
Note: PHK broke this somewhat when he added "default VOPs";
the intent is that you inherit upward to failure, not to some
default for which there exists a real implementation. It will
still work, but realize that there are some hard stops missing,
where the system will claim an implementation exists, when it
should actually fail.
---
The easiest place to see this is the "ufs" entry points
in the ffs_vnodeop_entries[] table.
What I didn't tell you is that the vnode stacking works
like this:
[ vnode ] -> ,---------.
             | private |
             |  data   | -> [ vnode ] -> ,---------.
             `---------'                 | private |
                                         |  data   |
                                         `---------'
There is a vnodeop_entries vector pointed to by each vnode;
there is a vnode per stacking layer (or per stacking layer
with implementation semantics, if the vectors are properly
collapsed to ensure maximum efficiency: collapse would be
through upward inheritance; more on this later).
This entry vector is created at mount time, so if you have
an FS stacked on top of an FS, you have (simplified) for
each FS an entry for:
          read    write   stat    readdir  unlink  rename
etc. When you stack, you get an implied "ENOENT", so:
Default   ENOENT  ENOENT  ENOENT  ENOENT   ENOENT  ENOENT
FS1       read    write   -       -        -       -
FS2       -       -       stat    -        -       -
FS3       -       -       -       readdir  unlink  -
Result    read    write   stat    readdir  unlink  ENOENT
The "Result" FS is created by creating a collapsed vector
made up of the topmost implementation you come to.
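
As a rough sketch of that collapse for the table above (all names
here, such as struct layer, collapse() and the fs1_read style
stand-ins, are made up for illustration and are not the kernel's
own; the real code builds its vectors from vnodeopv descriptors):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operation slots, one per column of the simplified table. */
enum { OP_READ, OP_WRITE, OP_STAT, OP_READDIR, OP_UNLINK, OP_RENAME, OP_COUNT };

/* Every entry point takes a single descriptor pointer. */
typedef int (*op_fn)(void *desc);

/* The implied default row: nothing implements the operation. */
int op_enoent(void *desc) { (void)desc; return ENOENT; }

/* Stand-in implementations; they only need distinct addresses here. */
int fs1_read(void *desc)    { (void)desc; return 0; }
int fs1_write(void *desc)   { (void)desc; return 0; }
int fs2_stat(void *desc)    { (void)desc; return 0; }
int fs3_readdir(void *desc) { (void)desc; return 0; }
int fs3_unlink(void *desc)  { (void)desc; return 0; }

struct layer {
    const char *name;
    op_fn ops[OP_COUNT];    /* NULL plays the role of "-" in the table */
};

/* layers[0] is the topmost layer; result gets the first hit per column. */
void collapse(const struct layer *layers, size_t nlayers, op_fn result[OP_COUNT])
{
    assert(layers != NULL || nlayers == 0);
    for (int op = 0; op < OP_COUNT; op++) {
        result[op] = op_enoent;             /* start from the Default row */
        for (size_t i = 0; i < nlayers; i++) {
            if (layers[i].ops[op] != NULL) {
                result[op] = layers[i].ops[op];
                break;                      /* topmost implementation wins */
            }
        }
    }
}

/* Builds the "Result" row of the first table. */
void build_first_example(op_fn result[OP_COUNT])
{
    const struct layer stack[] = {
        { "FS1", { [OP_READ] = fs1_read, [OP_WRITE] = fs1_write } },
        { "FS2", { [OP_STAT] = fs2_stat } },
        { "FS3", { [OP_READDIR] = fs3_readdir, [OP_UNLINK] = fs3_unlink } },
    };
    collapse(stack, sizeof(stack) / sizeof(stack[0]), result);
}
```

Calling build_first_example() on an op_fn array of OP_COUNT entries
leaves the rename slot pointing at op_enoent, which is the "ENOENT"
in the last column of the Result row.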
Now say you had:
Default   ENOENT  ENOENT  ENOENT  ENOENT   ENOENT  ENOENT
FS1       read    write   -       -        -       -
FS2       -       -       stat    -        -       -
FS3       read    write   stat    readdir  unlink  -
Result    read    write   stat    readdir  unlink  ENOENT
How does this work?
The answer is, it's still the topmost implementations:
the collapsed vector contains exactly the same things --
read, write from FS1, stat from FS2, readdir and unlink
from FS3.
The difference is that the FS1 read and write must be
implemented by calling a read and write in an underlying
layer: they are not permitted to do the disk I/O directly.
Similarly, the FS2 stat must be implemented by calling a
lower layer stat.
In this case, FS1 and FS2 are _stacking VFS layers_, and
FS3 is a _local media VFS layer_. The difference is that
FS3 can't stack on top of another FS.
The FreeBSD code pre-stacks FFS on top of UFS, and that's
the default FFS vnodeopv_entry_desc named ffs_vnodeop_entries[].
The layer assembly happens when a VFS stack is instanced,
which happens at mount time. The layers are set up via
a call to vfs_register() (permit me an oversimplification).
The main magic lies in the fact that if a layer doesn't
know about an entry point, it doesn't touch it, it just
passes the descriptor through unmodified to a lower layer,
until someone does know about it, or until it hits the
ENOENT layer, which doesn't pass anything through.
This is why, in the second example, I can call the FS1
read, and it can call the lower layer read, and the FS2
read will just pass it through to the FS3 read, without
causing an error.
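
Here is a minimal sketch of that pass-through, using the second
example's stack. Everything in it is an assumption made for the
illustration (struct layer, struct op_args, layer_call(), the
"hops" counter); it only shows the shape of the idea, not the
kernel's actual bypass routine:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { OP_READ, OP_STAT, OP_NEWOP, OP_COUNT };   /* hypothetical slots */

struct layer;
typedef int (*op_fn)(struct layer *self, void *ap);

/* Common header at the start of every argument descriptor. */
struct op_args {
    int op;      /* slot in the layer's vector */
    int hops;    /* demo only: layers that looked at this call */
};

struct layer {
    struct layer *lower;     /* NULL below the bottom layer */
    op_fn ops[OP_COUNT];     /* NULL: this layer doesn't know the op */
};

/* Use the layer's entry if it has one, otherwise hand the untouched
 * descriptor to the next layer down; falling off the bottom is ENOENT. */
int layer_call(struct layer *l, void *ap)
{
    struct op_args *hdr = ap;    /* only the common header is inspected */
    assert(hdr->op >= 0 && hdr->op < OP_COUNT);
    for (; l != NULL; l = l->lower) {
        hdr->hops++;
        if (l->ops[hdr->op] != NULL)
            return l->ops[hdr->op](l, ap);
    }
    return ENOENT;               /* the hard stop */
}

/* A full descriptor: the header followed by op-specific arguments. */
struct read_args {
    struct op_args hdr;
    size_t len;
    size_t done;
};

/* Stacking layer: must implement read by calling the layer below. */
int fs1_read(struct layer *self, void *ap)
{
    return layer_call(self->lower, ap);
}

/* Local media layer: pretends to do the I/O itself. */
int fs3_read(struct layer *self, void *ap)
{
    struct read_args *ra = ap;
    (void)self;
    ra->done = ra->len;
    return 0;
}

/* Builds FS1 over FS2 over FS3 and issues one call with the given op. */
int demo_call(int op, size_t *done, int *hops)
{
    struct layer fs3 = { NULL, { [OP_READ] = fs3_read } };
    struct layer fs2 = { &fs3, { NULL } };   /* no read entry at all */
    struct layer fs1 = { &fs2, { [OP_READ] = fs1_read } };
    struct read_args ra = { { op, 0 }, 16, 0 };
    int error = layer_call(&fs1, &ra);

    *done = ra.done;
    *hops = ra.hdr.hops;
    return error;
}
```

With OP_READ, the call passes FS1, falls through FS2 untouched, and
is completed by FS3. With OP_NEWOP, which no layer knows, the same
descriptor travels all the way down and comes back as ENOENT.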
Note: It's possible to optimize this by instancing an
FS2+FS3 collapsed vector. This would make the FS1 call
directly to the FS3, skipping a "null" VOP passdown.
FreeBSD doesn't do this, currently, so a "nullfs" will
actually end up adding overhead and vnode pointer (vp)
traversals, and allocating vnodes at the null layer for
encapsulation of the requests to push them down to the
next layer. Very inefficient. 8-(.
So to recap:
1) There is a soft version of a coalesced vector
stack, which lets the consumer see only the top
level entry point for each function.
2) Things are passed by descriptor so that they are
a single (usually "void") pointer, so that even
though an intermediate layer doesn't know about
the operation, it can still pass it on; without
this, it would have to know about the arguments
for the unknown function so that it could pop
them off its call stack and push them back on the
stack to call the underlying layer. This can't
work.
2a) A size and type are included, so that descriptors
can be proxied. This means they can be passed
over a single network pipe, or passed to user
space, to let people develop VFS stacking layers
there, or proxied elsewhere.
3) There is a "hard stop" layer at the bottom.
4) There is a difference between "local media" FS's
and "VFS stacking layer" FS's: a "local media" FS
is very much the same as a "hard stop" layer, in
that the calls can not be pushed down further.
5) The top level represents a consumer (in FreeBSD,
there are two consumers: the system call interface,
and the NFS server code).
6) There are a lot of optimizations that could be
done that FreeBSD isn't doing.
6a) Interior vector collapse, to get rid of null VOPs
in intermediate layers, reduce call overhead, and
reduce resource wastage (e.g. vnodes per layer).
6b) List vector sorting of all known VOPs. This would
permit instantiated stacks to save one lookup, one
dereference, and one function call overhead, per
call.
6c) Optimized direct vector handoff. This would let
a stacking or implementation layer make a VOP
call itself, without the descriptor unpack and
repack that's currently necessary.
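
To illustrate point 2a, here is a small sketch of why carrying a
size and type makes a descriptor proxiable. The layout (struct
desc_hdr, struct stat_desc) and proxy_copy() are hypothetical, and
the sketch assumes a flat descriptor with no embedded pointers; a
real descriptor that carries vnode pointers would need more work
before it could cross a pipe or the user/kernel boundary:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical self-describing header at the front of each descriptor. */
struct desc_hdr {
    uint32_t type;    /* which operation this descriptor is for */
    uint32_t size;    /* total size in bytes, header included */
};

/* One particular descriptor; the proxy below never sees this type. */
struct stat_desc {
    struct desc_hdr hdr;
    uint64_t inode;
    uint32_t flags;
};

/* Copies a descriptor knowing only its header, the way a proxy would
 * before writing it to a pipe. Returns NULL on a bad size or on
 * allocation failure; the caller frees the result. */
void *proxy_copy(const void *desc)
{
    struct desc_hdr hdr;
    void *out;

    assert(desc != NULL);
    memcpy(&hdr, desc, sizeof(hdr));
    if (hdr.size < sizeof(hdr))
        return NULL;              /* can't even hold its own header */
    out = malloc(hdr.size);
    if (out != NULL)
        memcpy(out, desc, hdr.size);
    return out;
}
```

The point is that proxy_copy() moves a struct stat_desc correctly
without ever naming it, which is the same property that lets an
intermediate layer pass an unknown operation down the stack.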
It's all a pretty trivial assembly of data structures, if
you understand the linkages. So it's not hard to understand,
it's just not often explained properly for students.
Mostly, students are expected to read the FICUS papers, and
in particular, John Heidemann's Master's Thesis on VFS
stacking architecture in FICUS: this is the same code that
he and UCLA donated to CSRG, and which became the VFS
stacking in 4.4BSD, and thus FreeBSD.
-- Terry
