FreeBSD Mail Archives

Date:      Sat, 23 Jun 2001 13:27:50 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        j mckitrick <jcm@freebsd-uk.eu.org>
Cc:        Terry Lambert <tlambert@primenet.com>, freebsd-chat@freebsd.org
Subject:   Re: most complex code in BSD?
Message-ID:  <3B34FBC6.535C799B@mindspring.com>
References:  <20010622221453.B64495@dogma.freebsd-uk.eu.org> <200106222143.OAA28673@usr06.primenet.com> <20010623133638.A84446@dogma.freebsd-uk.eu.org>

j mckitrick wrote:
> This sounds pretty much to me like a vector table, or even
> thunking.  How is it different?  Also, if fsfunc49 (Joe
> Walsh reference  ;-) exists but the fs doesn't know about
> it, how will it still work?  Do you simply leave an empty
> stub that returns without doing anything, or do you direct
> it to a corresponding set of older calls that accomplish
> the same thing?

I didn't tell you the entire story, since this was about
descriptor-based call interfaces, not about the VFS code
in particular, except as an aside about how to understand
what people view as the most complicated part of it.

Note:	PHK broke this somewhat when he added "default VOPs";
the intent is that you inherit upward to failure, not to some
default for which there exists a real implementation.  It will
still work, but realize that there are some hard stops missing,
where the system will claim an implementation exists, when it
should actually fail.

---

The easiest place to see this is the "ufs" entry points
in the ffs_vnodeop_entries[] table.

What I didn't tell you is that the vnode stacking works
like this:

[ vnode ] -> ,---------.
             | private |
             |   data  | -> [ vnode ] -> ,---------.
             `---------'                 | private |
                                         |   data  |
                                         `---------'

There is a vnodeop_entries vector pointed to by each vnode;
there is a vnode per stacking layer (or per stacking layer
with implementation semantics, if the vectors are properly
collapsed to ensure maximum efficiency: collapse would be
through upward inheritance; more on this later).
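
To make the picture concrete, here is a minimal sketch of the linkage in C.  The names (sk_vnode, sk_private, sk_lower, sk_depth) are made up for illustration and are not the real FreeBSD struct vnode; the only thing taken from the description above is that each vnode points at an operations vector and at private data, and that a stacking layer's private data points at the next vnode down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; NOT the real FreeBSD struct vnode. */
struct sk_vnode;
typedef int (*sk_vop_t)(struct sk_vnode *vp, void *desc);

struct sk_vnode {
	sk_vop_t *v_op;		/* per-layer operations vector */
	void	 *v_data;	/* layer-private data */
};

/* Assumed layout of the private data: a link to the vnode below. */
struct sk_private {
	struct sk_vnode *lower;	/* next vnode down, NULL at the bottom */
};

/* Return the vnode one layer down, or NULL if there is none. */
static struct sk_vnode *
sk_lower(const struct sk_vnode *vp)
{
	const struct sk_private *pd = vp->v_data;

	return (pd != NULL ? pd->lower : NULL);
}

/* Count the vnodes in a stack by following the private-data links. */
static int
sk_depth(const struct sk_vnode *vp)
{
	int n = 0;

	assert(vp != NULL);
	for (; vp != NULL; vp = sk_lower(vp))
		n++;
	return (n);
}
```

With one vnode per layer, a two-layer stack walks as top -> bottom -> NULL, which is the "vnode pointer traversal" cost that shows up again further down.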

This entry vector is created at mount time, so if you have
one FS stacked on top of another, you have (simplified) for
each FS an entry for:

	read	write	stat	readdir	unlink	rename

etc.  When you stack, you get an implied "ENOENT" default, so:

	read	write	stat	readdir	unlink	rename
Default	ENOENT	ENOENT	ENOENT	ENOENT	ENOENT	ENOENT
FS1	read	write	-	-	-	-
FS2	-	-	stat	-	-	-
FS3	-	-	-	readdir	unlink	-

Result	read	write	stat	readdir	unlink	ENOENT

The "Result" FS is created by creating a collapsed vector
made up of the topmost implementation you come to.
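
As a rough sketch of that collapse (not the actual kernel code), the table above can be written out in C.  Everything here is hypothetical: the sk_ names, the slot enum, and the stub operations that just return a distinct number so you can tell which layer's entry ended up in the result.  A NULL slot stands for a "-" in the table, and layers[0] is taken to be the topmost layer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { SK_READ, SK_WRITE, SK_STAT, SK_READDIR, SK_UNLINK, SK_RENAME, SK_NOPS };

typedef int (*sk_op_t)(void *desc);

/* The implied default row: every slot fails with ENOENT. */
static int sk_enoent(void *d) { (void)d; return (ENOENT); }

/* Stub operations; the return values are only markers for this demo. */
static int fs1_read(void *d)    { (void)d; return (11); }
static int fs1_write(void *d)   { (void)d; return (12); }
static int fs2_stat(void *d)    { (void)d; return (23); }
static int fs3_readdir(void *d) { (void)d; return (34); }
static int fs3_unlink(void *d)  { (void)d; return (35); }

/* Unlisted slots are NULL, i.e. "-" in the table. */
static sk_op_t fs1[SK_NOPS] = { [SK_READ] = fs1_read, [SK_WRITE] = fs1_write };
static sk_op_t fs2[SK_NOPS] = { [SK_STAT] = fs2_stat };
static sk_op_t fs3[SK_NOPS] = { [SK_READDIR] = fs3_readdir,
				[SK_UNLINK] = fs3_unlink };

/*
 * Build the collapsed vector: for each operation, take the topmost
 * layer that implements it, falling back to the ENOENT default.
 */
static void
sk_collapse(sk_op_t **layers, size_t nlayers, sk_op_t result[SK_NOPS])
{
	assert(layers != NULL && result != NULL);
	for (int op = 0; op < SK_NOPS; op++) {
		result[op] = sk_enoent;
		for (size_t i = 0; i < nlayers; i++) {
			if (layers[i][op] != NULL) {
				result[op] = layers[i][op];
				break;
			}
		}
	}
}
```

Calling sk_collapse() on { fs1, fs2, fs3 } fills the result with the FS1 read and write, the FS2 stat, the FS3 readdir and unlink, and the ENOENT default for rename, matching the "Result" row.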

Now say you had:

	read	write	stat	readdir	unlink	rename
Default	ENOENT	ENOENT	ENOENT	ENOENT	ENOENT	ENOENT
FS1	read	write	-	-	-	-
FS2	-	-	stat	-	-	-
FS3	read	write	stat	readdir	unlink	-

Result	read	write	stat	readdir	unlink	ENOENT

How does this work?

The answer is, it's still the topmost implementations:
the collapsed vector contains exactly the same things --
read, write from FS1, stat from FS2, readdir and unlink
from FS3.

The difference is that the FS1 read and write must be
implemented by calling a read and write in an underlying
layer: they are not permitted to do the disk I/O directly.
Similarly, the FS2 stat must be implemented by calling a
lower layer stat.

In this case, FS1 and FS2 are _stacking VFS layers_, and
FS3 is a _local media VFS layer_.  The difference is that
FS3 can't stack on top of another FS.

The FreeBSD code pre-stacks FFS on top of UFS, and that's
the default FFS vnodeopv_entry_desc array named ffs_vnodeop_entries[].

The layer assembly happens when a VFS stack is instanced,
which happens at mount time.  The layers are set up via
a call to vfs_register() (permit me an oversimplification).

The main magic lies in the fact that if a layer doesn't
know about an entry point, it doesn't touch it; it just
passes the descriptor through unmodified to a lower layer,
until someone does know about it, or until it hits the
ENOENT layer, which doesn't pass anything through.

This is why, in the second example, I can call the FS1
read, and it can call the lower layer read, and FS2, which
has no read of its own, will just pass it through to the
FS3 read, without causing an error.
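
A minimal sketch of that pass-through, in C.  The types and names (sk_layer, sk_desc, sk_call, the SK_FSFUNC49 slot) are invented for illustration and are much simpler than the real VOP descriptor machinery; the assumption is only that a descriptor carries an operation index, that a layer which doesn't know the operation hands the descriptor down untouched, and that falling off the bottom yields ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct sk_layer;

/* Hypothetical descriptor header; real arguments would follow it. */
struct sk_desc {
	int op;			/* which operation this describes */
};

typedef int (*sk_op_t)(struct sk_layer *self, struct sk_desc *d);

enum { SK_READ, SK_STAT, SK_FSFUNC49, SK_NOPS };

struct sk_layer {
	sk_op_t ops[SK_NOPS];	/* NULL: this layer doesn't know the op */
	struct sk_layer *lower;	/* NULL below the last real layer */
};

/* Hand a descriptor to a layer; unknown ops pass down unmodified. */
static int
sk_call(struct sk_layer *l, struct sk_desc *d)
{
	assert(d != NULL && d->op >= 0 && d->op < SK_NOPS);
	while (l != NULL) {
		if (l->ops[d->op] != NULL)
			return (l->ops[d->op](l, d));
		l = l->lower;	/* null pass-down */
	}
	return (ENOENT);	/* the hard stop at the bottom */
}

/* FS1-style stacking read: no disk I/O, it calls the layer below. */
static int
fs1_read(struct sk_layer *self, struct sk_desc *d)
{
	return (sk_call(self->lower, d));
}

/* FS3-style local media read: the call ends here. */
static int
fs3_read(struct sk_layer *self, struct sk_desc *d)
{
	(void)self;
	(void)d;
	return (0);
}
```

With FS1 over FS2 over FS3, where FS2 has no read slot, a read descriptor given to FS1 reaches fs3_read() and returns 0, while a descriptor for an operation nobody implements falls through every layer and comes back as ENOENT.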

Note: It's possible to optimize this by instancing an
FS2+FS3 collapsed vector.  This would make the FS1 call
directly to the FS3, skipping a "null" VOP passdown.
FreeBSD doesn't do this, currently, so a "nullfs" will
actually end up adding overhead and vnode pointer (vp)
traversals, and allocating vnodes at the null layer for
encapsulation of the requests to push them down to the
next layer.  Very inefficient.  8-(.

So to recap:

1)	There is a soft version of a coalesced vector
	stack, which lets the consumer see only the top
	level entry point for each function.

2)	Things are passed by descriptor so that they are
	a single (usually "void") pointer, so that even
	though an intermediate layer doesn't know about
	the operation, it can still pass it on; without
	this, it would have to know about the arguments
	for the unknown function so that it could pop
	them off its call stack and push them back on the
	stack to call the underlying layer.  This can't
	work.

2a)	A size and type are included, so that descriptors
	can be proxied.  This means they can be passed
	over a single network pipe, or passed to user
	space, to let people develop VFS stacking layers
	there, or proxied elsewhere.

3)	There is a "hard stop" layer at the bottom.

4)	There is a difference between "local media" FS's
	and "VFS stacking layer" FS's: a "local media" FS
	is very much the same as a "hard stop" layer, in
	that the calls can not be pushed down further.

5)	The top level represents a consumer (in FreeBSD,
	there are two consumers: the system call interface,
	and the NFS server code).

6)	There are a lot of optimizations that could be
	done that FreeBSD isn't doing.

6a)	Interior vector collapse, to get rid of null VOPs
	in intermediate layers, reduce call overhead, and
	reduce resource wastage (e.g. vnodes per layer).

6b)	List vector sorting of all known VOPs.  This would
	permit instantiated stacks to save one lookup, one
	dereference, and one function call overhead, per
	call.

6c)	Optimized direct vector handoff.  This would let
	a stacking or implementation layer make a VOP
	call itself, without the descriptor unpack and
	repack that's currently necessary. 
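
To illustrate points 2 and 2a, here is a small sketch in C of why a size and type in the descriptor are enough for proxying.  The structures and function names are hypothetical, not the real vop argument structures.  The assumption is that every descriptor starts with a common header, so code that knows nothing about a particular operation's arguments can still copy the whole thing into a flat buffer and hand it on.  A real proxy would also have to deal with embedded pointers and byte order, which this ignores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical common header at the start of every descriptor. */
struct sk_desc {
	uint32_t type;		/* which operation */
	uint32_t size;		/* total bytes, header included */
};

/* One concrete descriptor; only implementing layers know this layout. */
struct sk_read_desc {
	struct sk_desc hdr;
	uint64_t offset;
	uint32_t length;
};

/*
 * Copy any descriptor into a flat buffer (e.g. for a pipe or socket)
 * without knowing its arguments.  Returns the number of bytes written,
 * or 0 if the descriptor is malformed or the buffer is too small.
 */
static size_t
sk_desc_pack(const struct sk_desc *d, void *buf, size_t buflen)
{
	assert(d != NULL && buf != NULL);
	if (d->size < sizeof(*d) || d->size > buflen)
		return (0);
	memcpy(buf, d, d->size);
	return (d->size);
}

/*
 * Read back just the header from a received buffer.  Returns 0 on
 * success, -1 if the buffer cannot hold the descriptor it claims.
 */
static int
sk_desc_peek(const void *buf, size_t len, struct sk_desc *hdr)
{
	assert(buf != NULL && hdr != NULL);
	if (len < sizeof(*hdr))
		return (-1);
	memcpy(hdr, buf, sizeof(*hdr));
	if (hdr->size < sizeof(*hdr) || hdr->size > len)
		return (-1);
	return (0);
}
```

The receiving side can look at the type to decide whether it implements the operation, and if not, forward the same size bytes onward unchanged.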

It's all a pretty trivial assembly of data structures, if
you understand the linkages.  So it's not hard to understand,
it's just not often explained properly for students.

Mostly, students are expected to read the FICUS papers, and
in particular, John Heidemann's Master's Thesis on VFS
stacking architecture in FICUS: this is the same code that
he and UCLA donated to CSRG, and which became the VFS
stacking in 4.4BSD, and thus FreeBSD.

-- Terry

