From owner-freebsd-hackers  Thu Jan  7 18:49:30 1999
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id SAA17378
          for freebsd-hackers-outgoing; Thu, 7 Jan 1999 18:49:30 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id SAA17373
          for <freebsd-hackers@FreeBSD.ORG>; Thu, 7 Jan 1999 18:49:29 -0800 (PST)
          (envelope-from tlambert@usr01.primenet.com)
Received: (from daemon@localhost)
	by smtp02.primenet.com (8.8.8/8.8.8) id TAA20897;
	Thu, 7 Jan 1999 19:48:59 -0700 (MST)
Received: from usr01.primenet.com(206.165.6.201)
 via SMTP by smtp02.primenet.com, id smtpd020857; Thu Jan  7 19:48:51 1999
Received: (from tlambert@localhost)
	by usr01.primenet.com (8.8.5/8.8.5) id TAA03601;
	Thu, 7 Jan 1999 19:48:50 -0700 (MST)
From: Terry Lambert <tlambert@primenet.com>
Message-Id: <199901080248.TAA03601@usr01.primenet.com>
Subject: Re: questions/problems with vm_fault() in Stable
To: dillon@apollo.backplane.com (Matthew Dillon)
Date: Fri, 8 Jan 1999 02:48:50 +0000 (GMT)
Cc: tlambert@primenet.com, dyson@iquest.net, pfgiffun@bachue.usc.unal.edu.co,
        freebsd-hackers@FreeBSD.ORG
In-Reply-To: <199901072306.PAA35328@apollo.backplane.com> from "Matthew Dillon" at Jan 7, 99 03:06:21 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Now we deal with collapsing of layers:


> :would be another example, where multiple NULLFS instances collapsed
> :to *no* local vnode definitions, and one call boundary.  Instead,
> :you are suggesting that we instance vnodes in each NULLFS layer,
> 
>     You are assuming that these things are collapsible, but very *few*
>     VFS layers are actually collapsible.  For example, there is no way
>     you could possibly collapse a RAID or encryption layer or a mirroring
>     mid-layer.  You can't collapse an MFS layer that is file-backed.
>     You can't collapse a mirror.   You can't collapse a VN device due
>     to partition translations.  In all cases the intermediate layers
>     can be independently accessed and, in fact, it is *desirable* to
>     have the ability to independently access them.


The intent of using non-vnode originating layers is twofold:

(1)	It gets rid of the coherency issues we've discussed so far.

(2)	It allows for layer collapse, so that the virtual code path
	ends up being much smaller than the real code path.

The first of these has been discussed nearly to death.  Suffice it
to say that coherency problems come from complexity, and not all
complexity has value, in and of itself.


The second is a more interesting posit.

Consider the case where I stack 500 NULL stacking layers on
top of a mount point.  If each layer transition required a vnode
translation, this would take a very long time.

Well, what's a NULLFS?

The NULLFS is primarily a kludge to allow relocation of directories
in the filesystem hierarchy.  This may at first seem to be a useful
and necessary function.  But in fact its utility grows out of where
the mapping of directories into the hierarchy is implemented in the
first place: in the per-VFS mount routines.

Because the mapping of a FS into the directory hierarchy is done
at FS mount time, instead of in common upper level code, there
are a number of consequences.  Among these are:

o	You have to treat the root FS mount as a special case;
	this is necessitated by the need to remount root as
	read/write using a device that may not be the same as
	the device provided in the boot procedure (it may, instead,
	be an alias -- of a different sort than the VM aliases we
	have previously discussed, in this case a device alias --
	which owes more to the implementation of special devices
	as files in a "SPECFS" than it does to necessity).

o	There is an artificial distinction between a root mount
	and an inferior mount point mount.  If FS's were not
	distinguished in this way, but instead kept in a global
	table, then a general (and therefore more reliable) set
	of routines could map from the table into the hierarchy.
	This also means that some FS's can be mounted as inferior
	FS's within the hierarchy, but *can't* be used as the root
	FS.

o	In order to map anything into the directory hierarchy, it
	has to be the root of a VFS instance.  This is because
	to access the mapping mechanism, you must invoke it by
	way of some VFS-specific mount code into which it has
	been embedded.

So if we resolve this, where is the utility of the NULLFS?  It
lies in its ability to act as a sample implementation of a
minimal semantic VFS layer.

Adding to this minimum by requiring a vnode factory in the NULLFS,
and VM alias objects for the underlying VM objects, greatly
complicates the minimal implementation.  It also precludes layer
collapse, unless it's predicated on the idea of the "default" VOPS
being, in effect, a NULLFS themselves.


How does collapsing work?

Collapsing does *not* work, as implied in the discussion by Matt,
as a run-time short circuit.

Collapsing is intended to occur at FS mount time.

When an FS is mounted, for every VOP in the descriptor array
defined in the structure in (incorrectly compile-time generated)
vnode_if.c, a VOP descriptor reference is instanced.  [Note: if
these descriptors are also sorted, then we can get rid of two
pointer dereferences and a lot of reformatting glue code, and
reference by array offset instead of by descriptor pointer
reverse lookup].
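
To make the bracketed note concrete, here is a minimal sketch of the
difference between a reverse lookup by descriptor pointer and a
reference by array offset.  Everything in it (op_desc, op_entry,
find_by_desc, instance_vector, the OP_* offsets) is a hypothetical
stand-in invented for illustration; it is not the real vnode_if.c
layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real vnode_if.c types. */
enum { OP_LOOKUP, OP_READ, OP_WRITE, OP_COUNT };  /* sorted offsets */

typedef int (*vop_fn)(void *args);

struct op_desc {
	int         offset;     /* fixed slot in any instanced vector */
	const char *name;
};

static const struct op_desc op_descs[OP_COUNT] = {
	{ OP_LOOKUP, "lookup" },
	{ OP_READ,   "read"   },
	{ OP_WRITE,  "write"  },
};

/* Per-VFS table as a filesystem declares it: unordered pairs. */
struct op_entry {
	const struct op_desc *desc;
	vop_fn                fn;
};

/* Reverse lookup: search the pairs for a matching descriptor pointer. */
static vop_fn
find_by_desc(const struct op_entry *tbl, size_t n,
    const struct op_desc *d)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].desc == d)
			return tbl[i].fn;
	return NULL;
}

/* Instance once into a vector indexed directly by descriptor offset. */
static void
instance_vector(const struct op_entry *tbl, size_t n,
    vop_fn vec[OP_COUNT])
{
	for (int i = 0; i < OP_COUNT; i++)
		vec[i] = NULL;
	for (size_t i = 0; i < n; i++) {
		int off = tbl[i].desc->offset;

		assert(off >= 0 && off < OP_COUNT);
		vec[off] = tbl[i].fn;
	}
}

/* Two toy operations so the tables have something to point at. */
static int demo_read(void *args)   { (void)args; return 42; }
static int demo_lookup(void *args) { (void)args; return 7; }
```

After instance_vector() has run once, a caller reaches an operation
as vec[OP_READ], with no search; find_by_desc() is only there to show
what the per-call search would otherwise look like.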

For VOP's defined by a VFS, the descriptor is taken from the per
VFS array of descriptors.

For VOP's that *aren't* defined by a VFS, the descriptor is taken
from the underlying VFS upon which it is stacked, and so on, until
it gets to the bottom, where the VOP's that are substituted return
a "not implemented" error.
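
The rule in the last three paragraphs can be sketched roughly as
follows.  This is an illustration under stated assumptions, not the
kernel's code: vfs_def, mount_inst, mount_instance and vop_notimpl
are hypothetical names, and a NULL slot is assumed to mean "this VFS
does not define the VOP".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { OP_LOOKUP, OP_READ, OP_GETATTR, OP_COUNT };

typedef int (*vop_fn)(void *args);

/* What a VFS declares; NULL means the VOP is not defined by it. */
struct vfs_def {
	const char *name;
	vop_fn      ops[OP_COUNT];
};

/* What a mounted instance holds: a fully resolved vector. */
struct mount_inst {
	const struct vfs_def *def;
	vop_fn                vops[OP_COUNT];
};

/* Substituted at the bottom of the stack for undefined VOPs. */
static int
vop_notimpl(void *args)
{
	(void)args;
	return EOPNOTSUPP;
}

/*
 * Resolve every slot once, at mount time.  "lower" is the already
 * mounted instance this one is stacked on, or NULL at the bottom.
 */
static void
mount_instance(struct mount_inst *mi, const struct vfs_def *def,
    const struct mount_inst *lower)
{
	assert(mi != NULL && def != NULL);
	mi->def = def;
	for (int i = 0; i < OP_COUNT; i++) {
		if (def->ops[i] != NULL)
			mi->vops[i] = def->ops[i];      /* own VOP */
		else if (lower != NULL)
			mi->vops[i] = lower->vops[i];   /* inherited */
		else
			mi->vops[i] = vop_notimpl;      /* bottom */
	}
}

/* Toy bottom filesystem: defines lookup and read, not getattr. */
static int ufs_lookup(void *args) { (void)args; return 1; }
static int ufs_read(void *args)   { (void)args; return 2; }

static const struct vfs_def ufs_def = {
	.name = "ufs",
	.ops  = { [OP_LOOKUP] = ufs_lookup, [OP_READ] = ufs_read },
};

/* Toy null layer: defines nothing at all. */
static const struct vfs_def null_def = { .name = "null" };
```

Mounting ufs_def with lower == NULL and then null_def on top of it
leaves the top instance's read slot pointing straight at ufs_read,
and its getattr slot at vop_notimpl.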


What does this mean for a stack of 500 NULLFS instances?

What it means is that for most VOPs (all VOPs, if the VFS architecture
weren't currently screwed up by null_bypass and some ill-considered
direct references to NULLVPTOLOWERVP), the VOP's inherit from the
bottom-most VFS!

It *also* means that the overhead in figuring this out occurs at
the time the VFS is instanced, *not* at runtime.

So what's the overhead?  499 descriptor dereferences on top of 1
descriptor dereference?  No.  1 descriptor dereference through the
instanced VOPS array.
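
A toy comparison of the two costs, as a sketch only.  The layer
structure, build_stack, read_bypass and read_collapsed are all
hypothetical names, and the sketch assumes one bottom FS with the
null layers stacked above it, so with 500 null layers the run-time
walk makes 500 layer transitions per call while the collapsed call
makes none.

```c
#include <assert.h>
#include <stddef.h>

#define NULL_LAYERS 500

typedef int (*vop_fn)(void *args);

struct layer {
	struct layer *lower;          /* NULL at the bottom */
	vop_fn        own_read;       /* NULL: layer defines no read */
	vop_fn        resolved_read;  /* filled in at mount time */
};

static int bottom_read(void *args) { (void)args; return 123; }

/*
 * layers[0] is the bottom FS; layers[1..n-1] are null layers.
 * Each "mount" copies the already resolved slot from below, so the
 * work is done once per mount, not once per call.
 */
static struct layer *
build_stack(struct layer *layers, size_t n)
{
	assert(n > 0);
	layers[0].lower = NULL;
	layers[0].own_read = bottom_read;
	layers[0].resolved_read = bottom_read;
	for (size_t i = 1; i < n; i++) {
		layers[i].lower = &layers[i - 1];
		layers[i].own_read = NULL;
		layers[i].resolved_read = layers[i - 1].resolved_read;
	}
	return &layers[n - 1];
}

/* Run-time short circuit: every call walks down the stack. */
static int
read_bypass(const struct layer *l, void *args, unsigned *hops)
{
	assert(l != NULL && hops != NULL);
	while (l->own_read == NULL) {
		l = l->lower;
		(*hops)++;
	}
	return l->own_read(args);
}

/* Mount-time collapse: one indirect call through the slot. */
static int
read_collapsed(const struct layer *top, void *args)
{
	return top->resolved_read(args);
}
```

Both paths return the same result from the bottom FS; only the
per-call path length differs.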

How do we address the objection:

>     Introducing vnodes to the null stacking layer does not change the
>     coherency problems associated with the current VFS layering one 
>     iota.  You are, again, assuming that the coherency issue will be 
>     magically solved by collapsing VFS layers and ignoring the fact
>     that most VFS layers (A) can't be collapsed, and (B) that your coherency
>     solution fails utterly the moment you take a network hop.

We address it by noting that most VFS layers (A) *can* be collapsed,
and (B) that, as far as coherency goes, those that *can't* be
collapsed, like the NFS client VFS, or the OTPFS, *don't* have real
aliases, only virtual aliases.

When the collapse occurs, what happens is that the *intermediate*
no-op VOP's are collapsed out, even if they have intervening VOP's
that *can't* be collapsed out.
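
A sketch of a mixed stack, to show the collapse working per VOP
rather than per layer.  The names are hypothetical (mount_layer,
call_vop, a toy "crypt" layer that XORs its result), and the sketch
assumes each resolved slot remembers which instance defined it, so
that a non-collapsible VOP can still find its own lower layer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { OP_READ, OP_GETATTR, OP_COUNT };

struct mount_inst;
typedef int (*vop_fn)(const struct mount_inst *owner, int arg);

/* A resolved slot: the function plus the instance that defined it. */
struct slot {
	vop_fn                   fn;
	const struct mount_inst *owner;
};

struct mount_inst {
	const struct mount_inst *lower;   /* NULL at the bottom */
	struct slot              vops[OP_COUNT];
};

static int
notimpl(const struct mount_inst *owner, int arg)
{
	(void)owner; (void)arg;
	return EOPNOTSUPP;
}

/* Mount-time resolution, one slot at a time. */
static void
mount_layer(struct mount_inst *mi, const vop_fn own[OP_COUNT],
    const struct mount_inst *lower)
{
	mi->lower = lower;
	for (int i = 0; i < OP_COUNT; i++) {
		if (own[i] != NULL) {
			mi->vops[i].fn = own[i];
			mi->vops[i].owner = mi;
		} else if (lower != NULL) {
			mi->vops[i] = lower->vops[i];  /* collapse */
		} else {
			mi->vops[i].fn = notimpl;
			mi->vops[i].owner = mi;
		}
	}
}

static int
call_vop(const struct mount_inst *mi, int op, int arg)
{
	assert(op >= 0 && op < OP_COUNT);
	return mi->vops[op].fn(mi->vops[op].owner, arg);
}

/* Bottom FS: pretend the data read is just "arg". */
static int
ufs_read(const struct mount_inst *o, int arg)
{
	(void)o;
	return arg;
}

static int
ufs_getattr(const struct mount_inst *o, int arg)
{
	(void)o;
	return 1000 + arg;
}

/* Not collapsible for read: it transforms what comes from below. */
static int
crypt_read(const struct mount_inst *o, int arg)
{
	return call_vop(o->lower, OP_READ, arg) ^ 0x5a;
}

static const vop_fn ufs_ops[OP_COUNT] = {
	[OP_READ] = ufs_read, [OP_GETATTR] = ufs_getattr,
};
static const vop_fn crypt_ops[OP_COUNT] = { [OP_READ] = crypt_read };
static const vop_fn null_ops[OP_COUNT]  = { NULL };
```

With the stack ufs, null, crypt, null, null (bottom to top), the top
getattr slot resolves directly to ufs_getattr, the top read slot
resolves directly to crypt_read, and crypt_read's call to its lower
layer lands on ufs_read without passing through the null layer
between them.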

It is this inherent call-graph reduction which makes it worthwhile
to stack a large number of semantic layers in the first place, and
which makes it an error to introduce vnodes to layers which don't
gain any benefit from having direct VM object references and/or
don't need to support semantics for the underlying layers on a
per-object basis (even then, layers that need this, such as an
ACLFS layer, can "cheat" by file-based tunneling to get away from
the requirement; this is, in fact, what UFS does when it puts its
quota file references in the in-core superblock on a per FS basis
instead of in a hidden file in each directory on a per-file basis).


> :Works on SunOS.  Works on Solaris.  If you have a source license,
> :or sign non-disclosure, John Heidemann will show you the code.
> 
>     Explain to me how it works rather than point me at three hours worth of
>     research that I have to 'understand' to understand your point. 

No VOP_BYPASS is needed.  Because this is introduced by BSD, BSD
has these problems.

You can see the reasoning (which is no longer valid) for the VOP_BYPASS
in /sys/miscfs/nullfs/null_vnops.c in front of null_bypass().


>     There are already a number of situations where coherency
>     tracking is desirable.  Extending the model across a network
>     tops the list.  Being able to use a coherent mmap() on a common
>     NFS-served partition from N different machines, for example.

The MNFS code for FreeBSD from the David Sarnoff Center already
addresses the issue of distributed cache coherency, and does it
elegantly, without introducing a whole raft of complexity.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
