From owner-freebsd-hackers Wed Jan 6 19:16:07 1999
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8)
	id TAA21549 for freebsd-hackers-outgoing; Wed, 6 Jan 1999 19:16:07 -0800 (PST)
	(envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from smtp02.primenet.com (smtp02.primenet.com [206.165.6.132])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id TAA21523 for ;
	Wed, 6 Jan 1999 19:16:03 -0800 (PST)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp02.primenet.com (8.8.8/8.8.8)
	id UAA05128; Wed, 6 Jan 1999 20:15:34 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp02.primenet.com,
	id smtpd005075; Wed Jan 6 20:15:24 1999
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5)
	id UAA10543; Wed, 6 Jan 1999 20:15:22 -0700 (MST)
From: Terry Lambert
Message-Id: <199901070315.UAA10543@usr09.primenet.com>
Subject: Re: questions/problems with vm_fault() in Stable
To: dillon@apollo.backplane.com (Matthew Dillon)
Date: Thu, 7 Jan 1999 03:15:21 +0000 (GMT)
Cc: tlambert@primenet.com, dyson@iquest.net, pfgiffun@bachue.usc.unal.edu.co,
	freebsd-hackers@FreeBSD.ORG
In-Reply-To: <199901062259.OAA25909@apollo.backplane.com> from "Matthew Dillon"
	at Jan 6, 99 02:59:19 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

> I think you have to get away from thinking about 'consumers' and
> 'providers'.  It is precisely this sort of thinking that screwed
> up the existing VFS design.
>
> The best way to abstract a VFS layer is consider that each VFS layer
> has a 'frontside' and 'backside'.
I think you are confusing the definitional aspects of "frontside" and
"backside"; the point of specifying a "consumer" at all is to define
the interface on the top of the module at the top of the stack, or the
interface on the bottom of the module at the bottom of the stack.

These particular modules are singularly uninteresting, as far as their
ability to act as anything other than pigs, where "some pigs are more
equal than others".  They contribute relatively little to the game,
other than acting as "stream head" or "stream tail" for the interesting
parts of the stack.  And, of course, they act as a living history of
how the architecture was wedged in wrong in the first place, as you
look through the various usages of "struct fileops" in the kernel: the
pipe code, the socket code, and the vnops code.  Why aren't pipes and
sockets vnodes, so that the file access interface can be normalized?
Why can't I call fcntl() on a FIFO to use the F_GETOWN/F_SETOWN
interfaces?  Brain damage.

> The VFS layer should make no
> assumptions whatsoever as to who attaches to it on the frontside,
> and who it is attached to on the backside.

Fine and dandy, if you can tell me the answers to the following
questions:

1)	The system call layer makes VFS calls.  How can I stack a VFS
	*on top of* the system call layer?

2)	The NFS server VFS makes RPC calls.  How can I stack a VFS
	*under* the NFS server VFS?

The problem exists in streams as well.  Somewhere, there has to be a
stream head.  And on the other end, somewhere there has to be a driver.

> If you really want, you could consider a 'consumer' to be the VFS
> layer's backside and a 'provider' to be the SAME VFS layer's frontside.
> So a VFS layer's backside 'consumer' is linked to another VFS layer's
> frontside 'provider'.  And so forth.  But don't try to 'type' a VFS
> layer -- it doesn't work.  It was precisely that sort of thinking
> that required something like the MFS filesystem, which blurs
> distinctions, to be a major hack in existing kernels.
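To make the "struct fileops" point above concrete, here is a userland
sketch, not the kernel's actual definitions: the type and field names
are hypothetical, modeled loosely on the kernel's fileops table.  It
shows what a normalized file access interface looks like, with the
dispatch layer never knowing whether it has a pipe, a socket, or a
vnode underneath:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, normalized per-type operations table; every file
 * type (pipe, socket, vnode) would supply one of these.
 */
struct xfile;

struct xfileops {
	int (*fo_read)(struct xfile *fp, char *buf, size_t len);
	int (*fo_close)(struct xfile *fp);
};

struct xfile {
	const struct xfileops *f_ops;	/* same table shape for every type */
	const char *f_type;		/* "pipe", "socket", "vnode", ... */
};

/* Stub "pipe" implementation, standing in for real pipe I/O. */
int
pipe_read(struct xfile *fp, char *buf, size_t len)
{
	(void)fp;
	strncpy(buf, "pipe", len);
	return (0);
}

int
pipe_close(struct xfile *fp)
{
	(void)fp;
	return (0);
}

const struct xfileops pipeops = { pipe_read, pipe_close };

/*
 * The system call layer dispatches blindly through the table; it
 * never needs to special-case the object type.
 */
int
generic_read(struct xfile *fp, char *buf, size_t len)
{
	return (fp->f_ops->fo_read(fp, buf, len));
}
```

If F_GETOWN/F_SETOWN were routed through such a table, a FIFO would
get them for free instead of being special-cased.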
I'm not trying to 'type' a VFS layer.  The problem is that some idiot
(who was right) thought it'd be faster to implement block access in
the FS's that need block access, instead of creating a generic "stream
tail" that implemented the buffer cache interface.

If they had done that, then VOP_GETPAGES/VOP_PUTPAGES would directly
access the VOP_GETBLOCKRANGE/VOP_PUTBLOCKRANGE of the underlying tail,
and FFS could stack on top of it, and "stack" on top of other FS's
(although it would only use a subset of the operations, which would
pretty much result in it doing the same thing as if it weren't
stacked, unless the lower FS also implemented
VOP_GETBLOCKRANGE/VOP_PUTBLOCKRANGE -- for example, to implement
"vinum" as a stacking layer).

> The only way to do cache coherency through a multi-layered VFS design
> is to extend the vm_object model.  You *cannot* require that a VM
> system use VOP_GETPAGES or VOP_PUTPAGES whenever it wants to verify
> the validity of a page it already has in the cache.  If a page is sitting
> in the cache accessible to someone, that someone should be able to use
> the page immediately.  This is why a two-way cache coherency protocol
> is so necessary, so things that effect coherency can be propogated
> back up through the layers rather then through hacks.  Requiring the
> GET/PUTPAGES interface to be used in a cache case destroys the efficiency
> of the cache and, also, makes it virtually impossible to implement async
> I/O.  The VFS layer, as it stands, cannot do async I/O - the struct buf
> mechanisms 'sorta' does it, but it isn't really async due to the huge
> number of places where the system can block even before it returns a bp.

OK.  You are considering the case where I have two vnodes pointing to
the same page, and I invalidate the page in the underlying vnode, and
asking "how do I make the reference in the upper vnode go away?",
right?

The way you "make the reference in the upper vnode go away" is by not
putting a blessed reference there in the first place.
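A minimal sketch of the generic "stream tail" described above: the
VOP_GETBLOCKRANGE/VOP_PUTBLOCKRANGE names come from this message, but
everything else (the structures, the one-block-per-page equivalence)
is invented here for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLKSIZE	512
#define NBLKS	8

/* The "stream tail": a block store standing in for the buffer cache. */
struct blocktail {
	char store[NBLKS * BLKSIZE];
};

int
vop_getblockrange(struct blocktail *bt, int blkno, int n, char *buf)
{
	if (blkno < 0 || n < 0 || blkno + n > NBLKS)
		return (-1);
	memcpy(buf, bt->store + blkno * BLKSIZE, (size_t)n * BLKSIZE);
	return (0);
}

int
vop_putblockrange(struct blocktail *bt, int blkno, int n, const char *buf)
{
	if (blkno < 0 || n < 0 || blkno + n > NBLKS)
		return (-1);
	memcpy(bt->store + blkno * BLKSIZE, buf, (size_t)n * BLKSIZE);
	return (0);
}

/*
 * An FFS-like layer stacked on the tail: its page operations are pure
 * translation into block-range calls, so block layout policy stays in
 * the FS while storage access stays in the tail.  Pages and blocks
 * are the same size here, purely for brevity.
 */
struct stackedfs {
	struct blocktail *fs_tail;
};

int
fs_getpages(struct stackedfs *fs, int pgno, char *page)
{
	return (vop_getblockrange(fs->fs_tail, pgno, 1, page));
}

int
fs_putpages(struct stackedfs *fs, int pgno, const char *page)
{
	return (vop_putblockrange(fs->fs_tail, pgno, 1, page));
}
```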
Problem solved.  There is no coherency problem, because the problem
page is not cached in two places; the page's validity is known by
whether or not its valid bit is set.

What you *do* have to do is go through VOP_GETPAGES/VOP_PUTPAGES if
you want to change the status of a page that you are addressing via a
vnode reference, through one or more stacking layers which may choose
to translate that reference.  More formally, you can't make a page
accessed this way appear without doing a VOP_GETPAGES, or disappear
without a VOP_PUTPAGES.  And that's the purpose in life of the vnode
pager.

> An extended vm_object and cache coherency model would, for example,
> allow something like MFS, VN, or VINUM to be implemented almost trivially
> and definitely more efficiently, even unto having filesystems relocate
> underlying storage on the fly.

You could implement these things rather trivially as it is, if the
bottom end VFS were a variable granularity block store instead of a
"file system" that managed its blocks directly, with the caveat that
stacking something that manages block layout on anything other than a
variable granularity block store layer would be pretty darn useless,
since it would never invoke an inferior VOP that implemented policy.

Of course, you're aiming at your foot if you do this.  Consider an FS
that implements ACLs via a VOP_ACL and manages its own block layout,
and then stack something like FFS (which *doesn't* implement a
VOP_ACL) on top of that.  Now call a system call that calls VOP_ACL,
and watch it spam your FFS contents out from under you as it acts
unexpectedly.

If you insist on separating the block management into a stacking
layer, then you will *at least* have to 'type' the stacking layers, to
avoid stacking a block-management-only consumer on top of another
similar consumer, and thereby prevent direct block manipulation by an
otherwise unprotected call.
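The typing rule just described can be sketched as follows; the layer
type names and the vfs_stack() helper are hypothetical, invented here
for illustration, not existing kernel interfaces:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical layer types: if block management is split out into a
 * stacking layer, the stack-assembly code must know which layers own
 * block layout so it can reject unsafe combinations.
 */
enum vfs_layer_type {
	VFS_BLOCK_STORE,	/* variable granularity block store tail */
	VFS_BLOCK_MANAGER,	/* manages block layout itself (FFS-like) */
	VFS_TRANSLATOR		/* pure translation layer (nullfs-like) */
};

struct vfs_layer {
	enum vfs_layer_type l_type;
	struct vfs_layer *l_below;
};

/* Returns 0 on success, -1 if the combination is unsafe. */
int
vfs_stack(struct vfs_layer *upper, struct vfs_layer *lower)
{
	/*
	 * A block manager must never sit on another block manager:
	 * the upper layer's direct block manipulation would spam the
	 * lower FS's contents out from under it, exactly the VOP_ACL
	 * failure mode described above.
	 */
	if (upper->l_type == VFS_BLOCK_MANAGER &&
	    lower->l_type == VFS_BLOCK_MANAGER)
		return (-1);
	upper->l_below = lower;
	return (0);
}
```

The check buys safety at the cost of exactly the layer 'typing' that
Dillon objects to; that trade-off is the point of the paragraph above.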
I think this would be a bad precedent, though I do like the idea of
the buffer cache interface being represented as a variable granularity
block store.  But then, that's what devices are for.

Say you don't buy this argument.  OK, then what VFS does the NFS
client VFS stacking layer stack on top of?  It doesn't stack on top of
the buffer cache.  You're stuck implementing all of the service
interfaces in the entire system as VOP's.  Not a nice thing.

Now the "head" is another interesting issue.  In streams, the head is
exported as a device.  But in VFS stacking, the "head" is implicitly
abstracted via system calls.  This isn't really a bad thing, but it
allows kernel engineers to do stupid things, like treating system
calls that consume the VFS interface as if they were somehow special,
compared to an NFS server, a SAMBA server, an AppleTalk server, or
some VFS stacking layer that consumes a VFS interface.

In general, I have to say that I think you are setting yourself up for
some hairy problems; at some point, you will have to make a design
compromise, and if you don't go into it with this idea in your head in
the first place, it's going to be a surprise instead of something you
planned.  Probably a nasty surprise.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message