From owner-freebsd-chat  Sat Apr  3 13:16:43 1999
From: Terry Lambert
To: toor@dyson.iquest.net (John S. Dyson)
Cc: tlambert@primenet.com, dyson@iquest.net, hamellr@dsinw.com,
    unknown@riverstyx.net, freebsd-chat@FreeBSD.ORG
Date: Sat, 3 Apr 1999 21:14:32 +0000 (GMT)
Subject: Re: Linux vs. FreeBSD: The Storage Wars
In-Reply-To: <199904010647.BAA19102@dyson.iquest.net> from "John S. Dyson"
    at Apr 1, 99 01:47:20 am

> > Any putative efficiency penalties (granting their existence for the
> > sake of discussion) would be paid only by the stacking layers
> > themselves, and as it currently doesn't work, you aren't going to be
> > paying an efficiency penalty for anything you currently use.
> >
> > So efficiency is a NULL argument.
>
> It cannot be a NULL argument, because continually polishing the t*rd
> isn't really solving the problem.

And Occam's Razor implies that "anything that works is better than
anything that doesn't".

As long as people cling to an "evolutionary, not revolutionary" mentality
to keep themselves within their comfort zone, you aren't going to be able
to address architectural modifications until after you have a working
architecture to modify.

What you may view as "polishing the turd" is in fact the minimal set of
work necessary to get to a point where evolution is possible.

I personally prefer revolution, since it moves things ahead a hell of a
lot faster, but I'm not in charge.

> > IF VM alias objects are to be introduced (and that's a big mother "if",
> > in my opinion), it should only be done *after* it is proven, using
> > formal analysis methods, that unintentional aliases have been rendered
> > impossible.
>
> The current VM backing scheme is correct and needs only minor extension.
> In fact, the VM backing is natural (e.g. copy on write), whilst the
> current VFS layering doesn't handle the needed semantics for coherency
> without lots of traversal of the layers.

I don't buy this traversal argument, and I don't buy the coherency
argument.

VFS stacking layers are translational, functional, or semantic.

The semantic and functional layers don't need local cache, and can refer
to the underlying layers instead.  These two classes make up the *vast*
majority of what you would ever want to do with VFS stacking.

The translational layers need local cache.  For things like a
cryptographic or compressing FS, the local cache only *correlates* to,
not replicates the contents of, the underlying layer.  This means that
the coherency issue is one of synchronization, and cannot be eliminated
by hand waving or by fiat.
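(A minimal sketch of what I mean, in throwaway user-space C rather than
anything from the tree; the names and the XOR "translation" are purely
hypothetical stand-ins for real crypto or compression.  The point is
that the upper layer's cached bytes are a *function* of the lower
layer's bytes, so a write that reaches the lower layer forces an
explicit synchronization step; the pages can never simply be shared.)

/*
 * Hypothetical illustration (not kernel code): why a translational
 * layer's cache only *correlates* to the layer below it.  The "lower
 * layer" holds ciphertext; the upper layer caches plaintext.
 */
#include <stdio.h>
#include <string.h>

#define PAGE_SZ 16                              /* toy page size */

static unsigned char lower_page[PAGE_SZ];       /* "ciphertext" backing store */
static unsigned char upper_page[PAGE_SZ];       /* translational layer's cache */
static int upper_valid = 0;                     /* coherency state */

/* The translation: a stand-in for real crypto or compression. */
static void translate(const unsigned char *in, unsigned char *out)
{
	for (int i = 0; i < PAGE_SZ; i++)
		out[i] = in[i] ^ 0x5a;
}

/* Fill the upper cache from the lower layer (a toy "getpages"). */
static void upper_getpage(void)
{
	translate(lower_page, upper_page);
	upper_valid = 1;
}

/* A write that reaches the lower layer must invalidate the upper
 * cache; sharing one set of pages cannot express this, because the
 * bytes differ between the layers. */
static void lower_write(const unsigned char *data)
{
	memcpy(lower_page, data, PAGE_SZ);
	upper_valid = 0;                        /* explicit synchronization */
}

int main(void)
{
	lower_write((const unsigned char *)"ciphertext here!");
	upper_getpage();
	printf("upper cache valid: %d\n", upper_valid);

	lower_write((const unsigned char *)"new lower bytes!");
	printf("upper cache valid after lower write: %d\n", upper_valid);
	return 0;
}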
This leaves the small subset of translational layers in which the
translation is (A) linear, (B) scoped to page-sized objects, and (C)
scoped to page aligned boundaries (e.g., it leaves out useful linear
translations such as an FS that represents the tracks on audio CD's as
files).

As far as I can tell, this set contains the single element "MFS backed
by a vnode object as swap store".

> Bottom line, the VM backing already DOES work, or nothing in the system
> would work.

Bottom line: FreeBSD machines in the field are experiencing a
statistically rare type of file corruption that has common
characteristics, and FreeBSD versions including and prior to FreeBSD
2.2.6-RELEASE (*not* -STABLE) did not experience this corruption.

Below the bottom line: It ain't gremlins, and it ain't pilot error on
factory sealed units that don't offer shell access of any kind.

> > The only way I see clear for this to happen is if they don't both
> > exist in the code at the same time.
>
> Yep, get rid of the unintentional VFS layering bugs, by taking advantage
> of the already needed VM layering for any kind of reasonable VM behavior.

I'm referring to the existing VM bugs needing resolution before new VM
bugs are (potentially) introduced by design changes.

> That VM stuff is there anyway, so why muck it up with a parallel, and
> semantically incorrect (or inefficient) structure?  The VM layering
> already has the needed mechanisms for handling shared (and modified)
> memory "repositories."

Files do not exist merely to serve the VM system.  Files have abstract
existence as modelled objects in user programs and ANSI C.  Relatively
speaking, the VM has exposure as "the thing that makes mmap and sbrk
work".

Maybe there should be an entirely new, VM centric programming paradigm,
but until one is ratified by an IEEE committee and given a POSIX ID
number, I won't hold my breath.

> By constraining oneself to the current VFS layering, it simply
> complicates the system with two different kinds of layering schemes.
> Don't forget that sometimes generalization of a problem simplifies it
> -- and the VFS scheme is TOO conventionally-file oriented, and not
> very oriented towards data.

You mean like simultaneously having pty's, sockets, pipes, and FIFO's
complicates things?  Or like having CAM layering complicates things?

I don't forget about problem generalization, but you should not forget
that the major thorns in the VFS side are the *lack* of generalization
in the 4.4BSD kernel with regard to file descriptor objects mapping to
vnodes.

> The "file" abstraction is too specific.  I admit that the VM schemes
> need to be better documented for those who haven't read the Mach (and
> the new daemon book) information, but once the underlying principles
> are understood, it is clear that files are a paradigm that is too
> focused towards one kind of thinking.

I understand that everyone's baby is, to them, the central and most
important point of having a kernel in the first place.  For some people,
that's a realtime scheduler, and everything should be deterministic.

For me, I'd at least like to move forward on the VFS front without
having to reimplement everything in such a way as is most convenient
for the VM system.  I'm personally more concerned with dealing with the
VM issues that exist, and with being able to leverage VFS work taking
place on other platforms.
The second most important thing to me in this context is portability of
VFS code between operating system platforms, with the first being that
the VFS code actually *work* on one or more platforms in the first
place, so that VFS stacking actually takes place somewhere (you can't
leverage something without a fulcrum and a lever).

> Such new documentation would mostly be a repeat of already available
> materials anyway.

But it would be specific, and free of all the irrelevant and extraneous
information that would get in the way of someone attempting to move
forward while taking the constraints of the past into account.

> As soon as a "file" is abstracted to "memory objects", then things
> become easier.

For people who like to treat everything as a memory object.  Too bad
user space code has to deal with files.

> A memory object can reside anywhere, and have all kinds of inheritance
> attributes, and interrelations.  (A file can also, but the scheme as
> presented in 4.4BSD VFS doesn't do so -- and to expand the notion of
> file to what I call "memory objects" changes the current layering code
> so severely as to make it better to almost start over.)

I think inheritance of attributes is inherently evil.  I prefer
inheritance of semantics.  If you inherit attributes, there are no
accessor or mutator functions capable of being hooked to provide other
interesting interpositions and notifications.

It is more useful to me, writing a graphical file manager, for example,
to be able to ask for notification when a directory changes, than it is
for my MFS to be ten times instead of six times faster than the current
MFS.

> The Heidemann framework is a good document on the needed semantics
> from a file standpoint, but addresses weakly the issues of the memory
> objects (be they in memory, on disk, or across a network.)

That's because a memory object is a cached copy of data from somewhere
else.

> With correct protocols, the "memory object" scheme actually does what
> the programmer expects.  The current VFS layering framework only very
> weakly handles the issues of the "data containers" or "memory objects"
> themselves.  The non-bidirectional nature of the current layering also
> forgets the forward movement of OS design.  (Of course, if every I/O
> call or access to memory traverses the entire chain, then the current
> framework might work.)

That's ridiculous.  One of the design tenets, which Rosenthal didn't
address in his version of vnode stacking, is that null layers may be
collapsed out.

The current implementation is suffering in this regard *only* because
there are default VOPs that are actually expected to do something other
than return "not implemented".  This is an error, because in the case
where you add a new VOP, there is no longer a default VOP for every VOP:
it assumes you recompile, then reinstance, everything.  This flies in
the face of the documented, intended architecture.

> The memory oriented approaches eliminate (or at least handle) the
> aliasing and local caching issues correctly.

This is really, really irrelevant.  It assumes that all but a tiny
fraction of all possible file systems will be using local media.

If you don't solve the whole caching issue, you haven't solved the
caching issue.  The correct place for a coherency protocol is in a
separate module that can, if need be, be extended to encompass coherency
between disparately located cache instances and/or clusters of such
instances.  This is obvious from the Sarnoff work in that regard.
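(Another toy sketch, and none of these names exist anywhere; it is only
meant to show the shape of "coherency as a separate module": cache
instances -- local layers or remote peers -- register with the module
and are told when somebody else dirties an object, so the same mechanism
could later be stretched across a network or a cluster.)

/*
 * Hypothetical sketch of a coherency module kept apart from both the
 * VFS and the VM: whoever modifies an object tells the module, and the
 * module notifies every other registered cache instance.
 */
#include <stdio.h>

#define MAX_CACHES 8

typedef void (*invalidate_fn)(int object_id);

static invalidate_fn caches[MAX_CACHES];
static int ncaches;

/* A cache instance joins the coherency domain. */
static int coherency_register(invalidate_fn cb)
{
	if (ncaches >= MAX_CACHES)
		return -1;
	caches[ncaches] = cb;
	return ncaches++;
}

/* A cache instance announces that it modified an object; every other
 * instance is told to drop (or refetch) its copy. */
static void coherency_modified(int self, int object_id)
{
	for (int i = 0; i < ncaches; i++)
		if (i != self)
			caches[i](object_id);
}

static void local_invalidate(int object_id)
{
	printf("local layer: drop cached object %d\n", object_id);
}

static void remote_invalidate(int object_id)
{
	printf("remote peer: drop cached object %d\n", object_id);
}

int main(void)
{
	int local = coherency_register(local_invalidate);
	(void)coherency_register(remote_invalidate);

	/* The local layer writes object 42; only the remote peer is told. */
	coherency_modified(local, 42);
	return 0;
}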
Most of your cache problems are stemming from an assumption that copies
of the pages are hung off of multiple vnodes in a stack, and that this
occurs because vnodes are vm_object_t containers.

This is wrong from three perspectives:

1)	You should *ask* the FS what the correct vm_object_t is, instead
	of dereferencing it out of a vnode directly.

2)	Not all vnodes represent storage objects; many of them are
	containers for semantics, not data.

3)	Those which do contain data *still* don't necessarily represent
	storage objects, but instead, translations of storage objects.

It's *wrong* to turn a vnode into an alias for a vm_object_t.

> The original 4.4/2 framework was so bad that even local mmaped objects
> are only weakly coherent (actually not even that), let alone any other
> caching in the pipeline.  With the memory schemes, the problem solves
> itself (with only minor consideration for the additional expected file
> semantics.)  It is only by the proper implementation of VM coherency
> that the current code works local to a given vnode.  It is only a
> small VM extension, and definition for use, to make an entire layered
> scheme work.

I'll believe this when msync(2) becomes #define msync(x,y,z) /* nothing */.

> By reworking the entire VFS layering scheme (still looking somewhat
> like the current implementation, but properly abstracted) the entire
> solution (instead of a hack solution) can be made available.

I agree that the current code needs changes.  I just don't think it's a
good idea to trade one experiment that doesn't work, because the effort
hasn't been expended to put in the fixes, for another experiment that is
still not guaranteed to work, and still doesn't have the effort behind
it.

I believe the correct approach is to stabilize the existing code in line
with its original design document.

> Remember, both FILE and MEMORY data needs to be presented to the user,
> and FILE data is a narrow picture of memory.  MEMORY can easily be made
> more specific by presenting it as a file -- however expanding the
> semantics of a file to memory is more complex (especially with
> sharing.)  When a conversion to MEMORY from FILE and back again has to
> be done at every layer, then a scheme is going to be very inefficient
> or complex.  If the abstraction is kept as memory at each layer, then
> complexities are lessened.

First off, file data is a hell of a lot more prevalent than memory data.
Everything on the entire net can be represented as file data, and even
if you made a heroic effort, you simply don't have the bits to represent
it as memory.  Memory data must be viewed as cached copies of file data,
with the potential that there is no file data backing a particular
object (in other words, it's a cached copy of very volatile file data,
and only the cached copy is useful).

Second, as pointed out above, you only have to do the conversion at
layers where you imply a relationship between a vnode (file data) and a
vm_object_t (memory data).  This relationship need not be implied
*everywhere*; it need only be implied for:

1)	The terminal backing vnode itself

2)	Translational layers, where the translational layer vnode itself
	contains a cached version of the translated data from the
	terminal backing object

(Semantic layers, obviously, don't need vm_object_t's at all, since they
don't have persistent data.)

In the first case, there's no coherency issue, because there are no
"ghosts".
In the second case, there's a coherency issue, but it's an issue that
*must* be handled by the layer itself, since it contains the code that
understands the translation process (and its reverse, if any).  This
means that if you want to implement a cryptographic or a compressing FS,
*you have no choice* but to implement the code for VOP_{GET|PUT}PAGES,
and let the VFS layer be responsible for the semantics.  The same goes
for VFS layers like the NFS client.

The problems *only* occur when you *insist* that a non-terminal vnode
have a vm_object_t.  It is this insistence which is flawed, not the
architecture that can't serve this insistence.

> Since each layer might have to present a memory image (either as
> caching or mmap), then with a file representation, each layer has to do
> the "hard" conversion (given the anachronistic file-only abstraction.)
> There is NO cost in keeping the abstraction as memory as long as
> possible in the chain.  If a conversion is needed at machine
> boundaries, it might be possible to avoid the file abstraction
> entirely, and create a (MEMORY <--> SOCKET) protocol directly.  (It
> might not be needed to create and use a more complex
> (MEMORY <--> NFS <--> SOCKET) thing.)

This rests on an assumption that I believe has very little basis: that
most intermediate VFS stacking layers will have exposure in the
filesystem namespace.

You would *not* want this for cryptographic, ACL, file forking,
auditing, monitoring, activating, event generating, quota enforcing,
versioning, etc., etc. layers for the VFS's underneath them.  It would
totally defeat their utility.

The only cases where I can see this being useful are union, transparent,
overlay, whiteout, and similar situations.  These situations are
relatively unique for several reasons:

o	They generally apply to multiple exposure of *the same layer* in
	different places in the namespace, and thus there are no cache
	coherency issues.

o	Where they *do* expose both the VFS and the VFS it is stacked
	upon to the namespace, the stacked VFS is a hybrid semantic and
	translational VFS layer.

The interesting thing to note about hybrids is that they store their
data using either structural or namespace escapes.  In either case,
their data objects are effectively "exclusive use".  It would be a
trivial extension to allow an upper layer to tell a lower layer to
"guard these objects from change by anyone but me".

Do I want to rewrite the VFS layering so that the bottom end is not an
interface to the VM system at all, but instead a VFS representation of a
variable granularity block store, which itself interfaces to the VM on
behalf of all other VFS layers?  Yes.

Do I *need* to do this?  For some applications, it'd be damn convenient
to have the VFS architecture be symmetric for *all* VFS's, but it's not
a requirement to make the current architecture useful.

> > > Why do you put words in my mouth about doubling inode size?
> > > Straw man...
> >
> > You are mentioning ACL's.  The most current FS ACL work is being done
> > in NetBSD (not FreeBSD).  I thought you were referencing a modern
> > research project when you referenced ACL's.  My mistake.
>
> Yep... By assuming what I have been thinking about, it shows that
> arguments about such might be misguided.

Sorry.  I was recently approached about the work.  I should have looked
for zebras when I heard the ACL hoofbeats.  ;-)

> > Fie.  You are the one who originally posted about seeing years of
> > work frittered away.  I am not prepared to repeat that journey; it is
> > a fool's quest.
> Fallacious argument -- you aren't the author of the original code or
> those changes, are you?  The author of the code apparently accepted the
> changes.  (In fact, the changes were also compatible with other users
> and developers on the codebase.)

I'm the author of at least two instances of the fixes that resolve the
VFS stacking issues, both of which were "frittered away".

[ ... on storing 64 bits of seconds for ctime, atime, and mtime ... ]

> I suggest coming up with a solution then.  Of course, I suggest that
> UFS/ODS2 needs to be thought through.  Taking micro pot-shots doesn't
> really solve the problem (or the other problems that needed to be
> solved in the shorter term.)

Here's a solution:

1)	Back out the use of those fields for nanoseconds.

2)	Zero the co-opted fields.  This is the painful part, but you can
	use a bit in fs_unused_1 if you have to, or you could be truly
	sneaky and use inode 1.  My preference would be adding an
	fs_flag value.  However it's done, penance for the sin must be
	paid.

3)	Take a spare field for nanoseconds for *just* mtime.

4)	Modify make to look there instead of at the mtime element of the
	co-opted fields.

5)	Modify the quota code to put a header containing a magic number
	and a version number in the frigging file, so that we can tell
	the difference between old and new files at mount time.

6)	Make time_t an int64_t instead of an int32_t in the new file
	format, and eat the performance hit with the old file format.

7)	While you are at it, remove the 32 bit limitation on the block
	count.

8)	Add an option to quota(1) to upgrade the file.

9)	Know in your heart of hearts that quotas should be implemented
	as a VFS stacking layer using a namespace override, so that
	quotas can work on any filesystem, not just UFS.

10)	Find someone willing to commit the code before you write it.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message