From owner-freebsd-chat Thu Sep 28 18:16:41 2000
Delivered-To: freebsd-chat@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP id 6670837B422
	for ; Thu, 28 Sep 2000 18:16:12 -0700 (PDT)
Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3)
	id SAA29308; Thu, 28 Sep 2000 18:14:45 -0700 (MST)
Received: from usr05.primenet.com(206.165.6.205) via SMTP by smtp03.primenet.com,
	id smtpdAAArma4j5; Thu Sep 28 18:14:35 2000
Received: (from tlambert@localhost) by usr05.primenet.com (8.8.5/8.8.5)
	id SAA06192; Thu, 28 Sep 2000 18:15:58 -0700 (MST)
From: Terry Lambert
Message-Id: <200009290115.SAA06192@usr05.primenet.com>
Subject: Re: SGI releases XFS under GPL
To: jrs@enteract.com (John Sconiers)
Date: Fri, 29 Sep 2000 01:15:58 +0000 (GMT)
Cc: blk@skynet.be (Brad Knowles), wjv@cityip.co.za (Johann Visagie),
	chat@FreeBSD.ORG
In-Reply-To: from "John Sconiers" at Sep 28, 2000 09:18:23 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > However, XFS doesn't have "softupdates", and I don't know of any
> > way to apply something like "softupdates" to it.  And for what we're
> > doing, I'm not sure how much it matters to us to have something like
> > Veritas VxFS on our machines if that meant we'd have to give up
> > "softupdates".
> >
> > All-in-all, I'm just not sure if the overall net change would be
> > a positive or a negative, and for whom.
>
> Can you please explain the difference between XFS and soft updates,
> and why soft updates would be more desirable than a journaling file
> system.  I understand what XFS is, but based on your comments I get
> the feeling that I have the wrong impression of what soft updates is
> and how it performs.  I know there are papers on the subject(s).
> Anyone got a link?
Here is a link to an abstract that has links to the cover sheet, the
paper, and "Appendix A", which is the sanitized source code from their
SVR4 implementation:

	http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

Soft updates ensures metadata consistency; that's all it is supposed
to do, and that's all it does.  It has the same safety as synchronous
metadata mounts, but can operate within 6% of memory speed; in some
cases, this turns out to be better than pure async mounts, since it
tends to write gather operations which reverse themselves, such as the
creation of a file followed by its deletion, as you might see during a
compile.  Async would still hit the disk twice, whereas soft updates
would hit the disk zero times for the same set of operations.

Soft updates is tied heavily into graph theory, and the idea that FS
operations can be broken down into synchronization events.  A
synchronization event must be completed before the next
synchronization attempt is permitted.  In traditional systems, this
has been guaranteed by stalling all events until the synchronization
event has been completed; this is a "synchronous mount".

A later system, patented by USL, uses a technique called "DOW", or
"Delayed Ordered Writes".  This technique only stalls the pipeline on
metadata-related synchronization events; that is, one metadata
synchronization event must complete before the next is permitted, but
asynchronous events other than that are permitted.  It gains its speed
increase from delaying writes between synchronization events.  This
gives it the ability to effectively do implicit write gathering of
non-metadata writes, which can also be gathered with related metadata
writes at the stall point.  This method is superior to async and to
simple write gathering, since it does not violate NFS or POSIX
semantics on guarantees of things like timestamp updates with regard
to async data writes and updating the file modification time, etc.
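To make the write-gathering point concrete, here is a toy model of my
own (purely illustrative; the names and structure are not the soft
updates implementation): pending metadata operations sit in memory,
and a create that is later reversed by a delete is cancelled outright
instead of costing two disk writes.

```python
# Hypothetical sketch of write gathering.  A create followed by a
# delete of the same file cancels in memory, so syncing writes nothing
# to disk for that pair -- whereas a pure async mount would still have
# queued both writes.

class WriteGatherer:
    def __init__(self):
        self.pending = {}       # name -> pending metadata operation
        self.disk_writes = 0    # writes that actually reached the disk

    def create(self, name):
        # A bare create just becomes a pending metadata write.
        self.pending[name] = "create"

    def delete(self, name):
        if self.pending.get(name) == "create":
            # The two operations reverse each other: zero disk writes.
            del self.pending[name]
        else:
            self.pending[name] = "delete"

    def sync(self):
        # Flush whatever survived gathering; one disk write per entry.
        self.disk_writes += len(self.pending)
        self.pending.clear()

g = WriteGatherer()
g.create("a.o")
g.delete("a.o")     # compile-style create-then-delete
g.sync()
assert g.disk_writes == 0   # the pair never touched the disk
```

A file that is created and survives to the sync still costs its one
write; only the reversing pairs disappear.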
NB: ReiserFS uses this same technique in order to implement their
logging; I personally believe that this is an infringement of the USL
patent.

Soft updates maintains a dependency graph of metadata events; the
benefit of doing this is to ensure you can stall a write to metadata
to ensure proper ordering.  But unlike the DOW technique, because the
graph is fully known to the system, rather than implied by the stall,
the stall will only affect dependent metadata writes.  This means that
if I have two sets of operations going on, for example to create two
files in two different directories simultaneously, where DOW would
stall one operation until the other had completed, soft updates will
not result in a stall of the second operation.

A stall will still occur on a directory that has multiple operations
occurring simultaneously (or sequentially, such as a create followed
by a rename) in the same directory entry block, and the directory
modification timestamp update will also be serialized.  But on a
heavily loaded machine, each process will have what is called
"locality of reference", which is just a way of saying "most programs
operate on independent data sets, and so don't ever conflict with each
other on their operations".

A common misconception about soft updates is that you can get the same
failure recovery that you would get from journalling or logging.  In
theory, it looks like you could, since in the event of a power
failure, for example, the only thing that would be out of date is the
cylinder group bitmaps, and the way that they are out of date is by
having some blocks within the cylinder group marked as allocated, when
the metadata state at the time of the crash had not been committed.
With this true, you could scan the disk in the background, locking one
cylinder group at a time to clean the bitmap, and unlocking it when
you are done.
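A minimal sketch of the dependency-graph point (again my own
illustration, not the real soft updates structures): because the graph
is explicit, a write is only held back by the writes it actually
depends on, so two creates in independent directories never stall each
other.

```python
# Illustrative dependency graph for pending metadata writes.  Each
# node is a write; its dependency set lists the writes that must reach
# the disk first.  Node names like "dir1/inode" are invented for the
# example.

class DepGraph:
    def __init__(self):
        self.deps = {}          # write -> set of prerequisite writes

    def add(self, write, depends_on=()):
        self.deps[write] = set(depends_on)

    def flushable(self, flushed):
        # Writes whose prerequisites have all reached the disk.
        return {w for w, d in self.deps.items()
                if w not in flushed and d <= flushed}

g = DepGraph()
# Creating a file in dir1: the inode must be on disk before the
# directory entry that points at it.
g.add("dir1/inode")
g.add("dir1/entry", depends_on={"dir1/inode"})
# A simultaneous create in dir2 shares no edges with dir1's work.
g.add("dir2/inode")
g.add("dir2/entry", depends_on={"dir2/inode"})

# Both inodes are immediately flushable; neither directory stalls the
# other, where a DOW-style global stall would serialize them.
assert g.flushable(set()) == {"dir1/inode", "dir2/inode"}
assert "dir1/entry" in g.flushable({"dir1/inode"})
```

Two operations in the *same* directory entry block would share an edge
in this graph, which is exactly the serialization described above.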
Locality of reference means that you will probably stall some programs
for a tiny amount of time, if they are intent on doing I/O within the
cylinder group being fixed, but this is a far cry from waiting a long
time for 36G of disk to be scanned in detail.

The flaw with this theory is that a power failure is not the only type
of crash you could have, and running after any crash that can corrupt
any portion of the disk (e.g. most disks corrupt sectors if power is
lost during a write, and in the event of a kernel panic, you don't
know what data was corrupted in core, then erroneously written to disk
before the actual panic, etc.) puts you at risk of further disk
corruption and user space software failures.  In the worst case, the
crash was the result of a hardware failure of the disk subsystem
(disk, controller, cables, terminator, etc.).  So it is impossible to
recover without an exterior log of the events leading up to the crash
(this is how the WAFL file system from Network Appliance works: it
uses an NVRAM intention log).

Going further down the soft updates road, there's really no reason to
assume that the UFS and FFS pieces are the only things in the
dependency tree.  The shape of the dependency tree was "frozen" when
soft updates was coded, but this need not have been.  There is
actually no good reason that the dependency tree shape should be
static; indeed, the system only knows that it's traversing pointers;
how they got into the arrangement they are in, the system doesn't
care.  Neither does the argument hold that the graph would take more
memory than it currently takes; the shorthand structures in use in
soft updates today could remain the same: they describe inter-node
ordering relationships along an edge between nodes.  This means that,
should the approach be generalized, which would take a small amount of
work, it could work between stacking layers.
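A purely hypothetical sketch of that generalization (nothing here
exists in any real implementation; all names are invented): if the
graph treats nodes as opaque and only records ordering edges, a node
contributed by a quota stacking layer can be ordered against a UFS
write exactly the way two UFS writes are ordered against each other.

```python
# Hypothetical generalized dependency graph: nodes are opaque strings
# tagged with an invented "layer:object" naming scheme, and the graph
# neither knows nor cares which stacking layer registered a node.

class LayeredDepGraph:
    def __init__(self):
        self.deps = {}              # node -> set of prerequisite nodes

    def register(self, node, depends_on=()):
        self.deps.setdefault(node, set()).update(depends_on)
        for d in depends_on:
            self.deps.setdefault(d, set())

    def flush_order(self):
        # Naive topological sort: one valid order for pushing the
        # pending writes to disk while honoring every edge.
        done, order = set(), []
        while len(done) < len(self.deps):
            ready = [n for n, d in self.deps.items()
                     if n not in done and d <= done]
            if not ready:
                raise RuntimeError("dependency cycle")
            for n in sorted(ready):
                done.add(n)
                order.append(n)
        return order

g = LayeredDepGraph()
# A quota layer's record update is ordered after the UFS inode write
# beneath it, with no global inter-layer synchronization point.
g.register("ufs:inode")
g.register("quota:record", depends_on={"ufs:inode"})
g.register("ufs:direntry", depends_on={"ufs:inode"})
order = g.flush_order()
assert order.index("ufs:inode") < order.index("quota:record")
```

The same mechanism is what would let artificial dependencies express a
transaction, or let edges span a network layer for clustering.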
At mount time, a node-node dependency resolver could be registered
into the graph, at the same time the node relationships are
registered, by virtue of the mount.  This would let you do some
marvelous things, such as separating out the quota into a stacking
layer, without losing soft updates capability (the inter-layer
boundary is otherwise an implied synchronization point, which is
global in scope, turning the soft updates approach into the DOW
approach, for all intents and purposes).  Or you could use artificial
dependencies to export a transactioning interface to user space
database applications.  Or you could propagate dependency
relationships across a network connection layer, between machines, and
do FS clustering.  The possibilities are really huge.

I've talked with Yale, Kirk, and Greg about generalizing this before;
the thing that stopped me from doing it on my own was the license on
Kirk's code making it so that I might be unable to grant license to
the code without Kirk also granting license.  Now that that has
changed, I will probably put it on my projects list, after two or
three others near the top, since I think it's important to pursue this
approach: it opens so many avenues for additional research and
technological progress.

In any case, since you're familiar with XFS, you should be able to see
that metadata integrity is one aspect of XFS, and one aspect of soft
updates, but the technologies could in fact complement each other
tremendously: they are future partners, not competitors, since in the
majority of cases, what they bring to the table, other than metadata
integrity guarantees, is non-overlapping.  Indeed, the XFS metadata
integrity could probably be sped up considerably, if only through
benefit of the implicit write gathering of soft updates (something
that can't happen with XFS as it is).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message