From owner-freebsd-fs  Sun Feb  1 13:37:01 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id NAA10531
	for freebsd-fs-outgoing; Sun, 1 Feb 1998 13:37:01 -0800 (PST)
	(envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA10514;
	Sun, 1 Feb 1998 13:36:58 -0800 (PST)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.8.8/8.8.8) id OAA25034;
	Sun, 1 Feb 1998 14:27:06 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp03.primenet.com,
	id smtpd025020; Sun Feb  1 14:27:03 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA29406;
	Sun, 1 Feb 1998 14:26:58 -0700 (MST)
From: Terry Lambert
Message-Id: <199802012126.OAA29406@usr09.primenet.com>
Subject: Re: Filesystem hacking
To: mikk0022@maroon.tc.umn.edu
Date: Sun, 1 Feb 1998 21:26:58 +0000 (GMT)
Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, fs@FreeBSD.ORG,
        hackers@FreeBSD.ORG
In-Reply-To: <199802011945.NAA18654@x115-105.reshalls.umn.edu> from "mikk0022@maroon.tc.umn.edu" at Feb 1, 98 01:45:58 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org
X-To-Unsubscribe: mail to majordomo@FreeBSD.org "unsubscribe freebsd-fs"

> OK, let's see if I get this -- a log-structured filesystem stores
> file data and metadata only in its log.  A journaling filesystem
> stores data, metadata, *and operations* in the log?

A log-structured filesystem logs writes.  The youngest write claiming
to contain some data wins.  The data and metadata are written
separately, but tagged as belonging together, for writes that span
more than one logging region.  Otherwise, they are written together.
The "valid" stamp is the last thing written to the log.

On reboot, the logs are scanned to "find out where everything is".
Newest date wins, so long as it has a "valid" stamp.  For multilog
transactions, the "valid" stamp is delayed for two or more writes.
Logged data does not map 1:1 to user data; for example, you may log
several otherwise unrelated writes in the same log.  Thus, if you
fail before a valid stamp, the transactions are rolled back.  It is
the only thing you can do.

A journalling filesystem journals what it is going to do, and that it
has done it.  If the power goes out after it has stated a full
intention, and it hasn't written that it has done it, all full
intentions can be rolled forward, not just back.

Because the transaction relationship is held in the journal, not in
the FS data or metadata, a journalling FS can expose a transactioning
interface to user space to allow a program to group intentions.

> >A Journalling FS also allows you to expose a transactioning interface
> >to allow you to group transactions.
>
> Is there a portable way to do this?

Yes.  But it requires a graph-based solution to the soft updates
problem; most solutions to the soft updates problem are hand-optimized
for how they resolve node-node conflicts, and hand-coded to the
problem they attempt to solve.  Kirk McKusick's port of the
Ganger/Patt code is an example of this.

To expose a transactioning mechanism, you must allow the specification
of soft dependencies *between* stacking layers.
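
Before going on: to make the "stated intention" idea above concrete,
here is a rough sketch of what paired intent/done journal records
might look like.  This is purely my own illustration; the struct and
the names don't come from any real JFS:

/*
 * Hypothetical journal record -- an illustration of "journal the
 * intention, then journal the completion", not any real on-disk
 * format.
 */
#include <stdint.h>

#define	JR_INTENT	1	/* "I am about to do X"		*/
#define	JR_DONE		2	/* "I have finished doing X"	*/

struct jrec {
	uint32_t jr_txid;	/* transaction the record belongs to	*/
	uint32_t jr_type;	/* JR_INTENT or JR_DONE			*/
	uint32_t jr_op;		/* opcode: delete, rename, ...		*/
	uint32_t jr_len;	/* length of the payload that follows	*/
	/* operation payload (blocks, names, etc.) follows */
};

/*
 * Recovery policy under this model:
 *
 *	INTENT present, DONE present:		nothing to do.
 *	INTENT present and payload intact,
 *	DONE missing:				roll the operation forward.
 *	INTENT incomplete:			discard it (roll back).
 *
 * A log-structured FS only gets the last choice, because it logs the
 * data itself rather than the intention; a journalling FS can roll
 * forward, because the intention survives the crash.
 */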
Effectively, allowing dependencies between layers means you run
Warshall's algorithm on the event/operation nodes that make up the
graph of relationships.

Example 1:

	I have an event called "delete file".  To delete the file, I
	must perform several operations:

	1)	Delete the blocks hung off the inode

	2)	Zero the inode

	3)	Remove the directory entry referring to the inode

These operations are "nodes" in a graph, and the meta-operation (or
event) is "delete a file"; it constitutes an edge of three nodes.
This means that there are two dependency vectors in the graph, and
that they have a relationship defined by the connecting node.

I'll spare you the details of the dependency resolution algorithm; you
can read Kirk McKusick's code from OpenBSD (not for commercial use
without license), or you can read the Ganger/Patt paper.

Now, to expose a transactioning system, you need a general algorithm
for defining dependencies.  Effectively, you stack a NULLFS layer that
adds a new VOP (or ioctl), "TRANSACT", which takes the arguments
"begin", "commit", and "abort".  What this means is that you must
expose the internals of your graph so that you can register your own
"dependency resolvers".

Example 2:

	I have a transaction that requires the deletion of two files.
	To do this, I must perform the same three operations above,
	but I must do it twice, and I must abort both operations if I
	can't complete one.  I must also be able to roll back/roll
	forward the operations if the system fails and comes back up.

How do I do this?  The transaction layer is not really NULL.  When
you begin a transaction, it uses a namespace-escaped file to journal
all subsequent VOPs, and to make subsequent requests *appear* to get
the new data.  It can do this several ways, but the most probable one
is to use renames and copies, because renames are atomic.  This will
allow it to support the transactions on any underlying FS.

Alternately, it can journal the transaction by hooking into the
dependency queueing mechanism to ensure that the transaction is
committed in vector-terminal order.

Example 3:

	A transaction consisting of two events, one containing two
	nodes, and one containing three: a log write followed by a
	delete.

The transaction system creates an imaginary edge between the terminal
node of the log write (the update of the inode size) and the terminal
node of the delete (the update of the directory entry block).  This
edge is a dynamic soft dependency; it is created on the fly, and it
will go away after the event (transaction) it represents has
completed.

It's pretty obvious that the dependency resolvers need to consider the
action nodes separately, and not as an "edge" unit -- in other words,
not like the Ganger/Patt code -- for this to be able to work.  This is
because the transaction layer can't resolve dependencies between
edges, only between nodes.

This damages some optimizations, mostly having to do with compression
of dependency enumerations (McKusick uses bitmaps for this;
technically, you could still do this, but at a node, not an edge,
granularity.  I *believe* that the losses would be minimal, but I
don't have a proof-of-concept implementation of both [yet] to be able
to benchmark the monolithic vs. the nodal implementations against
each other).

So yes, it's possible to do generically, if you think of FS's in terms
of events, operations, and dependencies between operations, and if
you have soft update technology, and if you implement it correctly.
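
For what it's worth, here is how the exposed interface might look
from user space under this model.  The TRANSACT ioctls below are
completely made up (nothing like them exists in the tree); they are
just a sketch of Example 2:

/*
 * Hypothetical user-space view of a "TRANSACT" interface exported by
 * a transactioning NULLFS-style layer.  The ioctl numbers, the mount
 * point, and the file names are all invented for illustration.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define	TRANSACT_BEGIN	_IOW('T', 1, int)	/* hypothetical */
#define	TRANSACT_COMMIT	_IOW('T', 2, int)	/* hypothetical */
#define	TRANSACT_ABORT	_IOW('T', 3, int)	/* hypothetical */

int
main(void)
{
	int fd, txid = 0;

	/* Mount point of the (hypothetical) transactioning layer. */
	fd = open("/mnt/txfs", O_RDONLY);
	if (fd == -1) {
		perror("open");
		return (1);
	}

	/* Begin: subsequent operations become one event group. */
	if (ioctl(fd, TRANSACT_BEGIN, &txid) == -1)
		goto bad;

	/* Two deletes that must succeed or fail as a unit. */
	if (unlink("/mnt/txfs/a") == -1 || unlink("/mnt/txfs/b") == -1) {
		(void)ioctl(fd, TRANSACT_ABORT, &txid);
		goto bad;
	}

	/* Commit: the layer adds the terminal-node dependency edge. */
	if (ioctl(fd, TRANSACT_COMMIT, &txid) == -1)
		goto bad;

	close(fd);
	return (0);
bad:
	perror("transact");
	close(fd);
	return (1);
}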
Probably, it's easier at this point to just write a JFS and not worry
about making everything work the way it should.  Going the other way
would probably be worth a PhD to someone looking for a Doctoral
project.  ;-).

[ ...automatic aggregation of volumes... ]

> >CCD can do this as well.
>
> How?  I thought CCD built the logical volumes from /etc/ccd.conf
> on bootup.  Thus, moving around disks would require editing
> /etc/ccd.conf on bootup.  XLV does this all automatically.
>
> Or has this been added in -current ?

No, but it is a trivial hack to add it.

Julian's "slice" and "devfs" code provide "volume arrival" events.
These events do not propagate as far as I want them to (an arriving
device with no claimers would be handed to the VFS for mount, in my
model), but you could easily put CCD in as a slice type handler.

When a device "arrives", you would read the first block and see if it
has a "CCDLABEL"; if it does, CCD swallows it.  The label contains:

1)	A magic number ("CCDLABEL" is as good as any) that means "I am
	a member of a volume set".

2)	A volume set identifier; probably just a unique-across-volumes
	32 bit identifier.

3)	A count of devices in the volume.

4)	An ID indicating which of the devices in #3 the current device
	represents.

When CCD gets N of N devices, it exports "/dev/ccd0".  (A rough sketch
of what such a label might look like is at the end of this message.)

It would probably be a good idea to have a "no more devices" event
propagated to CCD as well; this would let it recognize a failed
RAID-5 subassembly (for example), but it's not absolutely necessary.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
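
[ The sketch referred to above: a guess at what a self-describing CCD
  member label might look like.  The struct, the field names, and the
  layout are mine, for illustration only -- nothing like this is in
  the current CCD code. ]

/*
 * Hypothetical on-disk label for an auto-assembling CCD member; it
 * just encodes the four items listed above.  Not part of any
 * existing CCD implementation.
 */
#include <stdint.h>

#define	CCD_MAGIC	"CCDLABEL"	/* item 1: magic number */

struct ccd_autolabel {
	char	 cl_magic[8];	/* "CCDLABEL", no NUL terminator	*/
	uint32_t cl_volsetid;	/* item 2: volume set identifier	*/
	uint32_t cl_ndevs;	/* item 3: devices in the volume	*/
	uint32_t cl_devidx;	/* item 4: which member this device is	*/
};

/*
 * On a "volume arrival" event, the slice handler reads the first
 * block, compares cl_magic against CCD_MAGIC, and, on a match, adds
 * the device to the set named by cl_volsetid.  When it has collected
 * cl_ndevs members, it exports /dev/ccd0.
 */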