From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 01:13:07 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8CE9B1065672;
	Tue,  1 Apr 2008 01:13:07 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from mail.bitblocks.com (bitblocks.com [64.142.15.60])
	by mx1.freebsd.org (Postfix) with ESMTP id 22F4E8FC12;
	Tue,  1 Apr 2008 01:13:07 +0000 (UTC)
	(envelope-from bakul@bitblocks.com)
Received: from bitblocks.com (localhost.bitblocks.com [127.0.0.1])
	by mail.bitblocks.com (Postfix) with ESMTP id 2A4875B41;
	Mon, 31 Mar 2008 18:13:06 -0700 (PDT)
To: Matthew Dillon <dillon@apollo.backplane.com>
In-reply-to: Your message of "Mon, 31 Mar 2008 16:55:07 PDT."
	<200803312355.m2VNt7ZY030308@apollo.backplane.com> 
Date: Mon, 31 Mar 2008 18:13:05 -0700
From: Bakul Shah <bakul@bitblocks.com>
Message-Id: <20080401011306.2A4875B41@mail.bitblocks.com>
Cc: Christopher Arnold <chris@arnold.se>, Martin Fouts <mfouts@danger.com>,
	Alfred Perlstein <alfred@freebsd.org>, qpadla@gmail.com,
	arch@freebsd.org, Poul-Henning Kamp <phk@phk.freebsd.dk>
Subject: Re: Flash disks and FFS layout heuristics 
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 01:13:07 -0000

On Mon, 31 Mar 2008 16:55:07 PDT Matthew Dillon <dillon@apollo.backplane.com>  wrote:
>     There is one issue here and that is the transactional nature of most
>     filesystem operations.  For example, if you append to a file with
>     a write() you are not only writing new data to the backing store, you
>     are also updating the on-disk inode's st_size field.  Those are two
>     widely separated pieces of information which must be transactionally
>     bound --- all or nothing.  In this regard the crash recovery code needs
>     to understand that they are bound and either recover the whole
>     transaction or none of it. 
> 
>     Once you get to that point you also have to worry about
>     interdependancies between transactions... a really sticky issue
>     that is the reason for a huge chunk of softupdate's complexity in
>     UFS.  Basically you can wind up with interdependant transactions
>     which must be properly recovered.  An example would be doing a write()
>     and then doing another write().  The second write() cannot be recovered
>     unless the first can also be recovered.  Separate transactions, but with
>     a dependancy.

My instinct is to not combine transactions.  That is, every
data write results in a sequence: {data, [indirect blocks],
inode, ..., root block}.  Until the root block is written to
the disk this is not a "commited" transaction and can be
thrown away.  In a Log-FS we always append on write; we never
overwrite any data/metadata so this is easy and the FS state
remains consistent.  FFS overwrites blocks so all this gets
far more complicated.  Sort of like the difference between
reasoning about functional programs & imperative programs!

Now, it may be possible to define certain rules that allows
one to combine transactions.  For instance,

    write1(block n), write2(block n) == write2(block n)
    write(block n of file f1), delete file f1 == delete file f1

etc. That is, as long as write1 & associated metadata writes
are not flushed to the disk, and a later write (write2) comes
along, the earlier write (write1) can be thrown away.  [But I
have no idea if this is worth doing or even doable!]

This is reminiscent of the bottom up rewrite system (BURS)
used in some code generators (such as lcc's). The idea is the
same here: replace a sequence of operations with an
equivalent but lower cost sequence.

>     Such interdependancies can become arbitrarily complex the longer
>     you leave meta-data changes unflushed.  The question ultimately becomes
>     whether the recovery code can deal with the complexity or not.  If not
>     you may be forced to flush rather then create the interdependancy.
> 
>     HAMMER has precisely this issue with it's UNDO sequencing.  The complexity
>     is in the algorith, but not the (fairly small) amount of time it takes
>     to actually perform the recovery operation.

I don't understand the complexity. Basically your log should
allow you to define a functional programming abstraction --
where you never overwrite any data/metadata for any active
transactions and so reasoning becomes easier. [But may be we
should take any hammer discussion offline]