From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Tue, 1 Apr 2008 11:07:42 -0700 (PDT)
To: Bakul Shah
Cc: Christopher Arnold, Martin Fouts, Alfred Perlstein,
    qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp
Message-Id: <200804011807.m31I7g8I039974@apollo.backplane.com>
References: <20080401011306.2A4875B41@mail.bitblocks.com>
Subject: Re: Flash disks and FFS layout heuristics

:My instinct is to not combine transactions.  That is, every
:data write results in a sequence: {data, [indirect blocks],
:inode, ..., root block}.  Until the root block is written to
:the disk this is not a "committed" transaction and can be
:thrown away.  In a Log-FS we always append on write; we never
:overwrite any data/metadata, so this is easy and the FS state
:remains consistent.  FFS overwrites blocks, so all this gets
:far more complicated.  Sort of like the difference between
:reasoning about functional programs & imperative programs!
:
:Now, it may be possible to define certain rules that allow
:one to combine transactions.  For instance,
:
:    write1(block n), write2(block n) == write2(block n)
:    write(block n of file f1), delete file f1 == delete file f1
:
:etc.  That is, as long as write1 & the associated metadata writes
:are not flushed to the disk, and a later write (write2) comes
:along, the earlier write (write1) can be thrown away.  [But I
:have no idea if this is worth doing or even doable!]

    This is a somewhat different problem, and one that is actually
    fairly easy to solve in larger systems, because operating systems
    tend to want to cache everything.  What is really going on is that
    your operations (until you fsync()) are cached in system memory and
    are not immediately committed to the underlying storage.  Because
    of that, overwrites and deletions can simply destroy the related
    cache entities in system memory and never touch the disk.
    Ultimately you have to flush something to disk, and that is where
    the transactional atomicity and ordering issues start popping up.

:This is reminiscent of the bottom up rewrite system (BURS)
:used in some code generators (such as lcc's).  The idea is the
:same here: replace a sequence of operations with an
:equivalent but lower cost sequence.

    What it comes down to is: how expensive do you want your fsync()
    to be?  You can always commit everything down to the root block,
    and your recovery code can always throw everything away until it
    finds a good root block, which avoids the whole issue, but if you
    do things that way then fsync() becomes an extremely expensive call
    to make.  Certain applications, primarily database applications,
    really depend on having an efficient fsync().  Brute force is
    always simpler, but not necessarily desirable.
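    To put the brute force case in concrete terms, here is a toy
    sketch of the commit-to-the-root idea.  The names and the
    ring-of-roots layout are made up for illustration; this is not FFS
    or HAMMER code.  fsync() finishes by writing a root block carrying
    a sequence number and a checksum, strictly after everything under
    it is on the disk, and recovery simply takes the newest root that
    validates and throws everything else away:

    /*
     * Toy model of the brute force commit (made-up layout, not FFS or
     * HAMMER code).  The "disk" keeps a small ring of root blocks;
     * every commit writes a new root, ordered after all of the data,
     * and recovery takes the newest root whose checksum validates.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NROOTS  8                  /* root blocks live in a ring */

    struct root_blk {
            uint64_t seq;              /* monotonic commit sequence */
            uint64_t top;              /* "pointer" to the fs tree */
            uint64_t cksum;            /* covers seq and top */
    };

    static struct root_blk root_ring[NROOTS]; /* stands in for media */
    static uint64_t next_seq = 1;

    static uint64_t
    root_cksum(const struct root_blk *r)
    {
            return (r->seq ^ (r->top * 0x9e3779b97f4a7c15ULL));
    }

    /*
     * fsync(): only after every dirty data and metadata block is on
     * the disk may the root be written.  This one write commits all
     * of them at once.
     */
    static void
    commit_root(uint64_t new_top)
    {
            struct root_blk r;

            r.seq = next_seq++;
            r.top = new_top;
            r.cksum = root_cksum(&r);
            root_ring[r.seq % NROOTS] = r;
    }

    /* Recovery: newest valid root wins, everything after it is lost. */
    static const struct root_blk *
    recover_root(void)
    {
            const struct root_blk *best = NULL;
            int i;

            for (i = 0; i < NROOTS; ++i) {
                    const struct root_blk *r = &root_ring[i];

                    if (r->seq != 0 && r->cksum == root_cksum(r) &&
                        (best == NULL || r->seq > best->seq))
                            best = r;
            }
            return (best);
    }

    int
    main(void)
    {
            const struct root_blk *r;

            commit_root(100);          /* two completed commits */
            commit_root(200);
            root_ring[next_seq % NROOTS].seq = next_seq; /* torn root */

            r = recover_root();        /* finds seq 2, top 200 */
            if (r != NULL)
                    printf("recovered root seq %llu, top %llu\n",
                        (unsigned long long)r->seq,
                        (unsigned long long)r->top);
            return (0);
    }

    The expense is visible right in the sketch: every fsync() has to
    push the entire dirty set before the root write can go out, which
    is exactly why database-style workloads want something cheaper.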
:...
:> is in the algorithm, but not the (fairly small) amount of time it
:> takes to actually perform the recovery operation.
:
:I don't understand the complexity.  Basically your log should
:allow you to define a functional programming abstraction,
:where you never overwrite any data/metadata for any active
:transactions, and so reasoning becomes easier.  [But maybe we
:should take any hammer discussion offline]

    The complexity is there because a filesystem is actually a
    multi-layer entity.  There is a storage topology which must be
    indexed in some manner, but there is also the implementation built
    on top of that storage topology, which has consistency requirements
    of its own.  For example, UFS stores inodes in specific places, has
    bitmaps for allocating data blocks, and has blockmaps to access
    data from its inodes.  But UFS also has to maintain the link count
    for a file, the st_size field in the inode, the directory entry in
    the directory, and so forth.

    Certain operations require multiple filesystem entities to be
    adjusted as one atomic operation.  For example, removing a file
    requires the link count in the inode to be decremented and the
    entry in the directory to be removed, together.

    Undo logs are very good at describing the low level entity,
    allowing you to undo changes in time order, but they need
    additional logic to recognize groups of transactions which must be
    recovered or thrown away as a single atomic entity, or which depend
    on each other.  One reason this is a big issue is that portions of
    those transactions can be committed to disk out of order.  The
    recovery code has to recognize that dependent pieces are missing
    even when other micro-transactions have been fully committed.

    Taking UFS as an example: UFS's fsck can clean up link counts and
    directory entries, but it has no concept of lost file data, so you
    can wind up with an inode specifying a 100K file which, after
    recovery, is actually full of zeros (or garbage) instead of the
    100K of data that was written to it.  That is an example of a low
    level recovery operation which is unable to take high level
    dependencies into account.

					-Matt
					Matthew Dillon
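    To make the grouping problem concrete, here is a rough sketch of a
    recovery scan over an undo log whose records carry transaction ids.
    The record layout is hypothetical; this is not UFS, softupdates, or
    HAMMER code.  A transaction whose commit record never reached the
    disk is rolled back as a unit, even when some of its undo records
    did make it out ahead of older transactions:

    /*
     * Sketch of the grouping logic (hypothetical record layout; not
     * UFS, softupdates, or HAMMER code).  Undo records carry a
     * transaction id, and a group only counts as committed once its
     * commit record is on the disk.
     */
    #include <stdint.h>
    #include <stdio.h>

    enum rec_type { REC_UNDO, REC_COMMIT };

    struct log_rec {
            enum rec_type type;
            uint32_t txid;             /* which atomic group */
            uint64_t blkno;            /* block the undo applies to */
    };

    #define MAXTX   64

    /* Pass 1: which transactions have a commit record on the disk? */
    static void
    scan_commits(const struct log_rec *log, int n, uint8_t *committed)
    {
            int i;

            for (i = 0; i < n; ++i)
                    if (log[i].type == REC_COMMIT)
                            committed[log[i].txid] = 1;
    }

    /*
     * Pass 2: roll back, newest first, every record belonging to a
     * group whose commit record never made it, as one atomic unit.
     */
    static void
    recover(const struct log_rec *log, int n)
    {
            uint8_t committed[MAXTX] = { 0 };
            int i;

            scan_commits(log, n, committed);
            for (i = n - 1; i >= 0; --i) {
                    const struct log_rec *r = &log[i];

                    if (r->type == REC_UNDO && !committed[r->txid])
                            printf("undo blk %llu (tx %u incomplete)\n",
                                (unsigned long long)r->blkno,
                                (unsigned)r->txid);
            }
    }

    int
    main(void)
    {
            /*
             * tx 7 committed; tx 9's undo records reached the disk
             * out of order, but its commit record did not.
             */
            struct log_rec log[] = {
                    { REC_UNDO,   7, 100 },
                    { REC_UNDO,   9, 205 },
                    { REC_COMMIT, 7, 0 },
                    { REC_UNDO,   9, 206 },
            };

            recover(log, (int)(sizeof(log) / sizeof(log[0])));
            return (0);
    }

    The two-pass structure (find the commit records, then undo every
    orphaned group newest-first) is the extra logic a plain
    time-ordered undo log does not give you for free, and it is the
    part fsck-style recovery of low level entities has no analogue for.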