From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 31 22:20:08 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EB06F1065673;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2])
	by mx1.freebsd.org (Postfix) with ESMTP id CA0508FC14;
	Mon, 31 Mar 2008 22:20:08 +0000 (UTC)
	(envelope-from dillon@apollo.backplane.com)
Received: from apollo.backplane.com (localhost [127.0.0.1])
	by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMJlOU029241;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Received: (from dillon@localhost)
	by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMJlkT029240;
	Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
Date: Mon, 31 Mar 2008 15:19:47 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200803312219.m2VMJlkT029240@apollo.backplane.com>
To: "Martin Fouts" <mfouts@danger.com>
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 31 Mar 2008 22:20:09 -0000

:True, but when you're working with a part that does ECC in HW, you're
:stuck with the ECC it does.
    
    Well, I suppose if you can't get access to the original data *OR* the
    HW ECC code (to undo the broken correction) you would wind up with an
    uncorrectable double error (A failed ECC correction always makes things
    worse rather then better, particularly a 1 bit correct / 2 bit detect
    hamming code).  If you DO have access to the original data or the
    HW ECC code you can undo the failed correction and then ignore the
    hardware ECC and do your own in the auxillary storage.

    I can see why people would want the hardware to do the data validation,
    it's a performance issue in many respects, but if the hardware is only
    doing a simple ECC and doesn't do a separate CRC it will do more harm
    then good and simply cannot be depended upon for anything.

:...
: (some stuff I reordered later one below so I can get this out of the way)
:...
:>     For example, the simple addition of some front-end non-volatile
:cache,
:> such as a dime-cap-backed static ram, would have a very serious effect
:> on any such filesystem design.
:
:Yes.  However since the phone market makes such a change very unlikely,
:because of cost pressures, it's not one we take into consideration.

    For flash storage systems competitive with hard drive storage, 
    that is any flash storage device of significant size (e.g. 64GB-1TB or
    more), the incremental cost of adding non-volatile front-end cache ram
    is going to be in the $1-$3 range.  If that vendor can then advertise
    a performance advantage over their competitors they can easily price
    the drive to match the incremental cost.  Very easily.  This is what
    happened to the HD market.

    The economics that drive front-end cache implementations for flash SSD
    are going to be the same economics that drive front-end cache
    implementations for hard drives.  All hard drives these days have at
    least 8MB of sector cache (and many have 32MB or more), so the writing
    is on the wall.  Any filesystem you design which does not take into
    account a front-end cache is going to be obsolete in probably less then
    2 years (if not already).

    For the phone market?  You mean small flash storage devices?  Performance
    is almost irrelevant there and you certainly do not need a front-end
    cache, or much of one.  A sector translation model for a small flash
    storage device (< 2G) is utterly trivial to implement but so is the
    log-append model.  There is going to be a huge scaling difference
    between the two when you get into large amounts of storage.

:Well, those of us who are shipping devices with flash parts in them have
:a somewhat different view on that, which is why I've worked on three
:NAND specific file systems in the last four years. Two of those are in
:use in shipping devices, and are expected to be in use for five or more
:years.

    Three in five years?  Is that an illustration of my point with regards
    to flash filesystem design?  Ok, that was a joke :-)

    But I don't think we can count small flash storage systems.  Both models
    devolve into trivialities when you are managing small amounts of
    flash storage.

:...
:reference).
:
:You've talked yourself into pretty much the same mistake that led to
:jffs2, which turned out to be a terrible idea.
:
:>     DragonFly's HAMMER has pretty good write-locality of=20
:> reference but still does random updates for B-Tree indices and things
:like=20
:> the mtime and atime fields.  It also uses numerous blockmaps that
:could=20
:> make direct use of a flash sector-mapping translation layer (1).  It=20
:> might be adaptable.
:
:You are pretty much describing the data structures that have made jffs2
:such a poor performer.
:...
:Works OK for NOR. Has interesting problems, mainly with maintaining the
:block number map reliabily in storage, when attempted on NAND.
:...
:The devil in the details of your naming scheme turns out to be managing
:the name translation information within the NAND storage itself. This is
:the source of significant performance problems in jffs2, for example,
:and have a huge amount of code complexity in the commercial system I
:work with.

    Again, I am not familiar with jffs2 but you are painting a very broad
    brush that is more then likely an issue specifically with the jffs2
    design and not the concept of using named blocks in general.

    I understand where you are coming from.  Regardless of the model you
    use you have to index the data somehow.

    What you are advocating is a filesystem which uses an absolute sector
    referencing scheme.  Any change made to the filesystem requires a new
    page to essentially be appended to the flash storage.  In order to
    properly index the information and maintain the filesystem topology
    you also have to recopy *ALL* pages containing references to the
    updated absolute sector in order to repoint them to the new
    absolute sector.  The root of your filesystem winds up being the last
    page appended, in simple terms.

    While some modifications to this scheme are possible, it's pretty much
    the way you have to do things if you use that model.

    I really understand that model, and it has the advantage of simplicity
    but it also has some severe disadvantages when used as a general
    purpose filesystem (verses an embedded filesystem), not the least of
    which is that a single update can result in a chain reaction that
    requires considerably more write bandwidth, considerably more garbage
    collection, and some extra (but probably minor) wear of the flash.

    In contrast, if a filesystem is referencing named blocks and you have
    to move a block (either due to an error or a modification of that
    block through normal filesystem activity), NO changes need to be made
    to those elements of the filesystem that pointed to the block that got
    moved.  All you have to do is append the new block that is renaming the
    old one, which includes the name (aka 64 bit quantity) in its auxillary
    data area, and cache the change in the translation in system memory
    until you decide to flush out the named block index (which I will
    describe a bit later on)... that's non critical information, by the
    way, and does not have to be synchronously in order to be crash
    recoverable.  Write bandwidth is greatly reduced, particularly
    because when using a named block you only have to flush the actual
    modified page to the flash and nothing else other then a topological
    rollup record (which I will describe a bit later on).

    This works particularly well with a filesystem designed to use named
    blocks because there are *NO* indirect blockmaps to reference data or
    inodes in the filesystem.  An absolute-sector-based filesystem has
    blockmaps, e.g. to locate a block in a file.  In a named-block
    filesystem the blockmap *IS* the named block.  That is, the 'name' of
    the named block is effectively the inode number and file block number
    combined into one 64 bit (or larger) key.

    Let me be clear about this distinction.  In a filesystem that references
    absolute sectors an append to a file requires (typically) updating a
    blockmap of absolute sectors which in turns requires the blockmap block
    to be rewritten, along with any reference to it and so on and so forth
    up the chain.   In a named-block filesystem appending to a file simply
    means writing out a new named block.  The filesystem itself has NO
    concept of a blockmap... the blockmap is built into the sector translation
    layer.

    In other words, a filesystem using the named-block model is not any
    more complex then a filesystem using an absolute sector numbering model,
    and a filesystem using the named-block model is far easier from the
    perspective of caching changes in system memory without requiring a
    sync to flash for crash recovery.  That is a huge deal.

    Now is there some work involved with making the named block translations
    efficient?  Yes, there is some... but it is really not much more complex
    then the work involved in an absolute-sector-based filesystem which
    must index files, directories, and so forth within the filesystem
    itself.

    In particular, when using a named-block model you still have to
    occassionally flush out the translation topology to the flash media
    Since this topology references physical block numbers it, in fact, uses
    exactly the SAME mechanism that the absolute filesystem model used
    to maintain its topology.  In other words, no more complex then the
    absolute filesystem model.  

    The big difference is that the translation topology does not have
    to be written synchronously and the frequency of the rollup writes
    is based ENTIRELY on how much system memory you are willing to 
    dedicate to caching topological changes.  E.G. if you dedicate, say, 100KB
    of system memory you can store the topology for, say, 3200 filesystem
    updates (using 32 byte structures) before you have to 'flush' it to
    the flash.  A filesystem based on absolute blocks pretty much has to
    cache the related (modified) blocks in memory which are far larger, and
    thus must flush them to storage far more frequently.  But translations
    are tiny little records...  10's of gigabytes worth of updates can be
    cached in a small amount of system memory.  The translation topology
    does NOT have to be synced to disk on fsync() because all the
    information can be recovered when the filesystem is mounted after
    a reboot.  That is critical.

    Going back to the absolute filesystem model, such a filesystem does have
    the advantage of locality of reference (that is, not so much seek-wise
    which is irrelevant for flash but more from the point of view of being
    able to chain to the desired information).  A filesystem using the
    named-block model must lookup the block, typically in a global index,
    which means it must maintain a cache of pointers into its translation
    tree.  This is a little more expensive when looking up inodes but,
    actually, I use a very similar scheme in HAMMER (which has a global
    B-Tree for everything) and the caching required is so simple that it just
    becomes a NOP.  Just storing a cached absolute sector number in the
    in-memory inode structure for use as a starting point when looking up
    elements related to that inode winds up being no less efficient then
    embedding a blockmap in an inode as you see in a more typical
    filesystem.


    In anycase, I hope this clarifies the issues.  I really do understand
    where you are coming from, the simplicity of chaining the physical
    topology cannot be denied, and I like the elegance, but I hope I've
    shown that it is not actually simplifying the overall design much
    over a named block scheme, and that there are some fairly severe
    issues that can crop up that are complete non-issues when using a named
    block scheme.

    Long winded, I know.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>