From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:54:38 2008
Date: Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: "Martin Fouts"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
Message-Id: <200803312254.m2VMsPqZ029549@apollo.backplane.com>
References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com>
List-Id: Discussion related to FreeBSD architecture

:> No matter what you do you have to index the information *SOMEWHERE*.
:
:And NAND devices have a *SOMEWHERE* that makes them different than other
:persistent storage devices in ways that make them interesting to do file
:systems for.
:
:It's not _that_ you have to scan the NAND, by the way, it's _when_ you
:scan the NAND that has the major impact on performance.

I know where you are coming from there.  The flash filesystem I did for
our telemetry product (20 years ago, and it is still in operation today)
uses a named-block translation scheme but simply builds the topology out
in main memory when the filesystem is mounted.  These are small flash
devices, two 1 MByte NOR chips if I remember right.  It just scans the
translation table, which is a simple linear array, and builds the
topology in RAM.  That takes maybe a few milliseconds at boot and costs
nothing afterwards.

Of course, that was for a small flash device, so I could get away with
it.  And it was NOR, so the translation table was all in one place and
could be trivially scanned and updated.
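
To make that mount-time scan concrete, here is a minimal sketch of the
idea in C.  It is not the actual telemetry filesystem's code: the table
size, the entry layout, and every name below are invented purely for
illustration.

    #include <stdint.h>
    #include <stdlib.h>

    #define TRANS_ENTRIES   1024            /* assumed table size (invented) */
    #define BLOCK_FREE      0xFFFFFFFFu     /* erased NOR reads back as all 1s */

    struct trans_entry {                    /* one table entry as stored in NOR */
        uint32_t name;                      /* logical (named) block number */
        uint32_t phys;                      /* physical block it currently lives in */
    };

    struct topo {                           /* in-memory topology built at mount */
        uint32_t phys_for_name[TRANS_ENTRIES];
    };

    /*
     * One linear pass over the translation table at mount time.  A later
     * entry for the same name overrides an earlier one, so the newest
     * translation wins.  After this, every named-block lookup is a plain
     * array index in RAM with no flash I/O at all.
     */
    static struct topo *
    mount_scan(const struct trans_entry *table, size_t n)
    {
        struct topo *t;
        size_t i;

        if ((t = malloc(sizeof(*t))) == NULL)
            return (NULL);
        for (i = 0; i < TRANS_ENTRIES; ++i)
            t->phys_for_name[i] = BLOCK_FREE;
        for (i = 0; i < n; ++i) {
            if (table[i].name < TRANS_ENTRIES && table[i].phys != BLOCK_FREE)
                t->phys_for_name[table[i].name] = table[i].phys;
        }
        return (t);
    }

The point is simply that one linear pass over a small NOR table is
cheap, and after it the topology lives entirely in RAM.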
I have a similar issue in HAMMER.  HAMMER is designed as a
multi-terabyte filesystem.  HAMMER isn't a flash filesystem, but it
effectively uses a naming mechanic to locate inodes and data, so the
problem is similar.  I was really worried about this mechanic as
compared to, say, UFS, where the absolute location of the on-disk inode
can be calculated directly from the inode number.  HAMMER has to look
the inode number up in the global B-Tree.  Even though it is a 15-way
B-Tree (meaning it is fairly shallow and has good locality of reference
in the buffer cache), I was worried enough about performance that I
implemented a B-Tree pointer cache in the in-memory inode structure.
So, e.g., if you look up a filename in a directory, the directory inode
caches a pointer into the B-Tree 'near' the directory inode element,
and another for the most recent inode number it looked up.  These
cached pointers then serve as heuristic starting points for the B-Tree
lookups that locate the file in the directory and the inode number.

Well, to shorten the description... the overhead of having to do the
lookup turned out not to matter at all once the cache was in place.
Even better, since an inode's data blocks (and other information) are
also indexed in the global B-Tree, the same cache also serves for
accesses into the file or directory itself.  Whatever overhead was
added by having to look up the inode was completely covered by the
savings of not having to run through a multi-level blockmap like FFS
does.  In many cases a B-Tree search is so well localized that it does
not even have to leave the node.  (p.s. when I talk about localization
here, I mean the in-memory disk cache, not seek localization.)

In any case, this is why I just don't worry about named-block
translations.  If one had a filesystem-layer blockmap AND named-block
translations it could get pretty nasty because of the updating
requirements.  But if the filesystem revolves entirely around
named-block translations and does not implement any blockmaps of its
own, the only thing that happens is that some overhead which used to be
in one part of the filesystem is now in another part instead, resulting
in a net zero.

HAMMER actually does implement a couple of blockmaps in addition to its
global B-Tree, but in HAMMER's case the blockmaps map *huge* physical
translations... 8MB per block.  They aren't named blocks like the
B-Tree entries, but rather a virtualized address space designed to
localize records, B-Tree nodes, large data blocks, and small data
blocks.  It's a different sort of blockmap than what one typically
hears described for a filesystem... really more of an allocation space.
I do this for several implementation reasons, most specifically because
HAMMER is designed for a hard disk and seek locality of reference is
important, but also so that information can be relocated in 8MB chunks,
making it possible to add and remove physical storage.

If I were reimplementing HAMMER as a flash filesystem (which I am NOT
doing), I would probably do away with the blockmap layer entirely,
since seek locality of reference is not needed on flash, and the global
B-Tree would serve directly as the named-block topology.  Kinda cool to
think about.

					-Matt
					Matthew Dillon
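
As an illustration of the pointer-cache idea described above, here is a
minimal, hypothetical sketch in C.  It is not HAMMER's actual
implementation; the node layout and all names are invented, and as a
simplification it caches the leaf the last lookup ended in (rather than
a pointer 'near' the element) and tries that leaf again before falling
back to a full descent from the root.

    #include <stddef.h>
    #include <stdint.h>

    #define BT_WAY  15                      /* 15-way nodes, as in the text */

    struct bt_node {
        int              nkeys;             /* number of keys in use */
        uint64_t         key[BT_WAY - 1];   /* e.g. inode numbers */
        struct bt_node  *child[BT_WAY];     /* all NULL in a leaf */
        uint64_t         lo, hi;            /* key range this node covers */
    };

    struct mem_inode {                      /* in-memory inode, heavily simplified */
        uint64_t         ino;
        struct bt_node  *bt_cache;          /* heuristic starting point */
    };

    /* Ordinary top-down descent from 'node' looking for 'key'. */
    static struct bt_node *
    bt_descend(struct bt_node *node, uint64_t key)
    {
        while (node != NULL && node->child[0] != NULL) {
            int i = 0;

            while (i < node->nkeys && key >= node->key[i])
                ++i;
            node = node->child[i];
        }
        return (node);
    }

    /*
     * Look up a key starting from the inode's cached node when its range
     * still covers the key; otherwise fall back to the root.  Either way,
     * remember where we ended up so the next lookup starts nearby.
     */
    static struct bt_node *
    bt_lookup_cached(struct bt_node *root, struct mem_inode *ip, uint64_t key)
    {
        struct bt_node *start = root;
        struct bt_node *leaf;

        if (ip->bt_cache != NULL &&
            key >= ip->bt_cache->lo && key < ip->bt_cache->hi)
            start = ip->bt_cache;
        leaf = bt_descend(start, key);
        if (leaf != NULL)
            ip->bt_cache = leaf;            /* seed the next lookup */
        return (leaf);
    }

When the cache is warm the "descent" usually never leaves the cached
node at all, which is the localization effect described above.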
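
Similarly, a rough sketch of the coarse 8MB blockmap idea, again
invented for illustration and not HAMMER's on-disk format: only
big-block granularity translations are kept, which is what keeps the
map tiny and makes relocating storage in 8MB chunks cheap.

    #include <stdint.h>

    #define BIGBLOCK_SIZE   (8ULL * 1024 * 1024)    /* 8MB per map entry */
    #define BIGBLOCK_MASK   (BIGBLOCK_SIZE - 1)
    #define MAP_ENTRIES     4096                    /* 32GB of virtual space (invented) */

    struct big_map {
        /* physical byte offset of each 8MB big block, 0 if unallocated */
        uint64_t phys_base[MAP_ENTRIES];
    };

    /*
     * Translate a virtualized offset to a physical offset.  Everything
     * inside one 8MB big block stays contiguous on the device, so
     * records, B-Tree nodes and data allocated near each other in the
     * virtual space also land near each other physically, and a whole
     * big block can be relocated by rewriting a single map entry.
     */
    static uint64_t
    blockmap_xlate(const struct big_map *map, uint64_t voff)
    {
        uint64_t idx = voff / BIGBLOCK_SIZE;

        if (idx >= MAP_ENTRIES || map->phys_base[idx] == 0)
            return (0);                     /* unmapped, caller must handle */
        return (map->phys_base[idx] + (voff & BIGBLOCK_MASK));
    }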