From owner-freebsd-arch@FreeBSD.ORG  Tue Apr  1 05:30:00 2008
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 80DA91065674;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from mx.danger.com (wall.danger.com [216.220.212.140])
	by mx1.freebsd.org (Postfix) with ESMTP id 669398FC2D;
	Tue,  1 Apr 2008 05:30:00 +0000 (UTC)
	(envelope-from mfouts@danger.com)
Received: from danger.com (exchange3.danger.com [10.0.1.7])
	by mx.danger.com (Postfix) with ESMTP id F11E7414D25;
	Mon, 31 Mar 2008 22:27:41 -0700 (PDT)
X-MimeOLE: Produced By Microsoft Exchange V6.5
Content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Date: Mon, 31 Mar 2008 22:27:50 -0700
Message-ID: <B95CEC1093787C4DB3655EF330984818051D0F@EXCHANGE.danger.com>
In-Reply-To: <200803312219.m2VMJlkT029240@apollo.backplane.com>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Flash disks and FFS layout heuristics
Thread-Index: AciTfnsz5JkHfGFbRHmgUX7Qxg+vbgANl4/w
References: <20080330231544.A96475@localhost>
	<200803310135.m2V1ZpiN018354@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D03@EXCHANGE.danger.com>
	<200803312125.29325.qpadla@gmail.com>
	<200803311915.m2VJFSoR027593@apollo.backplane.com>
	<B95CEC1093787C4DB3655EF330984818051D09@EXCHANGE.danger.com>
	<200803312219.m2VMJlkT029240@apollo.backplane.com>
From: "Martin Fouts" <mfouts@danger.com>
To: "Matthew Dillon" <dillon@apollo.backplane.com>
Cc: Christopher Arnold <chris@arnold.se>, arch@freebsd.org, qpadla@gmail.com,
	freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 01 Apr 2008 05:30:00 -0000

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 3:20 PM
>
> For flash storage systems competitive with hard drive storage,

In embedded systems, it's RAM that flash storage competes with, not hard
drive storage.

SSD is a completely different engineering problem.

> For the phone market?  You mean small flash storage
> devices?  Performance is almost irrelevant there

Actually, we're very performance sensitive in this area, and getting
more so as audio and video demands grow.

> Three in five years?  Is that an illustration of my point
> with regards to flash filesystem design?  Ok, that was a joke :-)
>

It's illustrative of my changing career. Three different file systems
for three different products. ;)

> But I don't think we can count small flash storage systems.  Both
> models devolve into trivialities when you are managing small amounts
> of flash storage.

I don't know who your "we" is, but *my* "we" counts small flash storage
systems as rather critical.

And the 'trivialities' aren't so trivial when you have to maintain
reliability in the face of easily removable batteries.

> Again, I am not familiar with jffs2 but you are painting
> a very broad brush that is more then likely an issue specifically
> with the jffs2 design and not the concept of using named blocks in
> general.

That's the assumption that led from jffs1 to jffs2. It's an incorrect
assumption.

> What you are advocating is a filesystem which uses an
> absolute sector referencing scheme.

I haven't actually advocated anything. Merely pointed out problems.  But
no, the scheme that we're currently using doesn't use the sort of
absolute sector referencing scheme you're suggesting below.

> Any change made to the filesystem requires a new
> page to essentially be appended to the flash storage.  In order to
> properly index the information and maintain the
> filesystem topology  you also have to recopy *ALL* pages containing
> references to the updated absolute sector in order to repoint them
> to the new absolute sector.

Sorry, no.  Doesn't work like that at all. This is, after all, computer
science, and indirection is your friend.
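
The general point about indirection can be sketched as follows. This
is a hypothetical illustration only, not the scheme referred to above:
the class IndirectMap, its methods, and the page names are invented
for the example, and a real flash file system would also have to
persist the map itself.

```python
class IndirectMap:
    """Toy logical-to-physical page map over append-only storage.

    Hypothetical illustration: names and layout are assumptions, not any
    real file system's on-flash format.
    """

    def __init__(self):
        self.flash = []   # append-only list standing in for flash pages
        self.table = {}   # logical page id -> index of current copy in flash

    def write(self, logical_id, data):
        # An update appends one new physical page and repoints the map entry.
        self.flash.append(data)
        self.table[logical_id] = len(self.flash) - 1

    def read(self, logical_id):
        return self.flash[self.table[logical_id]]


m = IndirectMap()
m.write("inode7", {"blocks": ["data1"]})  # refers to data by logical id
m.write("data1", b"old contents")

pages_before = len(m.flash)
m.write("data1", b"new contents")         # update the data page only
pages_written = len(m.flash) - pages_before
```

Because the inode page refers to "data1" by logical id rather than by
physical location, the update writes one page and the inode page is
not recopied.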

> I really understand that model, and it has the advantage

I'm sure you do. It's not the one we're using though.

> I really do understand where you are coming from, the
> simplicity of chaining the physical topology cannot be denied,
> and I like the elegance, but I hope I've
> shown that it is not actually simplifying the overall design much
> over a named block scheme, and that there are some fairly severe
> issues that can crop up that are complete non-issues when
> using a named block scheme.

All you've really shown is that the difference between theory and
practice, as usual, remains larger in practice than in theory.

You have made it painfully clear that you are immersed in large scale
file systems, an area I left behind a decade ago when I abandoned my
work on CUE at HP Labs. It is a fascinating and difficult area, and I
heartily approve of experimentation in it. It also has almost no
engineering tradeoffs in common with persistent storage for battery
powered devices.

In summary, then: NAND devices are critical to CE products, especially
so-called convergent devices, in which there is no hard disk and
persistent storage takes the form of an embedded NAND device and zero or
more removable NAND devices.  Power issues are critical and performance
is becoming more so as the devices become more complex. Reliability of
the file systems on these devices is also critical.  The usual technique
for optimizing disk performance (throw more RAM at it in order to
cache) is unavailable, the usual hardware reasons for optimization (seek
and rotational latency) are not present, and the peculiarities of NAND,
most notably the size of the erase unit compared to the size of the
write unit, the existence of the spare area, and the much higher bit
error rates than either disk or RAM experience, coupled with those
requirements, lead to a need for NAND-specific file systems on such
devices.

Experience has shown that brute force approaches based on flash
translation layers work, but are inefficient and overly complex.
Attempts to use generalized NOR file systems on NAND tend to have
significant performance problems because of the cost of maintaining the
embedded data structures, such as b-trees, that replaced the more
straightforward data structures of earlier, more linear file system
designs.

Experience has also shown that the file system needs to expose
transaction semantics to the application, and that leaving bad block
handling to a translation layer (even a block naming scheme) leads to
performance problems consequent to garbage collection, which is
inevitable in devices that have such large erase units.
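
A rough sketch of why garbage collection gets expensive with large
erase units. This is a simplified model under stated assumptions: the
function name gc_cost and the figure of 64 pages per erase unit are
hypothetical, and bad blocks, wear leveling, and ECC are ignored.

```python
def gc_cost(pages_per_erase_unit, live_pages):
    """Cost of reclaiming one erase unit in a simplified model.

    Live pages must be copied elsewhere before the unit can be erased.
    Returns (pages_copied, pages_reclaimed, write_amplification), where
    write amplification is total pages written per page of new data that
    fits in the reclaimed space.
    """
    if not 0 <= live_pages < pages_per_erase_unit:
        raise ValueError("live_pages must be in [0, pages_per_erase_unit)")
    reclaimed = pages_per_erase_unit - live_pages
    write_amplification = pages_per_erase_unit / reclaimed
    return live_pages, reclaimed, write_amplification


# Assumed geometry: 64 write units (pages) per erase unit, 48 still live.
copied, reclaimed, amplification = gc_cost(64, 48)
```

In this model, reclaiming a unit that is three quarters live copies 48
pages to free 16, so each page of new data costs four page writes.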