From: "Martin Fouts"
To: "Matthew Dillon"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com,
    freebsd-arch@freebsd.org
Date: Mon, 31 Mar 2008 22:27:50 -0700
Subject: RE: Flash disks and FFS layout heuristics
In-Reply-To: <200803312219.m2VMJlkT029240@apollo.backplane.com>
References: <20080330231544.A96475@localhost>
    <200803310135.m2V1ZpiN018354@apollo.backplane.com>
    <200803312125.29325.qpadla@gmail.com>
    <200803311915.m2VJFSoR027593@apollo.backplane.com>
    <200803312219.m2VMJlkT029240@apollo.backplane.com>
List-Id: Discussion related to FreeBSD architecture

> -----Original Message-----
> From: Matthew Dillon [mailto:dillon@apollo.backplane.com]
> Sent: Monday, March 31, 2008 3:20 PM
>
> For flash storage systems competitive with hard drive storage,

In
embedded systems, it's RAM that flash storage competes with, not hard
drive storage. SSD is a completely different engineering problem.

> For the phone market? You mean small flash storage
> devices? Performance is almost irrelevant there

Actually, we're very performance sensitive in this area, and getting
more so as audio and video demands grow.

> Three in five years? Is that an illustration of my point
> with regards to flash filesystem design? Ok, that was a joke :-)
>

It's illustrative of my changing career. Three different file systems
for three different products. ;)

> But I don't think we can count small flash storage systems. Both models
> devolve into trivialities when you are managing small amounts of
> flash storage.

I don't know who your "we" is, but *my* "we" counts small flash storage
systems as rather critical. And the 'trivialities' aren't so trivial
when you have to maintain reliability in the face of easily removable
batteries.

> Again, I am not familiar with jffs2 but you are painting
> a very broad brush that is more then likely an issue specifically
> with the jffs2 design and not the concept of using named blocks in
> general.

That's the assumption that led from jffs1 to jffs2. It's an incorrect
assumption.

> What you are advocating is a filesystem which uses an
> absolute sector referencing scheme.

I haven't actually advocated anything. Merely pointed out problems. But
no, the scheme that we're currently using doesn't use the sort of
absolute sector referencing scheme you're suggesting below.

> Any change made to the filesystem requires a new
> page to essentially be appended to the flash storage. In order to
> properly index the information and maintain the
> filesystem topology you also have to recopy *ALL* pages containing
> references to the updated absolute sector in order to repoint them
> to the new absolute sector.

Sorry, no. Doesn't work like that at all.
This is, after all, computer science, and indirection is your friend.

> I really understand that model, and it has the advantage

I'm sure you do. It's not the one we're using though.

> I really do understand where you are coming from, the
> simplicity of chaining the physical topology cannot be denied,
> and I like the elegance, but I hope I've
> shown that it is not actually simplifying the overall design much
> over a named block scheme, and that there are some fairly severe
> issues that can crop up that are complete non-issues when
> using a named block scheme.

All you've really shown is that the difference between theory and
practice, as usual, remains larger in practice than in theory.

You have made it painfully clear that you are immersed in large scale
file systems, an area I left behind a decade ago when I abandoned my
work on CUE at HP Labs. It is a fascinating and difficult area, and I
heartily approve of experimentation in it. It also has almost no
engineering tradeoffs in common with persistent storage for battery
powered devices.

In summary, then:

NAND devices are critical to CE products, especially so-called
convergent devices, in which there is no hard disk and persistent
storage takes the form of an embedded NAND device and zero or more
removable NAND devices.

Power issues are critical and performance is becoming more so as the
devices become more complex. Reliability of the file systems on these
devices is also critical.

The usual technique of disk performance optimization (throw more RAM at
it in order to cache) is unavailable, the usual hardware reasons for
optimization (seek and rotational latency) are not present, and the
peculiarities of NAND, most notably the size of the erase unit compared
to the size of the write unit, the existence of the spare area, and the
much higher bit error rates than either disk or RAM experience, coupled
with those requirements, lead to a need for NAND-specific file systems
on such devices.
Experience has shown that brute force approaches based on flash
translation layers work, but are inefficient and overly complex.
Attempts to use generalized NOR file systems in NAND tend to have
significant performance problems because of the cost of maintaining the
embedded data structures, such as b-trees, that replaced the more
straightforward data structures of earlier, more linear file system
designs.

Experience has also shown that the file system needs to expose
transaction semantics to the application, and that leaving bad block
handling to a translation layer (even a block naming scheme) leads to
performance problems consequent to garbage collection, which is
inevitable in devices that have such large erase units.
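
The two mechanical points argued above, that indirection avoids
recopying every referrer when a page moves, and that an erase unit
larger than the write unit makes garbage collection unavoidable, can be
seen in a toy model. This is only a sketch under stated assumptions. It
is not the scheme in use here (which this message does not describe),
and every name and size in it (ToyNand, PAGES_PER_BLOCK, NUM_BLOCKS) is
hypothetical. It also ignores bad blocks, the spare area, bit errors and
power loss, and it buffers live pages in RAM during collection, which a
real design facing removable batteries could not do so casually.

```python
# Toy model only: all names and sizes are hypothetical illustrations.
PAGES_PER_BLOCK = 4   # write unit is one page; erase unit is one block
NUM_BLOCKS = 4


class ToyNand:
    def __init__(self):
        # Each physical page holds (logical_id, data), or None when erased.
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
        self.map = {}         # indirection: logical_id -> (block, page)
        self.erase_count = 0  # number of block erases performed
        self.copies = 0       # live pages rewritten by garbage collection

    def _free_slot(self):
        # A page is writable only if it is still erased (None).
        for b, block in enumerate(self.blocks):
            for p, cell in enumerate(block):
                if cell is None:
                    return b, p
        return None

    def _is_live(self, b, p):
        cell = self.blocks[b][p]
        return cell is not None and self.map.get(cell[0]) == (b, p)

    def _collect(self):
        # Choose the block with at least one stale page and fewest live pages.
        best = None
        for b, block in enumerate(self.blocks):
            live = [block[p] for p in range(PAGES_PER_BLOCK)
                    if self._is_live(b, p)]
            stale = sum(cell is not None for cell in block) - len(live)
            if stale > 0 and (best is None or len(live) < len(best[1])):
                best = (b, live)
        if best is None:
            raise RuntimeError("device full: nothing to reclaim")
        b, live = best
        # Erase the whole block, then rewrite only the pages still in use.
        self.blocks[b] = [None] * PAGES_PER_BLOCK
        self.erase_count += 1
        for p, (logical_id, data) in enumerate(live):
            self.blocks[b][p] = (logical_id, data)
            self.map[logical_id] = (b, p)
            self.copies += 1

    def write(self, logical_id, data):
        # Out-of-place update: program a fresh page, then repoint the map.
        # Nothing that refers to logical_id has to be recopied.
        slot = self._free_slot()
        if slot is None:
            self._collect()
            slot = self._free_slot()
        b, p = slot
        self.blocks[b][p] = (logical_id, data)
        self.map[logical_id] = (b, p)

    def read(self, logical_id):
        b, p = self.map[logical_id]
        return self.blocks[b][p][1]


if __name__ == "__main__":
    nand = ToyNand()
    for i in range(12):
        nand.write(i, f"v0-{i}")
    for n in range(1, 6):
        nand.write(0, f"v{n}")   # repeated updates to one logical page
    print(nand.read(0), nand.erase_count, nand.copies)
```

Repeated updates to a single logical page consume fresh physical pages
until none are left, at which point a whole block has to be erased and
its live pages rewritten. That copying is the garbage collection cost
referred to above, and where it lands depends heavily on which layer
owns the map.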