From owner-freebsd-arch@FreeBSD.ORG Sun Mar 30 20:16:56 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id D0B9E106564A for ; Sun, 30 Mar 2008 20:16:56 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id A0FC58FC25 for ; Sun, 30 Mar 2008 20:16:56 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2UKGuqg015128; Sun, 30 Mar 2008 13:16:56 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2UKGuZA015127; Sun, 30 Mar 2008 13:16:56 -0700 (PDT) Date: Sun, 30 Mar 2008 13:16:56 -0700 (PDT) From: Matthew Dillon Message-Id: <200803302016.m2UKGuZA015127@apollo.backplane.com> To: Kirk McKusick , arch@freebsd.org References: <200803292353.m2TNrCOW094875@chez.mckusick.com> Cc: Subject: Re: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 30 Mar 2008 20:16:57 -0000 :You should try running your experiment using ZFS. Because it is a :non-overwriting filesystem, it might work better with flash. : : Kirk McKusick I'm assuming ZFS still has to update indices and indirect blocks, though, which is the primary source for random updates in all filesystems. The right way to deal with flash is *NOT* to require that the filesystem be smart about flash storage, but instead to implement an intermediate storage layer which linearizes the writes to flash and removes all random erases from the critical path. This also causes erasures to be evenly spread out on the flash unit and *GREATLY* extends the life of the flash device (to the point where you can just treat it as a disk and not have to worry about wearing out cells). I wrote precisely that 20 years ago for the flash filesystem I built for our telemetry RTUs. Of course, 20 years ago flash devices were much smaller, only 1-4MB per chip. But the concept is sound and with proper design can be implemented for much larger devices. Basically the general idea is as follows: Break the flash into three pieces: Two sector translation tables and one bulk storage area. Whenever a modification is made that involves transitioning bits from 0->1 (1->0 doesn't need an erase cycle) instead of erasing the flash sector all you do is allocate a new flash sector, append an entry to the translation table, and write the data out to the new flash sector. The logical block is now renumbered. You cache (some or all of) the translation table in-memory for fast access. * Appends to the translation table only involve 1->0 transitions. You don't even have to zero-out the old translation but can use it for crash recovery purposes. Thus no erasures are needed until the table becomes full. * Any non-trivial overwrites append a new sector, again involving only 1->0 transitions and requiring no erasures. * When the translation table becomes full you repack it into the second translation table (which then becomes the primary table), and erase the previous table. You ping-pong the tables (that's why there are two). * Bulk space can be allocated linearly until the flash becomes full, then erased/repacked (you also switch to the alternate translation table when doing the repacking of the bulk space). This can be a little tricky but as long as you leave one erase-sector's worth of space available you can always repack the flash without any possibility of losing data. This latter operation is the most expensive but once some space is freed up it is possible to pack simultaniously with running new ops, or to repack continuously as a background operation when space is tight, as long as you don't get twisted up with a full translation table. The only 'hard' bit about this design is you need to come up with a translation table topology that works for large flash devices. My flash filesystem of long ago just used a linear array and cached the whole thing with a hash table in memory, so it didn't require a sophisticated topology on-flash. But for a large flash device you probably need something a bit more sophisticated that still does not involve erase cycles in the critical path. The critical point, however, is that the on-flash translation table does NOT need to be optimal because you can mirror or cache elements of it in-memory. In anycase, that's really the only acceptable way to do a flash filesystem and still be able to guarantee proper wear characteristics for the flash cells. -Matt Matthew Dillon