From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:20:08 2008 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EB06F1065673; Mon, 31 Mar 2008 22:20:08 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (apollo.backplane.com [216.240.41.2]) by mx1.freebsd.org (Postfix) with ESMTP id CA0508FC14; Mon, 31 Mar 2008 22:20:08 +0000 (UTC) (envelope-from dillon@apollo.backplane.com) Received: from apollo.backplane.com (localhost [127.0.0.1]) by apollo.backplane.com (8.14.1/8.14.1) with ESMTP id m2VMJlOU029241; Mon, 31 Mar 2008 15:19:47 -0700 (PDT) Received: (from dillon@localhost) by apollo.backplane.com (8.14.1/8.13.4/Submit) id m2VMJlkT029240; Mon, 31 Mar 2008 15:19:47 -0700 (PDT) Date: Mon, 31 Mar 2008 15:19:47 -0700 (PDT) From: Matthew Dillon Message-Id: <200803312219.m2VMJlkT029240@apollo.backplane.com> To: "Martin Fouts" References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> Cc: Christopher Arnold , arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org Subject: RE: Flash disks and FFS layout heuristics X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 31 Mar 2008 22:20:09 -0000 :True, but when you're working with a part that does ECC in HW, you're :stuck with the ECC it does. Well, I suppose if you can't get access to the original data *OR* the HW ECC code (to undo the broken correction) you would wind up with an uncorrectable double error (A failed ECC correction always makes things worse rather then better, particularly a 1 bit correct / 2 bit detect hamming code). If you DO have access to the original data or the HW ECC code you can undo the failed correction and then ignore the hardware ECC and do your own in the auxillary storage. I can see why people would want the hardware to do the data validation, it's a performance issue in many respects, but if the hardware is only doing a simple ECC and doesn't do a separate CRC it will do more harm then good and simply cannot be depended upon for anything. :... : (some stuff I reordered later one below so I can get this out of the way) :... :> For example, the simple addition of some front-end non-volatile :cache, :> such as a dime-cap-backed static ram, would have a very serious effect :> on any such filesystem design. : :Yes. However since the phone market makes such a change very unlikely, :because of cost pressures, it's not one we take into consideration. For flash storage systems competitive with hard drive storage, that is any flash storage device of significant size (e.g. 64GB-1TB or more), the incremental cost of adding non-volatile front-end cache ram is going to be in the $1-$3 range. If that vendor can then advertise a performance advantage over their competitors they can easily price the drive to match the incremental cost. Very easily. This is what happened to the HD market. The economics that drive front-end cache implementations for flash SSD are going to be the same economics that drive front-end cache implementations for hard drives. All hard drives these days have at least 8MB of sector cache (and many have 32MB or more), so the writing is on the wall. Any filesystem you design which does not take into account a front-end cache is going to be obsolete in probably less then 2 years (if not already). For the phone market? You mean small flash storage devices? Performance is almost irrelevant there and you certainly do not need a front-end cache, or much of one. A sector translation model for a small flash storage device (< 2G) is utterly trivial to implement but so is the log-append model. There is going to be a huge scaling difference between the two when you get into large amounts of storage. :Well, those of us who are shipping devices with flash parts in them have :a somewhat different view on that, which is why I've worked on three :NAND specific file systems in the last four years. Two of those are in :use in shipping devices, and are expected to be in use for five or more :years. Three in five years? Is that an illustration of my point with regards to flash filesystem design? Ok, that was a joke :-) But I don't think we can count small flash storage systems. Both models devolve into trivialities when you are managing small amounts of flash storage. :... :reference). : :You've talked yourself into pretty much the same mistake that led to :jffs2, which turned out to be a terrible idea. : :> DragonFly's HAMMER has pretty good write-locality of=20 :> reference but still does random updates for B-Tree indices and things :like=20 :> the mtime and atime fields. It also uses numerous blockmaps that :could=20 :> make direct use of a flash sector-mapping translation layer (1). It=20 :> might be adaptable. : :You are pretty much describing the data structures that have made jffs2 :such a poor performer. :... :Works OK for NOR. Has interesting problems, mainly with maintaining the :block number map reliabily in storage, when attempted on NAND. :... :The devil in the details of your naming scheme turns out to be managing :the name translation information within the NAND storage itself. This is :the source of significant performance problems in jffs2, for example, :and have a huge amount of code complexity in the commercial system I :work with. Again, I am not familiar with jffs2 but you are painting a very broad brush that is more then likely an issue specifically with the jffs2 design and not the concept of using named blocks in general. I understand where you are coming from. Regardless of the model you use you have to index the data somehow. What you are advocating is a filesystem which uses an absolute sector referencing scheme. Any change made to the filesystem requires a new page to essentially be appended to the flash storage. In order to properly index the information and maintain the filesystem topology you also have to recopy *ALL* pages containing references to the updated absolute sector in order to repoint them to the new absolute sector. The root of your filesystem winds up being the last page appended, in simple terms. While some modifications to this scheme are possible, it's pretty much the way you have to do things if you use that model. I really understand that model, and it has the advantage of simplicity but it also has some severe disadvantages when used as a general purpose filesystem (verses an embedded filesystem), not the least of which is that a single update can result in a chain reaction that requires considerably more write bandwidth, considerably more garbage collection, and some extra (but probably minor) wear of the flash. In contrast, if a filesystem is referencing named blocks and you have to move a block (either due to an error or a modification of that block through normal filesystem activity), NO changes need to be made to those elements of the filesystem that pointed to the block that got moved. All you have to do is append the new block that is renaming the old one, which includes the name (aka 64 bit quantity) in its auxillary data area, and cache the change in the translation in system memory until you decide to flush out the named block index (which I will describe a bit later on)... that's non critical information, by the way, and does not have to be synchronously in order to be crash recoverable. Write bandwidth is greatly reduced, particularly because when using a named block you only have to flush the actual modified page to the flash and nothing else other then a topological rollup record (which I will describe a bit later on). This works particularly well with a filesystem designed to use named blocks because there are *NO* indirect blockmaps to reference data or inodes in the filesystem. An absolute-sector-based filesystem has blockmaps, e.g. to locate a block in a file. In a named-block filesystem the blockmap *IS* the named block. That is, the 'name' of the named block is effectively the inode number and file block number combined into one 64 bit (or larger) key. Let me be clear about this distinction. In a filesystem that references absolute sectors an append to a file requires (typically) updating a blockmap of absolute sectors which in turns requires the blockmap block to be rewritten, along with any reference to it and so on and so forth up the chain. In a named-block filesystem appending to a file simply means writing out a new named block. The filesystem itself has NO concept of a blockmap... the blockmap is built into the sector translation layer. In other words, a filesystem using the named-block model is not any more complex then a filesystem using an absolute sector numbering model, and a filesystem using the named-block model is far easier from the perspective of caching changes in system memory without requiring a sync to flash for crash recovery. That is a huge deal. Now is there some work involved with making the named block translations efficient? Yes, there is some... but it is really not much more complex then the work involved in an absolute-sector-based filesystem which must index files, directories, and so forth within the filesystem itself. In particular, when using a named-block model you still have to occassionally flush out the translation topology to the flash media Since this topology references physical block numbers it, in fact, uses exactly the SAME mechanism that the absolute filesystem model used to maintain its topology. In other words, no more complex then the absolute filesystem model. The big difference is that the translation topology does not have to be written synchronously and the frequency of the rollup writes is based ENTIRELY on how much system memory you are willing to dedicate to caching topological changes. E.G. if you dedicate, say, 100KB of system memory you can store the topology for, say, 3200 filesystem updates (using 32 byte structures) before you have to 'flush' it to the flash. A filesystem based on absolute blocks pretty much has to cache the related (modified) blocks in memory which are far larger, and thus must flush them to storage far more frequently. But translations are tiny little records... 10's of gigabytes worth of updates can be cached in a small amount of system memory. The translation topology does NOT have to be synced to disk on fsync() because all the information can be recovered when the filesystem is mounted after a reboot. That is critical. Going back to the absolute filesystem model, such a filesystem does have the advantage of locality of reference (that is, not so much seek-wise which is irrelevant for flash but more from the point of view of being able to chain to the desired information). A filesystem using the named-block model must lookup the block, typically in a global index, which means it must maintain a cache of pointers into its translation tree. This is a little more expensive when looking up inodes but, actually, I use a very similar scheme in HAMMER (which has a global B-Tree for everything) and the caching required is so simple that it just becomes a NOP. Just storing a cached absolute sector number in the in-memory inode structure for use as a starting point when looking up elements related to that inode winds up being no less efficient then embedding a blockmap in an inode as you see in a more typical filesystem. In anycase, I hope this clarifies the issues. I really do understand where you are coming from, the simplicity of chaining the physical topology cannot be denied, and I like the elegance, but I hope I've shown that it is not actually simplifying the overall design much over a named block scheme, and that there are some fairly severe issues that can crop up that are complete non-issues when using a named block scheme. Long winded, I know. -Matt Matthew Dillon