From owner-freebsd-arch@FreeBSD.ORG Mon Mar 31 22:54:38 2008
Date: Mon, 31 Mar 2008 15:54:25 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: "Martin Fouts"
Cc: Christopher Arnold, arch@freebsd.org, qpadla@gmail.com, freebsd-arch@freebsd.org
Subject: RE: Flash disks and FFS layout heuristics
Message-Id: <200803312254.m2VMsPqZ029549@apollo.backplane.com>
References: <20080330231544.A96475@localhost> <200803310135.m2V1ZpiN018354@apollo.backplane.com> <200803312125.29325.qpadla@gmail.com> <200803311915.m2VJFSoR027593@apollo.backplane.com> <200803312006.m2VK6Aom028133@apollo.backplane.com>
List-Id: Discussion related to FreeBSD architecture

:> No matter what you do you have to index the information *SOMEWHERE*.
:
:And NAND devices have a *SOMEWHERE* that makes them different than other
:persistent storage devices in ways that make them interesting to do file
:systems for.
:
:It's not _that_ you have to scan the NAND, by the way, it's _when_ you
:scan the NAND that has the major impact on performance.

I know where you are coming from there.  The flash filesystem I did for
our telemetry product (20 years ago, and it is still in operation today)
uses a named-block translation scheme but simply builds the topology out
in main memory when the filesystem is mounted.  These are small flash
devices, two 1 MByte NOR chips if I remember right.  It just scans the
translation table, which is a simple linear array, and builds the
topology in RAM.  That takes maybe a few milliseconds at boot and costs
nothing afterwards.

Of course, that was for a small flash device, so I could get away with
it.  And it was NOR, so the translation table was all in one place and
could be trivially scanned and updated.
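
To make that mount-time scan concrete, here is a minimal sketch of the
idea in C.  It is not the actual telemetry filesystem's code: the table
size, the entry layout, and every name below are invented purely for
illustration.

    #include <stdint.h>
    #include <stdlib.h>

    #define TRANS_ENTRIES   1024            /* assumed table size (invented) */
    #define BLOCK_FREE      0xFFFFFFFFu     /* erased NOR reads back as all 1s */

    struct trans_entry {                    /* one table entry as stored in NOR */
        uint32_t name;                      /* logical (named) block number */
        uint32_t phys;                      /* physical block it currently lives in */
    };

    struct topo {                           /* in-memory topology built at mount */
        uint32_t phys_for_name[TRANS_ENTRIES];
    };

    /*
     * One linear pass over the translation table at mount time.  A later
     * entry for the same name overrides an earlier one, so the newest
     * translation wins.  After this, every named-block lookup is a plain
     * array index in RAM with no flash I/O at all.
     */
    static struct topo *
    mount_scan(const struct trans_entry *table, size_t n)
    {
        struct topo *t;
        size_t i;

        if ((t = malloc(sizeof(*t))) == NULL)
            return (NULL);
        for (i = 0; i < TRANS_ENTRIES; ++i)
            t->phys_for_name[i] = BLOCK_FREE;
        for (i = 0; i < n; ++i) {
            if (table[i].name < TRANS_ENTRIES && table[i].phys != BLOCK_FREE)
                t->phys_for_name[table[i].name] = table[i].phys;
        }
        return (t);
    }

The point is simply that one linear pass over a small NOR table is
cheap, and after it the topology lives entirely in RAM.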
I have a similar issue in HAMMER.  HAMMER is designed as a
multi-terabyte filesystem.  HAMMER isn't a flash filesystem, but it
effectively uses a naming mechanic to locate inodes and data, so the
problem is similar.  I was really worried about this mechanic as
compared to, say, UFS, where the absolute location of the on-disk inode
can be calculated directly from the inode number.  HAMMER has to look
the inode number up in the global B-Tree.  Even though it is a 15-way
B-Tree (meaning it is fairly shallow and has good locality of reference
in the buffer cache), I was worried enough about performance that I
implemented a B-Tree pointer cache in the in-memory inode structure.
So, e.g., if you look up a filename in a directory, the directory inode
caches a pointer into the B-Tree 'near' the directory inode element,
and another for the most recent inode number it looked up.  These
cached pointers then serve as heuristic starting points for the B-Tree
lookups that locate the file in the directory and the inode number.

Well, to shorten the description... the overhead of having to do the
lookup turned out not to matter at all once the cache was in place.
Even better, since an inode's data blocks (and other information) are
also indexed in the global B-Tree, the same cache also serves for
accesses into the file or directory itself.  Whatever overhead was
added by having to look up the inode was completely covered by the
savings of not having to run through a multi-level blockmap like FFS
does.  In many cases a B-Tree search is so well localized that it does
not even have to leave the node.  (p.s. when I talk about localization
here, I mean the in-memory disk cache, not seek localization.)

In any case, this is why I just don't worry about named-block
translations.  If one had a filesystem-layer blockmap AND named-block
translations it could get pretty nasty because of the updating
requirements.  But if the filesystem revolves entirely around
named-block translations and does not implement any blockmaps of its
own, the only thing that happens is that some overhead which used to be
in one part of the filesystem is now in another part instead, resulting
in a net zero.

HAMMER actually does implement a couple of blockmaps in addition to its
global B-Tree, but in HAMMER's case the blockmaps map *huge* physical
translations... 8MB per block.  They aren't named blocks like the
B-Tree entries, but rather a virtualized address space designed to
localize records, B-Tree nodes, large data blocks, and small data
blocks.  It's a different sort of blockmap than what one typically
hears described for a filesystem... really more of an allocation space.
I do this for several implementation reasons, most specifically because
HAMMER is designed for a hard disk and seek locality of reference is
important, but also so that information can be relocated in 8MB chunks,
making it possible to add and remove physical storage.

If I were reimplementing HAMMER as a flash filesystem (which I am NOT
doing), I would probably do away with the blockmap layer entirely,
since seek locality of reference is not needed on flash, and the global
B-Tree would serve directly as the named-block topology.  Kinda cool to
think about.

					-Matt
					Matthew Dillon
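
As an illustration of the pointer-cache idea described above, here is a
minimal, hypothetical sketch in C.  It is not HAMMER's actual
implementation; the node layout and all names are invented, and as a
simplification it caches the leaf the last lookup ended in (rather than
a pointer 'near' the element) and tries that leaf again before falling
back to a full descent from the root.

    #include <stddef.h>
    #include <stdint.h>

    #define BT_WAY  15                      /* 15-way nodes, as in the text */

    struct bt_node {
        int              nkeys;             /* number of keys in use */
        uint64_t         key[BT_WAY - 1];   /* e.g. inode numbers */
        struct bt_node  *child[BT_WAY];     /* all NULL in a leaf */
        uint64_t         lo, hi;            /* key range this node covers */
    };

    struct mem_inode {                      /* in-memory inode, heavily simplified */
        uint64_t         ino;
        struct bt_node  *bt_cache;          /* heuristic starting point */
    };

    /* Ordinary top-down descent from 'node' looking for 'key'. */
    static struct bt_node *
    bt_descend(struct bt_node *node, uint64_t key)
    {
        while (node != NULL && node->child[0] != NULL) {
            int i = 0;

            while (i < node->nkeys && key >= node->key[i])
                ++i;
            node = node->child[i];
        }
        return (node);
    }

    /*
     * Look up a key starting from the inode's cached node when its range
     * still covers the key; otherwise fall back to the root.  Either way,
     * remember where we ended up so the next lookup starts nearby.
     */
    static struct bt_node *
    bt_lookup_cached(struct bt_node *root, struct mem_inode *ip, uint64_t key)
    {
        struct bt_node *start = root;
        struct bt_node *leaf;

        if (ip->bt_cache != NULL &&
            key >= ip->bt_cache->lo && key < ip->bt_cache->hi)
            start = ip->bt_cache;
        leaf = bt_descend(start, key);
        if (leaf != NULL)
            ip->bt_cache = leaf;            /* seed the next lookup */
        return (leaf);
    }

When the cache is warm the "descent" usually never leaves the cached
node at all, which is the localization effect described above.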
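
Similarly, a rough sketch of the coarse 8MB blockmap idea, again
invented for illustration and not HAMMER's on-disk format: only
big-block granularity translations are kept, which is what keeps the
map tiny and makes relocating storage in 8MB chunks cheap.

    #include <stdint.h>

    #define BIGBLOCK_SIZE   (8ULL * 1024 * 1024)    /* 8MB per map entry */
    #define BIGBLOCK_MASK   (BIGBLOCK_SIZE - 1)
    #define MAP_ENTRIES     4096                    /* 32GB of virtual space (invented) */

    struct big_map {
        /* physical byte offset of each 8MB big block, 0 if unallocated */
        uint64_t phys_base[MAP_ENTRIES];
    };

    /*
     * Translate a virtualized offset to a physical offset.  Everything
     * inside one 8MB big block stays contiguous on the device, so
     * records, B-Tree nodes and data allocated near each other in the
     * virtual space also land near each other physically, and a whole
     * big block can be relocated by rewriting a single map entry.
     */
    static uint64_t
    blockmap_xlate(const struct big_map *map, uint64_t voff)
    {
        uint64_t idx = voff / BIGBLOCK_SIZE;

        if (idx >= MAP_ENTRIES || map->phys_base[idx] == 0)
            return (0);                     /* unmapped, caller must handle */
        return (map->phys_base[idx] + (voff & BIGBLOCK_MASK));
    }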