From: Matthew Dillon <dillon@apollo.backplane.com>
Date: Tue, 1 Apr 2008 11:07:42 -0700 (PDT)
To: Bakul Shah
Cc: Christopher Arnold, Martin Fouts, Alfred Perlstein,
    qpadla@gmail.com, arch@freebsd.org, Poul-Henning Kamp
Message-Id: <200804011807.m31I7g8I039974@apollo.backplane.com>
References: <20080401011306.2A4875B41@mail.bitblocks.com>
Subject: Re: Flash disks and FFS layout heuristics

:My instinct is to not combine transactions.  That is, every
:data write results in a sequence: {data, [indirect blocks],
:inode, ..., root block}.  Until the root block is written to
:the disk this is not a "committed" transaction and can be
:thrown away.  In a Log-FS we always append on write; we never
:overwrite any data/metadata, so this is easy and the FS state
:remains consistent.  FFS overwrites blocks, so all this gets
:far more complicated.  Sort of like the difference between
:reasoning about functional programs & imperative programs!
:
:Now, it may be possible to define certain rules that allow
:one to combine transactions.  For instance,
:
:    write1(block n), write2(block n) == write2(block n)
:    write(block n of file f1), delete file f1 == delete file f1
:
:etc.  That is, as long as write1 & the associated metadata writes
:are not flushed to the disk, and a later write (write2) comes
:along, the earlier write (write1) can be thrown away.  [But I
:have no idea if this is worth doing or even doable!]

    This is a somewhat different problem, and one that is actually
    fairly easy to solve in larger systems, because operating systems
    tend to want to cache everything.  What is really going on is that
    your operations (until you fsync()) are cached in system memory and
    are not immediately committed to the underlying storage.  Because
    of that, overwrites and deletions can simply destroy the related
    cache entities in system memory and never touch the disk.
    Ultimately you have to flush something to disk, and that is where
    the transactional atomicity and ordering issues start popping up.

:This is reminiscent of the bottom up rewrite system (BURS)
:used in some code generators (such as lcc's).  The idea is the
:same here: replace a sequence of operations with an
:equivalent but lower cost sequence.

    What it comes down to is: how expensive do you want your fsync()
    to be?  You can always commit everything down to the root block,
    and your recovery code can always throw everything away until it
    finds a good root block, which avoids the whole issue, but if you
    do things that way then fsync() becomes an extremely expensive call
    to make.  Certain applications, primarily database applications,
    really depend on having an efficient fsync().  Brute force is
    always simpler, but not necessarily desirable.
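    To put the brute force case in concrete terms, here is a toy
    sketch of the commit-to-the-root idea.  The names and the
    ring-of-roots layout are made up for illustration; this is not FFS
    or HAMMER code.  fsync() finishes by writing a root block carrying
    a sequence number and a checksum, strictly after everything under
    it is on the disk, and recovery simply takes the newest root that
    validates and throws everything else away:

    /*
     * Toy model of the brute force commit (made-up layout, not FFS or
     * HAMMER code).  The "disk" keeps a small ring of root blocks;
     * every commit writes a new root, ordered after all of the data,
     * and recovery takes the newest root whose checksum validates.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define NROOTS  8                  /* root blocks live in a ring */

    struct root_blk {
            uint64_t seq;              /* monotonic commit sequence */
            uint64_t top;              /* "pointer" to the fs tree */
            uint64_t cksum;            /* covers seq and top */
    };

    static struct root_blk root_ring[NROOTS]; /* stands in for media */
    static uint64_t next_seq = 1;

    static uint64_t
    root_cksum(const struct root_blk *r)
    {
            return (r->seq ^ (r->top * 0x9e3779b97f4a7c15ULL));
    }

    /*
     * fsync(): only after every dirty data and metadata block is on
     * the disk may the root be written.  This one write commits all
     * of them at once.
     */
    static void
    commit_root(uint64_t new_top)
    {
            struct root_blk r;

            r.seq = next_seq++;
            r.top = new_top;
            r.cksum = root_cksum(&r);
            root_ring[r.seq % NROOTS] = r;
    }

    /* Recovery: newest valid root wins, everything after it is lost. */
    static const struct root_blk *
    recover_root(void)
    {
            const struct root_blk *best = NULL;
            int i;

            for (i = 0; i < NROOTS; ++i) {
                    const struct root_blk *r = &root_ring[i];

                    if (r->seq != 0 && r->cksum == root_cksum(r) &&
                        (best == NULL || r->seq > best->seq))
                            best = r;
            }
            return (best);
    }

    int
    main(void)
    {
            const struct root_blk *r;

            commit_root(100);          /* two completed commits */
            commit_root(200);
            root_ring[next_seq % NROOTS].seq = next_seq; /* torn root */

            r = recover_root();        /* finds seq 2, top 200 */
            if (r != NULL)
                    printf("recovered root seq %llu, top %llu\n",
                        (unsigned long long)r->seq,
                        (unsigned long long)r->top);
            return (0);
    }

    The expense is visible right in the sketch: every fsync() has to
    push the entire dirty set before the root write can go out, which
    is exactly why database-style workloads want something cheaper.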
:...
:> is in the algorithm, but not the (fairly small) amount of time it
:> takes to actually perform the recovery operation.
:
:I don't understand the complexity.  Basically your log should
:allow you to define a functional programming abstraction,
:where you never overwrite any data/metadata for any active
:transactions, and so reasoning becomes easier.  [But maybe we
:should take any hammer discussion offline]

    The complexity is there because a filesystem is actually a
    multi-layer entity.  There is a storage topology which must be
    indexed in some manner, but there is also the implementation built
    on top of that storage topology, which has consistency requirements
    of its own.  For example, UFS stores inodes in specific places, has
    bitmaps for allocating data blocks, and has blockmaps to access
    data from its inodes.  But UFS also has to maintain the link count
    for a file, the st_size field in the inode, the directory entry in
    the directory, and so forth.

    Certain operations require multiple filesystem entities to be
    adjusted as one atomic operation.  For example, removing a file
    requires the link count in the inode to be decremented and the
    entry in the directory to be removed, together.

    Undo logs are very good at describing the low level entity,
    allowing you to undo changes in time order, but they need
    additional logic to recognize groups of transactions which must be
    recovered or thrown away as a single atomic entity, or which depend
    on each other.  One reason this is a big issue is that portions of
    those transactions can be committed to disk out of order.  The
    recovery code has to recognize that dependent pieces are missing
    even when other micro-transactions have been fully committed.

    Taking UFS as an example: UFS's fsck can clean up link counts and
    directory entries, but it has no concept of lost file data, so you
    can wind up with an inode specifying a 100K file which, after
    recovery, is actually full of zeros (or garbage) instead of the
    100K of data that was written to it.  That is an example of a low
    level recovery operation which is unable to take high level
    dependencies into account.

					-Matt
					Matthew Dillon
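    To make the grouping problem concrete, here is a rough sketch of a
    recovery scan over an undo log whose records carry transaction ids.
    The record layout is hypothetical; this is not UFS, softupdates, or
    HAMMER code.  A transaction whose commit record never reached the
    disk is rolled back as a unit, even when some of its undo records
    did make it out ahead of older transactions:

    /*
     * Sketch of the grouping logic (hypothetical record layout; not
     * UFS, softupdates, or HAMMER code).  Undo records carry a
     * transaction id, and a group only counts as committed once its
     * commit record is on the disk.
     */
    #include <stdint.h>
    #include <stdio.h>

    enum rec_type { REC_UNDO, REC_COMMIT };

    struct log_rec {
            enum rec_type type;
            uint32_t txid;             /* which atomic group */
            uint64_t blkno;            /* block the undo applies to */
    };

    #define MAXTX   64

    /* Pass 1: which transactions have a commit record on the disk? */
    static void
    scan_commits(const struct log_rec *log, int n, uint8_t *committed)
    {
            int i;

            for (i = 0; i < n; ++i)
                    if (log[i].type == REC_COMMIT)
                            committed[log[i].txid] = 1;
    }

    /*
     * Pass 2: roll back, newest first, every record belonging to a
     * group whose commit record never made it, as one atomic unit.
     */
    static void
    recover(const struct log_rec *log, int n)
    {
            uint8_t committed[MAXTX] = { 0 };
            int i;

            scan_commits(log, n, committed);
            for (i = n - 1; i >= 0; --i) {
                    const struct log_rec *r = &log[i];

                    if (r->type == REC_UNDO && !committed[r->txid])
                            printf("undo blk %llu (tx %u incomplete)\n",
                                (unsigned long long)r->blkno,
                                (unsigned)r->txid);
            }
    }

    int
    main(void)
    {
            /*
             * tx 7 committed; tx 9's undo records reached the disk
             * out of order, but its commit record did not.
             */
            struct log_rec log[] = {
                    { REC_UNDO,   7, 100 },
                    { REC_UNDO,   9, 205 },
                    { REC_COMMIT, 7, 0 },
                    { REC_UNDO,   9, 206 },
            };

            recover(log, (int)(sizeof(log) / sizeof(log[0])));
            return (0);
    }

    The two-pass structure (find the commit records, then undo every
    orphaned group newest-first) is the extra logic a plain
    time-ordered undo log does not give you for free, and it is the
    part fsck-style recovery of low level entities has no analogue for.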