From owner-freebsd-fs  Sun Feb  1 13:37:01 1998
Return-Path:
Received: (from majordom@localhost)
	by hub.freebsd.org (8.8.8/8.8.8) id NAA10531
	for freebsd-fs-outgoing; Sun, 1 Feb 1998 13:37:01 -0800 (PST)
	(envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA10514;
	Sun, 1 Feb 1998 13:36:58 -0800 (PST)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.8.8/8.8.8) id OAA25034;
	Sun, 1 Feb 1998 14:27:06 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by smtp03.primenet.com,
	id smtpd025020; Sun Feb  1 14:27:03 1998
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA29406;
	Sun, 1 Feb 1998 14:26:58 -0700 (MST)
From: Terry Lambert
Message-Id: <199802012126.OAA29406@usr09.primenet.com>
Subject: Re: Filesystem hacking
To: mikk0022@maroon.tc.umn.edu
Date: Sun, 1 Feb 1998 21:26:58 +0000 (GMT)
Cc: tlambert@primenet.com, Matthew.Alton@anheuser-busch.com, fs@FreeBSD.ORG,
        hackers@FreeBSD.ORG
In-Reply-To: <199802011945.NAA18654@x115-105.reshalls.umn.edu> from "mikk0022@maroon.tc.umn.edu" at Feb 1, 98 01:45:58 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org
X-To-Unsubscribe: mail to majordomo@FreeBSD.org "unsubscribe freebsd-fs"

> OK, let's see if I get this -- a log-structured filesystem stores
> file data and metadata only in its log.  A journaling filesystem
> stores data, metadata, *and operations* in the log?

A log-structured filesystem logs writes.  The youngest write claiming
to contain some data wins.  The data and metadata are written
separately, but tagged as belonging together, for writes that span
more than one logging region.  Otherwise, they are written together.
The "valid" stamp is the last thing written to the log.

On reboot, the logs are scanned to "find out where everything is".
Newest date wins, so long as it has a "valid" stamp.  For multilog
transactions, the "valid" stamp is delayed for two or more writes.
Logged data does not map 1:1 to user data; for example, you may log
several otherwise unrelated writes in the same log.  Thus, if you
fail before a valid stamp, the transactions are rolled back.  It is
the only thing you can do.

A journalling filesystem journals what it is going to do, and that it
has done it.  If the power goes out after it has stated a full
intention, and it hasn't written that it has done it, all full
intentions can be rolled forward, not just back.

Because the transaction relationship is held in the journal, not in
the FS data or metadata, a journalling FS can expose a transactioning
interface to user space to allow a program to group intentions.

> >A Journalling FS also allows you to expose a transactioning interface
> >to allow you to group transactions.
>
> Is there a portable way to do this?

Yes.  But it requires a graph-based solution to the soft updates
problem; most solutions to the soft updates problem are hand-optimized
for how they resolve node-node conflicts, and hand-coded to the
problem they attempt to solve.  Kirk McKusick's port of the
Ganger/Patt code is an example of this.

To expose a transactioning mechanism, you must allow the specification
of soft dependencies *between* stacking layers.
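
Before going on: to make the "stated intention" idea above concrete,
here is a rough sketch of what paired intent/done journal records
might look like.  This is purely my own illustration; the struct and
the names don't come from any real JFS:

/*
 * Hypothetical journal record -- an illustration of "journal the
 * intention, then journal the completion", not any real on-disk
 * format.
 */
#include <stdint.h>

#define	JR_INTENT	1	/* "I am about to do X"		*/
#define	JR_DONE		2	/* "I have finished doing X"	*/

struct jrec {
	uint32_t jr_txid;	/* transaction the record belongs to	*/
	uint32_t jr_type;	/* JR_INTENT or JR_DONE			*/
	uint32_t jr_op;		/* opcode: delete, rename, ...		*/
	uint32_t jr_len;	/* length of the payload that follows	*/
	/* operation payload (blocks, names, etc.) follows */
};

/*
 * Recovery policy under this model:
 *
 *	INTENT present, DONE present:		nothing to do.
 *	INTENT present and payload intact,
 *	DONE missing:				roll the operation forward.
 *	INTENT incomplete:			discard it (roll back).
 *
 * A log-structured FS only gets the last choice, because it logs the
 * data itself rather than the intention; a journalling FS can roll
 * forward, because the intention survives the crash.
 */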
Effectively, allowing dependencies between layers means you run
Warshall's algorithm on the event/operation nodes that make up the
graph of relationships.

Example 1:

	I have an event called "delete file".  To delete the file, I
	must perform several operations:

	1)	Delete the blocks hung off the inode

	2)	Zero the inode

	3)	Remove the directory entry referring to the inode

These operations are "nodes" in a graph, and the meta-operation (or
event) is "delete a file"; it constitutes an edge of three nodes.
This means that there are two dependency vectors in the graph, and
that they have a relationship defined by the connecting node.

I'll spare you the details of the dependency resolution algorithm; you
can read Kirk McKusick's code from OpenBSD (not for commercial use
without license), or you can read the Ganger/Patt paper.

Now, to expose a transactioning system, you need a general algorithm
for defining dependencies.  Effectively, you stack a NULLFS layer that
adds a new VOP (or ioctl), "TRANSACT", which takes the arguments
"begin", "commit", and "abort".  What this means is that you must
expose the internals of your graph so that you can register your own
"dependency resolvers".

Example 2:

	I have a transaction that requires the deletion of two files.
	To do this, I must perform the same three operations above,
	but I must do it twice, and I must abort both operations if I
	can't complete one.  I must also be able to roll back/roll
	forward the operations if the system fails and comes back up.

How do I do this?  The transaction layer is not really NULL.  When
you begin a transaction, it uses a namespace-escaped file to journal
all subsequent VOPs, and to make subsequent requests *appear* to get
the new data.  It can do this several ways, but the most probable one
is to use renames and copies, because renames are atomic.  This will
allow it to support the transactions on any underlying FS.

Alternately, it can journal the transaction by hooking into the
dependency queueing mechanism to ensure that the transaction is
committed in vector-terminal order.

Example 3:

	A transaction consisting of two events, one containing two
	nodes, and one containing three: a log write followed by a
	delete.

The transaction system creates an imaginary edge between the terminal
node of the log write (the update of the inode size) and the terminal
node of the delete (the update of the directory entry block).  This
edge is a dynamic soft dependency; it is created on the fly, and it
will go away after the event (transaction) it represents has
completed.

It's pretty obvious that the dependency resolvers need to consider the
action nodes separately, and not as an "edge" unit -- in other words,
not like the Ganger/Patt code -- for this to be able to work.  This is
because the transaction layer can't resolve dependencies between
edges, only between nodes.

This damages some optimizations, mostly having to do with compression
of dependency enumerations (McKusick uses bitmaps for this;
technically, you could still do this, but at a node, not an edge,
granularity.  I *believe* that the losses would be minimal, but I
don't have a proof-of-concept implementation of both [yet] to be able
to benchmark the monolithic vs. the nodal implementations against
each other).

So yes, it's possible to do generically, if you think of FS's in terms
of events, operations, and dependencies between operations, and if
you have soft update technology, and if you implement it correctly.
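
For what it's worth, here is how the exposed interface might look
from user space under this model.  The TRANSACT ioctls below are
completely made up (nothing like them exists in the tree); they are
just a sketch of Example 2:

/*
 * Hypothetical user-space view of a "TRANSACT" interface exported by
 * a transactioning NULLFS-style layer.  The ioctl numbers, the mount
 * point, and the file names are all invented for illustration.
 */
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define	TRANSACT_BEGIN	_IOW('T', 1, int)	/* hypothetical */
#define	TRANSACT_COMMIT	_IOW('T', 2, int)	/* hypothetical */
#define	TRANSACT_ABORT	_IOW('T', 3, int)	/* hypothetical */

int
main(void)
{
	int fd, txid = 0;

	/* Mount point of the (hypothetical) transactioning layer. */
	fd = open("/mnt/txfs", O_RDONLY);
	if (fd == -1) {
		perror("open");
		return (1);
	}

	/* Begin: subsequent operations become one event group. */
	if (ioctl(fd, TRANSACT_BEGIN, &txid) == -1)
		goto bad;

	/* Two deletes that must succeed or fail as a unit. */
	if (unlink("/mnt/txfs/a") == -1 || unlink("/mnt/txfs/b") == -1) {
		(void)ioctl(fd, TRANSACT_ABORT, &txid);
		goto bad;
	}

	/* Commit: the layer adds the terminal-node dependency edge. */
	if (ioctl(fd, TRANSACT_COMMIT, &txid) == -1)
		goto bad;

	close(fd);
	return (0);
bad:
	perror("transact");
	close(fd);
	return (1);
}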
Probably, it's easier at this point to just write a JFS and not worry
about making everything work the way it should.  Going the other way
would probably be worth a PhD to someone looking for a Doctoral
project.  ;-).

[ ...automatic aggregation of volumes... ]

> >CCD can do this as well.
>
> How?  I thought CCD built the logical volumes from /etc/ccd.conf
> on bootup.  Thus, moving around disks would require editing
> /etc/ccd.conf on bootup.  XLV does this all automatically.
>
> Or has this been added in -current ?

No, but it is a trivial hack to add it.

Julian's "slice" and "devfs" code provide "volume arrival" events.
These events do not propagate as far as I want them to (an arriving
device with no claimers would be handed to the VFS for mount, in my
model), but you could easily put CCD in as a slice type handler.

When a device "arrives", you would read the first block and see if it
has a "CCDLABEL"; if it does, CCD swallows it.  The label contains:

1)	A magic number ("CCDLABEL" is as good as any) that means "I am
	a member of a volume set".

2)	A volume set identifier; probably just a unique-across-volumes
	32 bit identifier.

3)	A count of devices in the volume.

4)	An ID indicating which of the devices in #3 the current device
	represents.

When CCD gets N of N devices, it exports "/dev/ccd0".  (A rough sketch
of what such a label might look like is at the end of this message.)

It would probably be a good idea to have a "no more devices" event
propagated to CCD as well; this would let it recognize a failed
RAID-5 subassembly (for example), but it's not absolutely necessary.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
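
[ The sketch referred to above: a guess at what a self-describing CCD
  member label might look like.  The struct, the field names, and the
  layout are mine, for illustration only -- nothing like this is in
  the current CCD code. ]

/*
 * Hypothetical on-disk label for an auto-assembling CCD member; it
 * just encodes the four items listed above.  Not part of any
 * existing CCD implementation.
 */
#include <stdint.h>

#define	CCD_MAGIC	"CCDLABEL"	/* item 1: magic number */

struct ccd_autolabel {
	char	 cl_magic[8];	/* "CCDLABEL", no NUL terminator	*/
	uint32_t cl_volsetid;	/* item 2: volume set identifier	*/
	uint32_t cl_ndevs;	/* item 3: devices in the volume	*/
	uint32_t cl_devidx;	/* item 4: which member this device is	*/
};

/*
 * On a "volume arrival" event, the slice handler reads the first
 * block, compares cl_magic against CCD_MAGIC, and, on a match, adds
 * the device to the set named by cl_volsetid.  When it has collected
 * cl_ndevs members, it exports /dev/ccd0.
 */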