From owner-freebsd-chat Thu Sep 28 18:16:41 2000
Delivered-To: freebsd-chat@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP id 6670837B422
	for ; Thu, 28 Sep 2000 18:16:12 -0700 (PDT)
Received: (from daemon@localhost) by smtp03.primenet.com (8.9.3/8.9.3)
	id SAA29308; Thu, 28 Sep 2000 18:14:45 -0700 (MST)
Received: from usr05.primenet.com(206.165.6.205) via SMTP by smtp03.primenet.com,
	id smtpdAAArma4j5; Thu Sep 28 18:14:35 2000
Received: (from tlambert@localhost) by usr05.primenet.com (8.8.5/8.8.5)
	id SAA06192; Thu, 28 Sep 2000 18:15:58 -0700 (MST)
From: Terry Lambert
Message-Id: <200009290115.SAA06192@usr05.primenet.com>
Subject: Re: SGI releases XFS under GPL
To: jrs@enteract.com (John Sconiers)
Date: Fri, 29 Sep 2000 01:15:58 +0000 (GMT)
Cc: blk@skynet.be (Brad Knowles), wjv@cityip.co.za (Johann Visagie),
	chat@FreeBSD.ORG
In-Reply-To: from "John Sconiers" at Sep 28, 2000 09:18:23 AM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-chat@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> > However, XFS doesn't have "softupdates", and I don't know of any
> > way to apply something like "softupdates" to it.  And for what we're
> > doing, I'm not sure how much it matters to us to have something like
> > Veritas VxFS on our machines if that meant we'd have to give up
> > "softupdates".
> >
> > All-in-all, I'm just not sure if the overall net change would be
> > a positive or a negative, and for whom.
>
> Can you please explain the difference between XFS and soft updates,
> and why soft updates would be more desirable than a journaling file
> system.  I understand what XFS is, but based on your comments I get
> the feeling that I have the wrong impression of what soft updates is
> and how it performs.  I know there are papers on the subject(s).
> Anyone got a link?
Here is a link to an abstract that has links to the cover sheet, the
paper, and "Appendix A", which is the sanitized source code from their
SVR4 implementation:

	http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

Soft updates ensures metadata consistency; that's all it is supposed
to do, and that's all it does.  It has the same safety as synchronous
metadata mounts, but can operate within 6% of memory speed; in some
cases, this turns out to be better than pure async mounts, since it
tends to write gather operations which reverse themselves, such as the
creation of a file followed by its deletion, as you might see during a
compile.  Async would still hit the disk twice, whereas soft updates
would hit the disk zero times for the same set of operations.

Soft updates is tied heavily into graph theory, and the idea that FS
operations can be broken down into synchronization events.  A
synchronization event must be completed before the next
synchronization attempt is permitted.  In traditional systems, this
has been guaranteed by stalling all events until the synchronization
event has been completed; this is a "synchronous mount".

A later system, patented by USL, uses a technique called "DOW", or
"Delayed Ordered Writes".  This technique only stalls the pipeline on
metadata-related synchronization events; that is, one metadata
synchronization event must complete before the next is permitted, but
asynchronous events other than that are permitted.  It gains its speed
increase from delaying writes between synchronization events.  This
gives it the ability to effectively do implicit write gathering of
non-metadata writes, which can also be gathered with related metadata
writes at the stall point.  This method is superior to async and to
simple write gathering, since it does not violate NFS or POSIX
semantics on guarantees of things like timestamp updates with regard
to async data writes and updating the file modification time, etc.
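To make the write-gathering point concrete, here is a toy model of my
own (purely illustrative; the names and structure are not the soft
updates implementation): pending metadata operations sit in memory,
and a create that is later reversed by a delete is cancelled outright
instead of costing two disk writes.

```python
# Hypothetical sketch of write gathering.  A create followed by a
# delete of the same file cancels in memory, so syncing writes nothing
# to disk for that pair -- whereas a pure async mount would still have
# queued both writes.

class WriteGatherer:
    def __init__(self):
        self.pending = {}       # name -> pending metadata operation
        self.disk_writes = 0    # writes that actually reached the disk

    def create(self, name):
        # A bare create just becomes a pending metadata write.
        self.pending[name] = "create"

    def delete(self, name):
        if self.pending.get(name) == "create":
            # The two operations reverse each other: zero disk writes.
            del self.pending[name]
        else:
            self.pending[name] = "delete"

    def sync(self):
        # Flush whatever survived gathering; one disk write per entry.
        self.disk_writes += len(self.pending)
        self.pending.clear()

g = WriteGatherer()
g.create("a.o")
g.delete("a.o")     # compile-style create-then-delete
g.sync()
assert g.disk_writes == 0   # the pair never touched the disk
```

A file that is created and survives to the sync still costs its one
write; only the reversing pairs disappear.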
NB: ReiserFS uses this same technique in order to implement their
logging; I personally believe that this is an infringement of the USL
patent.

Soft updates maintains a dependency graph of metadata events; the
benefit of doing this is to ensure you can stall a write to metadata
to ensure proper ordering.  But unlike the DOW technique, because the
graph is fully known to the system, rather than implied by the stall,
the stall will only affect dependent metadata writes.  This means that
if I have two sets of operations going on, for example to create two
files in two different directories simultaneously, where DOW would
stall one operation until the other had completed, soft updates will
not result in a stall of the second operation.

A stall will still occur on a directory that has multiple operations
occurring simultaneously (or sequentially, such as a create followed
by a rename) in the same directory entry block, and the directory
modification timestamp update will also be serialized.  But on a
heavily loaded machine, each process will have what is called
"locality of reference", which is just a way of saying "most programs
operate on independent data sets, and so don't ever conflict with each
other on their operations".

A common misconception about soft updates is that you can get the same
failure recovery that you would get from journalling or logging.  In
theory, it looks like you could, since in the event of a power
failure, for example, the only thing that would be out of date is the
cylinder group bitmaps, and the way that they are out of date is by
having some blocks within the cylinder group marked as allocated, when
the metadata state at the time of the crash had not been committed.
With this true, you could scan the disk in the background, locking one
cylinder group at a time to clean the bitmap, and unlocking it when
you are done.
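A minimal sketch of the dependency-graph point (again my own
illustration, not the real soft updates structures): because the graph
is explicit, a write is only held back by the writes it actually
depends on, so two creates in independent directories never stall each
other.

```python
# Illustrative dependency graph for pending metadata writes.  Each
# node is a write; its dependency set lists the writes that must reach
# the disk first.  Node names like "dir1/inode" are invented for the
# example.

class DepGraph:
    def __init__(self):
        self.deps = {}          # write -> set of prerequisite writes

    def add(self, write, depends_on=()):
        self.deps[write] = set(depends_on)

    def flushable(self, flushed):
        # Writes whose prerequisites have all reached the disk.
        return {w for w, d in self.deps.items()
                if w not in flushed and d <= flushed}

g = DepGraph()
# Creating a file in dir1: the inode must be on disk before the
# directory entry that points at it.
g.add("dir1/inode")
g.add("dir1/entry", depends_on={"dir1/inode"})
# A simultaneous create in dir2 shares no edges with dir1's work.
g.add("dir2/inode")
g.add("dir2/entry", depends_on={"dir2/inode"})

# Both inodes are immediately flushable; neither directory stalls the
# other, where a DOW-style global stall would serialize them.
assert g.flushable(set()) == {"dir1/inode", "dir2/inode"}
assert "dir1/entry" in g.flushable({"dir1/inode"})
```

Two operations in the *same* directory entry block would share an edge
in this graph, which is exactly the serialization described above.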
Locality of reference means that you will probably stall some programs
for a tiny amount of time, if they are intent on doing I/O within the
cylinder group being fixed, but this is a far cry from waiting a long
time for 36G of disk to be scanned in detail.

The flaw with this theory is that a power failure is not the only type
of crash you could have, and running after any crash that can corrupt
any portion of the disk (e.g. most disks corrupt sectors if power is
lost during a write, and in the event of a kernel panic, you don't
know what data was corrupted in core, then erroneously written to disk
before the actual panic, etc.) puts you at risk of further disk
corruption and user space software failures.  In the worst case, the
crash was the result of a hardware failure of the disk subsystem
(disk, controller, cables, terminator, etc.).  So it is impossible to
recover without an exterior log of the events leading up to the crash
(this is how the WAFL file system from Network Appliance works: it
uses an NVRAM intention log).

Going further down the soft updates road, there's really no reason to
assume that the UFS and FFS pieces are the only things in the
dependency tree.  The shape of the dependency tree was "frozen" when
soft updates was coded, but this need not have been.  There is
actually no good reason that the dependency tree shape should be
static; indeed, the system only knows that it's traversing pointers;
how they got into the arrangement they are in, the system doesn't
care.  Neither does the argument hold that the graph would take more
memory than it currently takes; the shorthand structures in use in
soft updates today could remain the same: they describe inter-node
ordering relationships along an edge between nodes.  This means that,
should the approach be generalized, which would take a small amount of
work, it could work between stacking layers.
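A purely hypothetical sketch of that generalization (nothing here
exists in any real implementation; all names are invented): if the
graph treats nodes as opaque and only records ordering edges, a node
contributed by a quota stacking layer can be ordered against a UFS
write exactly the way two UFS writes are ordered against each other.

```python
# Hypothetical generalized dependency graph: nodes are opaque strings
# tagged with an invented "layer:object" naming scheme, and the graph
# neither knows nor cares which stacking layer registered a node.

class LayeredDepGraph:
    def __init__(self):
        self.deps = {}              # node -> set of prerequisite nodes

    def register(self, node, depends_on=()):
        self.deps.setdefault(node, set()).update(depends_on)
        for d in depends_on:
            self.deps.setdefault(d, set())

    def flush_order(self):
        # Naive topological sort: one valid order for pushing the
        # pending writes to disk while honoring every edge.
        done, order = set(), []
        while len(done) < len(self.deps):
            ready = [n for n, d in self.deps.items()
                     if n not in done and d <= done]
            if not ready:
                raise RuntimeError("dependency cycle")
            for n in sorted(ready):
                done.add(n)
                order.append(n)
        return order

g = LayeredDepGraph()
# A quota layer's record update is ordered after the UFS inode write
# beneath it, with no global inter-layer synchronization point.
g.register("ufs:inode")
g.register("quota:record", depends_on={"ufs:inode"})
g.register("ufs:direntry", depends_on={"ufs:inode"})
order = g.flush_order()
assert order.index("ufs:inode") < order.index("quota:record")
```

The same mechanism is what would let artificial dependencies express a
transaction, or let edges span a network layer for clustering.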
At mount time, a node-node dependency resolver could be registered
into the graph, at the same time the node relationships are
registered, by virtue of the mount.  This would let you do some
marvelous things, such as separating out the quota into a stacking
layer, without losing soft updates capability (the inter-layer
boundary is otherwise an implied synchronization point, which is
global in scope, turning the soft updates approach into the DOW
approach, for all intents and purposes).  Or you could use artificial
dependencies to export a transactioning interface to user space
database applications.  Or you could propagate dependency
relationships across a network connection layer, between machines, and
do FS clustering.  The possibilities are really huge.

I've talked with Yale, Kirk, and Greg about generalizing this before;
the thing that stopped me from doing it on my own was the license on
Kirk's code making it so that I might be unable to grant license to
the code without Kirk also granting license.  Now that that has
changed, I will probably put it on my projects list, after two or
three others near the top, since I think it's important to pursue this
approach: it opens so many avenues for additional research and
technological progress.

In any case, since you're familiar with XFS, you should be able to see
that metadata integrity is one aspect of XFS, and one aspect of soft
updates, but the technologies could in fact complement each other
tremendously: they are future partners, not competitors, since in the
majority of cases, what they bring to the table, other than metadata
integrity guarantees, is non-overlapping.  Indeed, the XFS metadata
integrity could probably be sped up considerably, if only through
benefit of the implicit write gathering of soft updates (something
that can't happen with XFS as it is).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message