From owner-freebsd-fs Wed Feb 7 14:10:17 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP id CDBAC37B401
	for ; Wed, 7 Feb 2001 14:09:57 -0800 (PST)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id PAA13495;
	Wed, 7 Feb 2001 15:06:59 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
	via SMTP by smtp03.primenet.com, id smtpdAAAq5aisA;
	Wed Feb 7 15:06:49 2001
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id PAA25657;
	Wed, 7 Feb 2001 15:09:43 -0700 (MST)
From: Terry Lambert
Message-Id: <200102072209.PAA25657@usr08.primenet.com>
Subject: Re: Design a journalled file system
To: zzhang@cs.binghamton.edu (Zhiui Zhang)
Date: Wed, 7 Feb 2001 22:09:43 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG
In-Reply-To: from "Zhiui Zhang" at Feb 06, 2001 04:15:45 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> I am considering the design of a journalled file system in FreeBSD.  I
> think each transaction corresponds to a file system update operation and
> will therefore consist of a list of modified buffers.  The important
> thing is that these buffers should not be written to disk until they
> have been logged into the log area.  To do so, we need to pin these
> buffers in memory for a while.  The concept should be simple, but I ran
> into a problem which I have no idea how to solve:
>
> If you access a lot of files quickly, some vnodes will be reused.
> These vnodes can contain buffers that are still pinned in memory
> because of the write-ahead logging constraints.  After a vnode is gone,
> we have no way to recover its buffers.  Note that whenever we need a
> new vnode, we are in the process of creating a new file.
> At this point, we cannot flush the buffers to the log area.  The
> result is a deadlock.
>
> I could make copies of the buffers that are still pinned, but that
> incurs a memory copy and needs buffer headers, which are also a scarce
> resource.
>
> The design is similar to ext3fs of Linux (they do not seem to have a
> vnode layer, and they use device + physical block number instead of
> vnode + logical block number to index buffers, which, I guess, means
> that buffers can exist after the inode is gone).  I know McKusick has
> a paper on journalling FFS, but I just want to know if this design can
> work or not.

Soft updates provides this guarantee.  It's one approach.

If you look at the Ganger/Patt paper, it's pretty obvious that the
solution to the graph dependency problem could be generalized.  This
would let you externalize hooks into the graph, so that you could have
dependencies span stacking layers, or so that you could externalize a
transaction interface to user space, or so that you could implement a
distributed cache coherency protocol, over a network transport, on the
bottom end.

In the limit, though, it means that you should think of an FS in terms
of a set of ordered metadata and data transactions, and then simply
ensure that transactions are handled in sufficient order ("sufficient"
means that FFS can lose data, but never become inconsistent; a
journalled FS would not have this luxury).

For journalling, this is a slightly tougher problem, since you must
include the idea of data consistency, not just metadata consistency,
but the problem is not insoluble.

Starting from first principles, you should look at the transactions
you intend to support.  You should probably _not_ commit to a storage
paradigm (e.g. "... similar to ext3fs of Linux ..."), until _after_
you have mapped out the operations, and what they imply about conflict
domains (e.g.
several objects in one disk block, or one page, which is what leads to
much of the complexity of the FFS soft updates implementation).

Probably the first thing you will notice is that the VOP_ABORT
semantics are horribly broken: I noticed the same thing when looking
at implementing a writeable NTFS for Windows 95/98/2000, using the
Heidemann framework ported from FreeBSD.

I would say that you are also constrained by POSIX guaranteed
semantics, though it would be convenient to be able to turn most of
these off, to avoid vnode/data seeks; that is an anecdotal conclusion
from some recent literature, though (don't trust it until you can
conclude what the effect will be under non-single-threaded FS load).

NB: I was unable to convince either Ganger or McKusick of the idea of
generalization, where on mount you register conflict resolvers into a
dependency graph, which you maintain as stacking is done and undone,
and VOPs are added and removed.  Both cited different reasons for
objecting.

Kirk objected to what he saw as a larger in-core dependency accounting
storage requirement.  IMO, Kirk's reasons were not really correct,
since any given dependency could be expressed and resolved using the
same structures.  I was unable to provide a proof of concept due to
license issues, which I very well understand Kirk wanting to enforce
at the time.

Gregory had different objections, which I put down to familiarity with
graph theory (you _can_ maintain a running accounting of transitive
closure over a graph, particularly one that doesn't change except on
mount or unmount), but I wouldn't dismiss either of them on the basis
of their gut feelings (I trust mine, but they trust theirs, which is
right for them to do).

That aside, even if you don't do a generalized implementation, the
approach of considering an FS in terms of transactions (events) is
still sound, and I think most modern FS researchers would agree with
the approach, even if they did not agree on implementation.
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message