Date: Wed, 7 Feb 2001 18:40:21 -0500 (EST)
From: Zhiui Zhang
To: Terry Lambert
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
In-Reply-To: <200102072209.PAA25657@usr08.primenet.com>

Thanks for your email! Although I think I have a fairly good
understanding of the FFS code (not soft updates) from actually studying
and modifying it, I still have a long way to go to understand the bigger
picture which you have described.

-Zhihui

On Wed, 7 Feb 2001, Terry Lambert wrote:

> > I am considering the design of a journalled file system in FreeBSD. I
> > think each transaction corresponds to a file system update operation and
> > will therefore consist of a list of modified buffers. The important
> > thing is that these buffers should not be written to disk until they have
> > been logged into the log area. To do so, we need to pin these buffers in
> > memory for a while. The concept should be simple, but I have run into a
> > problem which I have no idea how to solve:
> >
> > If you access a lot of files quickly, some vnodes will be reused. These
> > vnodes can contain buffers that are still pinned in memory because of
> > the write-ahead logging constraints. After a vnode is gone, we have
> > no way to recover its buffers. Note that whenever we need a new vnode, we
> > are in the process of creating a new file.
> > At this point, we cannot flush
> > the buffers to the log area. The result is a deadlock.
> >
> > I could make copies of the buffers that are still pinned, but that incurs
> > a memory copy and needs buffer headers, which are also a scarce resource.
> >
> > The design is similar to ext3fs of Linux (they do not seem to have a vnode
> > layer and they use device + physical block number instead of vnode +
> > logical block number to index buffers, which, I guess, means that buffers
> > can exist after the inode is gone). I know McKusick has a paper on
> > journalling FFS, but I just want to know if this design can work or not.
>
> Soft updates provides this guarantee. It's one approach.
>
> If you look at the Ganger/Patt paper, it's pretty obvious that
> the solution to the graph dependency problem could be generalized.
>
> This would let you externalize hooks into the graph, so that you
> could have dependencies span stacking layers, or so that you could
> externalize a transaction interface to user space, or so that you
> could implement a distributed cache coherency protocol, over a
> network transport, on the bottom end.
>
>
> In the limit, though, it means that you should think of an FS in
> terms of a set of ordered metadata and data transactions, and then
> simply ensure that transactions are handled in sufficient order
> ("sufficient" means that FFS can lose data, but never become
> inconsistent; a journalled FS would not have this luxury).
>
> For journalling, this is a slightly tougher problem, since you
> must include the idea of data consistency, not just metadata
> consistency, but the problem is not insoluble.
>
> Starting from first principles, you should look at the transactions
> you intend to support. You should probably _not_ commit to a
> storage paradigm (e.g. "... similar to ext3fs of Linux ... "),
> until _after_ you have mapped out the operations, and what they
> imply about conflict domains (e.g.
> several objects in one disk
> block, or one page, which is what leads to much of the complexity
> of the FFS soft updates implementation).
>
> Probably the first thing you will notice is that the VOP_ABORT
> semantics are horribly broken: I noticed the same thing, when
> looking at implementing a writeable NTFS for Windows 95/98/2000,
> using the Heidemann framework ported from FreeBSD.
>
> I would say that you were also constrained by POSIX guaranteed
> semantics, though it would be convenient to be able to turn most
> of these off, to avoid vnode/data seeks. That said, this is an
> anecdotal conclusion from some recent literature (don't trust it
> until you can conclude what the effect will be under
> non-single-threaded FS load).
>
>
> NB: I was unable to convince either Ganger or McKusick of the idea
> of generalization, where on mount you register conflict resolvers
> into a dependency graph, which you maintain as stacking is done and
> undone, and VOPs are added and removed. Both cited different
> reasons for objecting. Kirk objected to what he saw as a larger
> in-core dependency accounting storage requirement. IMO, Kirk's
> reasons were not really correct, since any given dependency could
> be expressed and resolved using the same structures. I was unable
> to provide a proof of concept due to license issues, which I very
> well understand Kirk wanting to enforce at the time. Gregory had
> different objections, which I laid off to familiarity with graph
> theory (you _can_ maintain a running accounting of transitive
> closure over a graph, particularly one that doesn't change except
> on mount or unmount), but I wouldn't dismiss either of them on
> the basis of their gut feelings (I trust mine, but they trust
> theirs, which is right for them to do).
>
> That aside, even if you don't do a generalized implementation, the
> approach of considering an FS in terms of transactions (events) is
> still sound, and I think most modern FS researchers would agree with
> the approach, even if they did not agree on implementation.
>
>
> Terry Lambert
> terry@lambert.org
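
For illustration only, here is a minimal Python sketch of the write-ahead rule discussed in this thread: modified buffers stay pinned until the transaction that touched them has been appended to the log, and buffers are keyed by device plus physical block number (as the original question says ext3fs appears to do) rather than by vnode. The class `Journal`, its methods, and the device name "ad0" are all hypothetical. Nothing here is a FreeBSD or ext3fs interface, and the toy ignores crash recovery, log space limits, and concurrency.

```python
class Journal:
    """Toy write-ahead log with a single open transaction (hypothetical)."""

    def __init__(self):
        self.log = []      # committed transactions, oldest first
        self.disk = {}     # (dev, blkno) -> data written in place
        self.running = {}  # pinned buffers of the open transaction

    def modify(self, dev, blkno, data):
        # Key by device + physical block, so the buffer needs no vnode.
        self.running[(dev, blkno)] = data

    def commit(self):
        # Append the log record first; only then do the buffers unpin.
        record = dict(self.running)
        self.log.append(record)
        self.running = {}
        return record

    def write_in_place(self, dev, blkno):
        key = (dev, blkno)
        if key in self.running:
            # Write-ahead rule: not logged yet, so it must not reach disk.
            raise RuntimeError("buffer %r is still pinned" % (key,))
        # Write the newest committed copy of the block to its home location.
        for record in reversed(self.log):
            if key in record:
                self.disk[key] = record[key]
                return
        raise KeyError(key)


journal = Journal()
journal.modify("ad0", 128, b"inode block")
journal.modify("ad0", 4096, b"directory block")

try:
    journal.write_in_place("ad0", 128)  # attempted before the commit
    blocked = False
except RuntimeError:
    blocked = True

journal.commit()
journal.write_in_place("ad0", 128)      # allowed once the record is logged
```

Because the pinned set in this model is indexed without reference to a vnode, recycling a vnode would not orphan its buffers. Whether that is workable inside the FreeBSD buffer cache is exactly the open question in the thread.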