From owner-freebsd-fs  Wed Feb  7 15:41: 0 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from bingnet2.cc.binghamton.edu (bingnet2.cc.binghamton.edu [128.226.1.18])
	by hub.freebsd.org (Postfix) with ESMTP id D787B37B503
	for <freebsd-fs@FreeBSD.ORG>; Wed,  7 Feb 2001 15:40:40 -0800 (PST)
Received: from onyx (onyx.cs.binghamton.edu [128.226.140.171])
	by bingnet2.cc.binghamton.edu (8.11.2/8.11.2) with ESMTP id f17NeWI21997;
	Wed, 7 Feb 2001 18:40:32 -0500 (EST)
Date: Wed, 7 Feb 2001 18:40:21 -0500 (EST)
From: Zhiui Zhang <zzhang@cs.binghamton.edu>
X-Sender: zzhang@onyx
To: Terry Lambert <tlambert@primenet.com>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
In-Reply-To: <200102072209.PAA25657@usr08.primenet.com>
Message-ID: <Pine.SOL.4.21.0102071833210.3918-100000@onyx>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


Thanks for your email! Although I think I have a fairly good understanding
of the FFS code (not soft updates) from studying and modifying it,
I still have a long way to go to understand the bigger picture that you
have described.

-Zhihui

On Wed, 7 Feb 2001, Terry Lambert wrote:

> > I am considering the design of a journalled file system in FreeBSD. I
> > think each transaction corresponds to a file system update operation and
> > will therefore consist of a list of modified buffers.  The important
> > thing is that these buffers should not be written to disk until they have
> > been logged into the log area. To do so, we need to pin these buffers in
> > memory for a while. The concept should be simple, but I have run into a
> > problem that I have no idea how to solve:
> > 
> > If you access a lot of files quickly, some vnodes will be reused.  These
> > vnodes can contain buffers that are still pinned in memory because of
> > the write-ahead logging constraints.  After a vnode is gone, we have
> > no way to recover its buffers. Note that whenever we need a new vnode, we
> > are in the process of creating a new file. At this point, we cannot flush
> > the buffers to the log area.  The result is a deadlock.
> > 
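The write-ahead constraint described above can be illustrated with a small
user-space model in C. This is only a sketch, not FreeBSD kernel code:
struct jbuf, struct txn, txn_add(), txn_commit() and
jbuf_can_write_in_place() are hypothetical names invented for illustration,
and the actual log write is represented by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TXN_MAX_BUFS 8

/* Hypothetical in-memory buffer; NOT the FreeBSD struct buf. */
struct jbuf {
	long blkno;	/* block number of the buffer's home location */
	bool dirty;	/* modified since it was last written */
	bool pinned;	/* modified by a transaction that is not yet logged */
};

/* Hypothetical transaction: the list of buffers one update modified. */
struct txn {
	struct jbuf *bufs[TXN_MAX_BUFS];
	size_t nbufs;
	bool committed;
};

/* Record a modified buffer in the transaction and pin it in memory. */
bool
txn_add(struct txn *t, struct jbuf *b)
{
	if (t->committed || t->nbufs == TXN_MAX_BUFS)
		return false;
	b->dirty = true;
	b->pinned = true;
	t->bufs[t->nbufs++] = b;
	return true;
}

/* A dirty buffer may go to its home location only once it is unpinned. */
bool
jbuf_can_write_in_place(const struct jbuf *b)
{
	return b->dirty && !b->pinned;
}

/* Commit: after the log write completes, the buffers can be unpinned. */
void
txn_commit(struct txn *t)
{
	/*
	 * A real implementation would write every buffer and a commit
	 * record to the log area here and wait for the I/O to finish.
	 */
	for (size_t i = 0; i < t->nbufs; i++)
		t->bufs[i]->pinned = false;
	t->committed = true;
}
```

In this model the deadlock arises when a pinned jbuf must be given up
(because its vnode is being recycled) at a point where txn_commit() cannot
be called.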
> > I could make copies of the buffers that are still pinned, but that incurs
> > a memory copy and needs buffer headers, which are also a scarce resource.
> > 
> > The design is similar to ext3fs of Linux (they do not seem to have a vnode
> > layer and they use device + physical block number instead of vnode +
> > logical block number to index buffers, which, I guess, means that buffers
> > can exist after the inode is gone). I know McKusick has a paper on
> > journalling FFS, but I just want to know if this design can work or not.
> 
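The indexing difference guessed at above can be sketched as follows. This is
an illustrative model only, not ext3 or Linux code: struct pbuf,
struct pbuf_cache, pbuf_lookup() and pbuf_get() are hypothetical names. The
point is that a lookup keyed by (device, physical block) takes no vnode
argument, so a pinned buffer stays reachable whatever happens to the file
object that created it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_SLOTS 16

/* Hypothetical cache entry keyed by device and physical block number. */
struct pbuf {
	int dev;	/* device identifier */
	long pblkno;	/* physical block number on that device */
	bool in_use;	/* slot holds a valid buffer */
	bool pinned;	/* held in memory until it has been logged */
};

struct pbuf_cache {
	struct pbuf slot[CACHE_SLOTS];
};

/* Find a cached buffer by (dev, pblkno); no vnode is involved. */
struct pbuf *
pbuf_lookup(struct pbuf_cache *c, int dev, long pblkno)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		struct pbuf *b = &c->slot[i];
		if (b->in_use && b->dev == dev && b->pblkno == pblkno)
			return b;
	}
	return NULL;
}

/* Return the cached buffer, or claim a free slot; NULL if the cache is full. */
struct pbuf *
pbuf_get(struct pbuf_cache *c, int dev, long pblkno)
{
	struct pbuf *b = pbuf_lookup(c, dev, pblkno);
	if (b != NULL)
		return b;
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		b = &c->slot[i];
		if (!b->in_use) {
			b->dev = dev;
			b->pblkno = pblkno;
			b->in_use = true;
			b->pinned = false;
			return b;
		}
	}
	return NULL;
}
```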
> Soft updates provides this guarantee.  It's one approach.
> 
> If you look at the Ganger/Patt paper, it's pretty obvious that
> the solution to the graph dependency problem could be generalized.
> 
> This would let you externalize hooks into the graph, so that you
> could have dependencies span stacking layers, or so that you could
> externalize a transaction interface to user space, or so that you
> could implement a distributed cache coherency protocol, over a
> network transport, on the bottom end.
> 
> 
> In the limit, though, it means that you should think of an FS in
> terms of a set of ordered metadata and data transactions, and then
> simply ensure that transactions are handled in sufficient order
> ("sufficient" means that FFS can lose data, but never become
> inconsistent; a journalled FS would not have this luxury).
> 
> For journalling, this is a slightly tougher problem, since you
> must include the idea of data consistency, not just metadata
> consistency, but the problem is not insoluble.
> 
> Starting from first principles, you should look at the transactions
> you intend to support.  You should probably _not_ commit to a
> storage paradigm (e.g. "... similar to ext3fs of Linux ... "),
> until _after_ you have mapped out the operations, and what they
> imply about conflict domains (e.g. several objects in one disk
> block, or one page, which is what leads to much of the complexity
> of the FFS soft updates implementation).
> 
> Probably the first thing you will notice is that the VOP_ABORT
> semantics are horribly broken: I noticed the same thing, when
> looking at implementing a writeable NTFS for Windows 95/98/2000,
> using the Heidemann framework ported from FreeBSD.
> 
> I would say that you were also constrained by POSIX guaranteed
> semantics, though it would be convenient to be able to turn most
> of these off, to avoid vnode/data seeks, though this is an anecdotal
> conclusion from some recent literature (don't trust it until you
> can conclude what the effect will be under non-single-threaded FS
> load).
> 
> 
> NB: I was unable to convince either Ganger or McKusick of the idea
> of generalization, where on mount you register conflict resolvers
> into a dependency graph, which you maintain as stacking is done and
> undone, and VOPs are added and removed.  Both cited different
> reasons for objecting.  Kirk objected to what he saw as a larger
> in-core dependency accounting storage requirement.  IMO, Kirk's
> reasons were not really correct, since any given dependency could
> be expressed and resolved using the same structures.  I was unable
> to provide a proof of concept due to license issues, which I very
> well understand Kirk wanting to enforce at the time.  Gregory had
> different objections, which I laid off to familiarity with graph
> theory (you _can_ maintain a running accounting of transitive
> closure over a graph, particularly one that doesn't change except
> on mount or unmount), but I wouldn't dismiss either of them on
> the basis of their gut feelings (I trust mine, but they trust
> theirs, which is right for them to do).
> 
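The parenthetical claim above, that a running transitive closure can be
maintained as edges are added, can be sketched in C. This is a generic
incremental-closure routine over a small fixed-size graph, not the soft
updates code and not any proposed FreeBSD interface: struct depgraph,
depgraph_add_edge() and depgraph_has_cycle() are hypothetical names, and the
nodes are assumed to stand for whatever is registered at mount time.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 8

/* reach[i][j] is true when node i transitively depends on node j. */
struct depgraph {
	bool reach[MAX_NODES][MAX_NODES];
};

void
depgraph_init(struct depgraph *g)
{
	memset(g, 0, sizeof(*g));
}

/*
 * Add the dependency from -> to and update the closure in place.
 * Every node that already reaches 'from' (and 'from' itself) now also
 * reaches 'to' and everything 'to' reaches.  Cost is O(n^2) per edge.
 */
void
depgraph_add_edge(struct depgraph *g, int from, int to)
{
	assert(from >= 0 && from < MAX_NODES);
	assert(to >= 0 && to < MAX_NODES);
	for (int i = 0; i < MAX_NODES; i++) {
		if (i != from && !g->reach[i][from])
			continue;
		for (int j = 0; j < MAX_NODES; j++) {
			if (j == to || g->reach[to][j])
				g->reach[i][j] = true;
		}
	}
}

/* A node that reaches itself means the dependencies form a cycle. */
bool
depgraph_has_cycle(const struct depgraph *g)
{
	for (int i = 0; i < MAX_NODES; i++) {
		if (g->reach[i][i])
			return true;
	}
	return false;
}
```

Because the graph only changes on mount or unmount, the quadratic update
cost is paid rarely, while each dependency query is a single array lookup.
Edge removal is not handled here; the simplest approach in this sketch would
be to rebuild the closure from the remaining edges.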
> That aside, even if you don't do a generalized implementation, the
> approach of considering an FS in terms of transactions (events) is
> still sound, and I think most modern FS researchers would agree with
> the approach, even if they did not agree on implementation.
> 
> 
> 					Terry Lambert
> 					terry@lambert.org

