From owner-freebsd-fs Wed Feb 7 14:10:17 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp03.primenet.com (smtp03.primenet.com [206.165.6.133])
	by hub.freebsd.org (Postfix) with ESMTP id CDBAC37B401
	for ; Wed, 7 Feb 2001 14:09:57 -0800 (PST)
Received: (from daemon@localhost)
	by smtp03.primenet.com (8.9.3/8.9.3) id PAA13495;
	Wed, 7 Feb 2001 15:06:59 -0700 (MST)
Received: from usr08.primenet.com(206.165.6.208)
	via SMTP by smtp03.primenet.com, id smtpdAAAq5aisA;
	Wed Feb 7 15:06:49 2001
Received: (from tlambert@localhost)
	by usr08.primenet.com (8.8.5/8.8.5) id PAA25657;
	Wed, 7 Feb 2001 15:09:43 -0700 (MST)
From: Terry Lambert
Message-Id: <200102072209.PAA25657@usr08.primenet.com>
Subject: Re: Design a journalled file system
To: zzhang@cs.binghamton.edu (Zhiui Zhang)
Date: Wed, 7 Feb 2001 22:09:43 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG
In-Reply-To: from "Zhiui Zhang" at Feb 06, 2001 04:15:45 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> I am considering the design of a journalled file system in FreeBSD.  I
> think each transaction corresponds to a file system update operation and
> will therefore consist of a list of modified buffers.  The important
> thing is that these buffers should not be written to disk until they
> have been logged into the log area.  To do so, we need to pin these
> buffers in memory for a while.  The concept should be simple, but I ran
> into a problem which I have no idea how to solve:
>
> If you access a lot of files quickly, some vnodes will be reused.
> These vnodes can contain buffers that are still pinned in memory
> because of the write-ahead logging constraints.  After a vnode is gone,
> we have no way to recover its buffers.  Note that whenever we need a
> new vnode, we are in the process of creating a new file.
> At this point, we cannot flush the buffers to the log area.  The
> result is a deadlock.
>
> I could make copies of the buffers that are still pinned, but that
> incurs a memory copy and needs buffer headers, which are also a scarce
> resource.
>
> The design is similar to ext3fs of Linux (they do not seem to have a
> vnode layer, and they use device + physical block number instead of
> vnode + logical block number to index buffers, which, I guess, means
> that buffers can exist after the inode is gone).  I know McKusick has
> a paper on journalling FFS, but I just want to know if this design can
> work or not.

Soft updates provides this guarantee.  It's one approach.

If you look at the Ganger/Patt paper, it's pretty obvious that the
solution to the graph dependency problem could be generalized.  This
would let you externalize hooks into the graph, so that you could have
dependencies span stacking layers, or so that you could externalize a
transaction interface to user space, or so that you could implement a
distributed cache coherency protocol, over a network transport, on the
bottom end.

In the limit, though, it means that you should think of an FS in terms
of a set of ordered metadata and data transactions, and then simply
ensure that transactions are handled in sufficient order ("sufficient"
means that FFS can lose data, but never become inconsistent; a
journalled FS would not have this luxury).

For journalling, this is a slightly tougher problem, since you must
include the idea of data consistency, not just metadata consistency,
but the problem is not insoluble.

Starting from first principles, you should look at the transactions
you intend to support.  You should probably _not_ commit to a storage
paradigm (e.g. "... similar to ext3fs of Linux ..."), until _after_
you have mapped out the operations, and what they imply about conflict
domains (e.g.
several objects in one disk block, or one page, which is what leads to
much of the complexity of the FFS soft updates implementation).

Probably the first thing you will notice is that the VOP_ABORT
semantics are horribly broken: I noticed the same thing when looking
at implementing a writeable NTFS for Windows 95/98/2000, using the
Heidemann framework ported from FreeBSD.

I would say that you are also constrained by POSIX guaranteed
semantics, though it would be convenient to be able to turn most of
these off, to avoid vnode/data seeks; that is an anecdotal conclusion
from some recent literature, though (don't trust it until you can
conclude what the effect will be under non-single-threaded FS load).

NB: I was unable to convince either Ganger or McKusick of the idea of
generalization, where on mount you register conflict resolvers into a
dependency graph, which you maintain as stacking is done and undone,
and VOPs are added and removed.  Both cited different reasons for
objecting.

Kirk objected to what he saw as a larger in-core dependency accounting
storage requirement.  IMO, Kirk's reasons were not really correct,
since any given dependency could be expressed and resolved using the
same structures.  I was unable to provide a proof of concept due to
license issues, which I very well understand Kirk wanting to enforce
at the time.

Gregory had different objections, which I put down to familiarity with
graph theory (you _can_ maintain a running accounting of transitive
closure over a graph, particularly one that doesn't change except on
mount or unmount), but I wouldn't dismiss either of them on the basis
of their gut feelings (I trust mine, but they trust theirs, which is
right for them to do).

That aside, even if you don't do a generalized implementation, the
approach of considering an FS in terms of transactions (events) is
still sound, and I think most modern FS researchers would agree with
the approach, even if they did not agree on implementation.
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message