From owner-freebsd-fs  Mon Feb 12 12:51:55 2001
Delivered-To: freebsd-fs@freebsd.org
Received: from deliverator.sgi.com (deliverator.sgi.com [204.94.214.10])
	by hub.freebsd.org (Postfix) with ESMTP id B4A2D37B491
	for <freebsd-fs@FreeBSD.ORG>; Mon, 12 Feb 2001 12:51:50 -0800 (PST)
Received: from ledzep.americas.sgi.com (ledzep.americas.sgi.com [137.38.226.97]) by deliverator.sgi.com (980309.SGI.8.8.8-aspam-6.2/980310.SGI-aspam) via ESMTP id MAA03252; Mon, 12 Feb 2001 12:50:26 -0800 (PST)
	mail_from (cattelan@thebarn.com)
Received: from gibble.americas.sgi.com (gibble.americas.sgi.com [128.162.195.80]) by ledzep.americas.sgi.com (SGI-SGI-8.9.3/americas-smart-nospam1.1) with ESMTP id OAA82176; Mon, 12 Feb 2001 14:51:27 -0600 (CST)
Received: from thebarn.com (localhost [127.0.0.1])
	by gibble.americas.sgi.com (8.11.0/8.11.0) with ESMTP id f1CKoR029400;
	Mon, 12 Feb 2001 14:50:27 -0600
Message-ID: <3A884C91.56037FFE@thebarn.com>
Date: Mon, 12 Feb 2001 14:50:26 -0600
From: Russell Cattelan <cattelan@thebarn.com>
X-Mailer: Mozilla 4.76 [en] (X11; U; Linux 2.4.1-XFS i686)
X-Accept-Language: en
MIME-Version: 1.0
To: Zhiui Zhang <zzhang@cs.binghamton.edu>
Cc: freebsd-fs@FreeBSD.ORG
Subject: Re: Design a journalled file system
References: <Pine.SOL.4.21.0102121516200.13995-100000@opal>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

Zhiui Zhang wrote:

> On Mon, 12 Feb 2001, Russell Cattelan wrote:
>
> > > Another difficulty is that if several transactions are in progress at the
> > > same time, we must remember which metadata buffers are modified by which
> > > transactions. When we copy/rename the buffer, we must inform those
> > > transactions the fact that we did the copy/rename.  The buffers modified
> > > by one transaction must be flushed at the same time.
>
> Thanks for your reply. I mean if a transaction locks down all the metadata
> (e.g., bitmap blocks) it modified until it commits, then there is no
> problem (but this reduces concurrency). Otherwise, the same metadata
> blocks can contain modifications done by more than one transaction.

This really isn't a problem... meta data buffers have to be "pinned", but not
necessarily locked. A meta data buffer can be modified many times without
having to be written out to disk. Take the super block, for example: it will
get flushed out to disk occasionally, but since it is modified so often,
most changes never get flushed individually. A log record of each of those
changes will be in every transaction that touched the super block, but the
super block itself doesn't have to be written out every time.
The primary goal is to have a consistent file system, not to be able
to roll back every change that happens.

> I do
> not know how XFS solves this problem.  Since XFS uses B+ trees, I guess
> that locking can be done hierarchically to avoid deadlock.
> But in FFS, the bitmap blocks have no relationship with each other. Locking
> the bitmap blocks in FFS in arbitrary order can cause deadlock, I guess.
>
> IBM JFS seems to use incore log implemented as page cache. XFS has
> pagebuf.  I expect that is something similar to IBM's page cache.
>
> > Hmm I'm not sure what the problem is here.
> > A transaction log entry will log all changes necessary to complete
> > that transaction, even if it involves multiple meta data objects, which it
> > almost always does.
> > In the event of a crash and subsequent replay of the log, the recovery code
> > will make sure all the meta data on the disk is consistent with the log.
> > If one meta data write happened but another one didn't, the recovery
> > code only updates the one that didn't complete.
> >
> > What is the size of the disk block container on BSD buf_t's?
> > If they are 64-bit we shouldn't have a problem... simply use absolute disk
> > addressing for meta data items.
> > Why would we need to copy a meta data buf_t?
> >
>
> In sys/buf.h of FreeBSD, it has:
>
>    daddr_t b_lblkno;               /* Logical block number. */
>    daddr_t b_blkno;                /* Underlying physical block number. */
>
> Both are 32-bit integers. I am not sure why they are not 64-bit. Maybe it has
> something to do with the merged buffer cache.

Ok, good, so we have a spot to store the absolute block number.
Assuming these are in units of 512 bytes, this will work up to 2TB.
Linux has the same 2TB limit right now...

>
> -Zhihui

--
Russell Cattelan
--
Digital Elves inc. -- Currently on loan to SGI
Linux XFS core developer.

