From: Terry Lambert
Subject: Crash recovery: SU vs. LFS vs. JFS
To: mbendiks@eunet.no (Marius Bendiksen)
Cc: tuinstra@clarkson.edu (Dwight Tuinstra), freebsd-fs@FreeBSD.ORG (freebsd-fs)
Date: Thu, 21 Sep 2000 21:08:52 +0000 (GMT)

> > As a long-term graduate research project, I've been looking into
> > the code for LFS (Log-structured File System) on NetBSD.  Such a
> > system is optimized for many small writes, and given the amounts
> > of RAM available for read caches nowadays, should deliver read
> > performance comparable to (or not much worse than) FFS.  Additionally,
> > LFS should provide better and faster crash recovery than either FFS
> > or journaling file systems.
>
> Research (IIRC, Seltzer and Matthews) has shown that FFS outperforms
> LFS when the FFS clustering code has been activated.  The crash
> recovery times can supposedly be alleviated by soft updates; I've not
> looked at that yet.  Journalling crash recovery vs. LFS crash recovery
> is more complex than a mere comparison of speed, as these can be tuned
> in both cases.

The crash recovery of soft updates can be sped up considerably, in
theory.  In practice, you can't tell whether the reason for the crash
was an FS fault or some other fault.

Further, for most drives, if you had a DC failure at the drive in the
middle of an actual write, you can get single-sector format corruption
(my personal opinion is that if this is possible, so is multiple-sector
corruption, if timed just right).  This means that you need NVRAM for
at least sector logging, and at most track logging, in order to ensure
replay-based de-corruption of in-progress writes.  Further, consider
that the drive may have a track cache, which can result in data out of
range (for the OS) being written, and the corruption occurring there.

The _only_ software approach for soft updates is really soft read-only,
and that only works if the system is quiescent at the time of the crash
(soft read-only in force, and the FS marked clean).  This ignores the
possibility of a software failure (i.e., the failure is assumed to be
hardware or power).  In the software failure case, there is no telling
what data was corrupted, or whether some of it was written to disk
before the corruption hit something the system noticed sufficiently to
actually fail.

Thus, soft updates are not a good strategy for fast failure recovery.

---

LFS is probably fastest, but does not support implied relationships
between data.

For example, if I have a data file and an index file, and I two-stage
commit an update by writing new data and then a new index, and the
failure occurs between these two operations, I potentially lose my
transaction.  If this were a bank transaction, I would be really
hurting in the wallet.  If it were a different kind of transaction, it
might carry less financial risk, but the correctness risk is the same.

NB: the database above obviously writes the new record to a different
record slot, so that the old one is still a valid record in case of
failure; the operation is thus non-atomic, but idempotent.
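To make the window concrete, here is a minimal user-level sketch of
that two-stage commit.  The file names, record layout, and helper
functions are hypothetical (this is not LFS or database code, just an
illustration): the new record is appended to the data file and synced,
and only then is the index slot rewritten to point at it.  A crash
between the two stages loses the update, but the old record the index
still points at remains valid, and re-running the update does no harm.

/*
 * Hypothetical two-stage commit: records.dat holds fixed-size records,
 * records.idx holds one offset per key.  Non-atomic, but idempotent.
 */
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECSIZE 512

/* Stage one: append the new record image and force it to disk. */
static off_t
write_new_record(int datafd, const char *rec)
{
	char buf[RECSIZE];
	off_t off;

	memset(buf, 0, sizeof(buf));
	strncpy(buf, rec, sizeof(buf) - 1);
	if ((off = lseek(datafd, 0, SEEK_END)) == -1)
		return (-1);
	if (write(datafd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
		return (-1);
	if (fsync(datafd) == -1)
		return (-1);
	return (off);
}

/* Stage two: point index slot 'key' at the new record and force it out. */
static int
update_index(int idxfd, int key, off_t off)
{
	if (pwrite(idxfd, &off, sizeof(off),
	    (off_t)key * sizeof(off)) != (ssize_t)sizeof(off))
		return (-1);
	return (fsync(idxfd));
}

int
main(void)
{
	int datafd, idxfd;
	off_t off;

	if ((datafd = open("records.dat", O_RDWR | O_CREAT, 0644)) == -1 ||
	    (idxfd = open("records.idx", O_RDWR | O_CREAT, 0644)) == -1)
		err(1, "open");

	if ((off = write_new_record(datafd, "new balance: $17")) == -1)
		err(1, "write_new_record");

	/*
	 * Crash window: if the system dies here, the index still points
	 * at the old record; the transaction is simply lost.
	 */

	if (update_index(idxfd, 0, off) == -1)
		err(1, "update_index");

	return (0);
}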
The way LFS normally recovers is to go back to the oldest timestamped
and valid-marked log as "the correct state of the FS", and to discard
any partial logs (and, with them, the implied metadata).

---

JFS is slower.  A JFS recovers nearly the same way as LFS, but must
look at outstanding transactions, and actually back them out if they
are incomplete, or roll them forward, if it can.  The difference
between these is whether the journal contains only a journal of events
that have transpired, or a journal of both events that have transpired
and events intended to transpire but not yet committed to the disk.

This means that a JFS will keep implied relationships between data
intact, so long as it is signalled before and after a transaction
involving an implied relationship takes place.  So long as the
transaction completion is not signalled to the client application
until after the transaction has been committed (this requires another
hook to user space), we have no problem with bank transactions.  (A
minimal sketch of such an intent journal follows at the end of this
message.)

---

Someone said that LFS logs data and metadata, but a JFS logs only
metadata, and that this is the difference.  Obviously, this is untrue;
a JFS logs transactions.  Whether those transactions include only
metadata, or both metadata and data, is really a JFS implementation
detail, not an attribute of JFSes themselves.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
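Here is the minimal sketch referred to above of an "intended to
transpire" journal, with hypothetical file names and record layout and
no claim of matching any particular JFS implementation.  The point is
the ordering: the intent record is synced to the journal before the
data is touched, the completion record is synced afterwards, and only
then is the client told the transaction committed; recovery backs out
or rolls forward any transaction that has an intent record but no
completion record.

/*
 * Hypothetical write-ahead intent journal: log intent, sync, do the
 * work, log completion, sync, and only then acknowledge the client.
 */
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct jrec {
	uint64_t txid;		/* transaction id */
	uint32_t state;		/* 0 = intent, 1 = committed */
	char	 what[52];	/* description of the operation */
};

static void
journal_append(int jfd, uint64_t txid, uint32_t state, const char *what)
{
	struct jrec r;

	memset(&r, 0, sizeof(r));
	r.txid = txid;
	r.state = state;
	strncpy(r.what, what, sizeof(r.what) - 1);
	if (write(jfd, &r, sizeof(r)) != (ssize_t)sizeof(r) ||
	    fsync(jfd) == -1)
		err(1, "journal_append");
}

int
main(void)
{
	int jfd;

	if ((jfd = open("intent.log",
	    O_WRONLY | O_CREAT | O_APPEND, 0644)) == -1)
		err(1, "open");

	journal_append(jfd, 1, 0, "move $17 from A to B");	/* intent */

	/*
	 * ... perform the data and index writes here; a crash now leaves
	 * an intent record with no completion record, so recovery knows
	 * to back the transaction out or roll it forward.
	 */

	journal_append(jfd, 1, 1, "move $17 from A to B");	/* committed */

	/* Only now is the client application told the transaction is done. */
	return (0);
}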