From owner-freebsd-fs  Fri Jan 17  2:31:20 2003
Message-ID: <3E27DA7F.D5DBEFB@mindspring.com>
Date: Fri, 17 Jan 2003 02:27:11 -0800
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: David Schultz <dschultz@uclink.Berkeley.EDU>
Cc: Jason Schoonover <jason_jks@yahoo.com>, freebsd-fs@FreeBSD.ORG
Subject: JFS vs. Soft Updates (again) (was: Re: large filesystem, journaling 
 filesystem support)
References: <20030114192634.75751.qmail@web13505.mail.yahoo.com> <20030117075118.GA3493@HAL9000.homeunix.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

This posting is in favor of a JFS.  It gives detailed technical
arguments for why some of the claims people are making about
soft updates are actually incorrect.

For the record, Kirk McKusick has stated on FreeBSD -arch that
background fsck has the problems I note, in passing, below.


> FreeBSD uses softupdates, which achieves similar efficiency and
> reliability goals to journaling. With softupdates, you don't need
> to fsck at all at boot time following a power failure or crash
> because the worst case scenario (hardware failure aside) is that
> some disk space that is really free is marked as allocated.

No, the worst case following a power failure is a screwed disk
track.

Modern disk drives read and write a track at a time; this is to
avoid the rotational latency that would happen if you waited for
a hard "sector start" marker to come around, and it avoids the
need for "low level formatting".  For a very small window of time
in the late 1990s, two manufacturers, IBM and Quantum, created
disk drives which were capable of using rotational energy as a
power source (regenerative braking) to complete a write in
progress following a DC failure (this provided a small
post-failure hold-up time).

Modern disk drives no longer do this, because disk manufacturers
are morons (or one was a moron, and the others had to compete on
price, which amounts to the same thing).

The net result is that a DC failure can result in an entire track
getting trashed, if it happens at the right time.

So why is this important?

Soft updates optimizes for sector writing, not track writing,
while journalling can journal on the basis of track-sized
extents.

That only holds if the journalling code is written correctly.
There are a number of technical challenges to writing it
correctly, and SGI, IBM, and Linux haven't done it, but it's
theoretically possible, though very hard on IDE; it is much easier
on SCSI, because the physical geometry can be accessed via mode
page 2.

The upshot of this is that a journalled FS can recover any
damage from a power failure, if needs be, whereas if this were
to happen on a disk protected by soft updates, you are screwed.
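
To make the track argument concrete, here is a toy simulation.  It
is only a sketch under stated assumptions: the names ToyDisk,
TrackJournal and PowerFailure, the 8-byte track size, and the
in-memory list standing in for the journal area are all invented
for illustration, and the journal itself is assumed to survive the
failure (a real one would need checksums to detect a torn journal
write).  It shows a whole-track image being journalled before the
in-place write, so that a track trashed by a DC failure can be
rewritten on recovery.

```python
TRACK_SIZE = 8  # bytes per track in this toy model (an assumption)
GARBAGE = b"\xff" * TRACK_SIZE


class PowerFailure(Exception):
    """Raised when the simulated DC supply drops during a write."""


class ToyDisk:
    def __init__(self, ntracks):
        self.tracks = [bytes(TRACK_SIZE) for _ in range(ntracks)]

    def write_track(self, n, image, power_fails=False):
        assert len(image) == TRACK_SIZE
        if power_fails:
            # The drive was rewriting the whole track, so all of it is lost,
            # including sectors the file system never meant to touch.
            self.tracks[n] = GARBAGE
            raise PowerFailure(n)
        self.tracks[n] = image


class TrackJournal:
    def __init__(self, disk):
        self.disk = disk
        self.log = []  # stands in for an on-disk journal area, assumed intact

    def update(self, n, offset, payload, power_fails=False):
        old = self.disk.tracks[n]
        new = old[:offset] + payload + old[offset + len(payload):]
        self.log.append((n, new))                   # 1. journal the track image
        self.disk.write_track(n, new, power_fails)  # 2. write it in place
        self.log.pop()                              # 3. retire the journal entry

    def replay(self):
        """Rewrite every track whose in-place write never completed."""
        for n, image in self.log:
            self.disk.write_track(n, image)
        self.log.clear()


disk = ToyDisk(ntracks=2)
journal = TrackJournal(disk)
journal.update(0, 0, b"AAAABBBB")   # track 0 holds two unrelated "sectors"
try:
    journal.update(0, 4, b"CCCC", power_fails=True)
except PowerFailure:
    pass
damaged = disk.tracks[0]            # whole track is garbage, "AAAA" included
journal.replay()
recovered = disk.tracks[0]          # both the old and the new data are back
```

Note that the "AAAA" half of the track was never part of the
update, yet it is lost along with everything else on the track;
only a record of the full track image brings it back.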

Journalling and soft updates are orthogonal technologies; they
do not solve the same problem space, although there is some minor
overlap.


> In FreeBSD 5.0, you can actually run fsck in the background at any
> time to reclaim this space.

In fact, this is not true.  You can only run a fsck in the
background in the case that you know that the failure mode
was a power failure, and that no data was corrupted.  This is
not something you can know for certain without CMOS.

A panic failure situation may result in corrupted disk buffers
that are flushed to the disk, prior to the panic.

A hardware failure can result in a similar failure.

And a power failure can result in a corrupt track.

In all three of these cases, a background fsck is unable to
recover the system appropriately.  Neither is it possible to
mount the FS read-only, and make it read/write on a cylinder
group basis, following a fsck, until all of the areas of the
disks have been checked, else it's possible to load and run
corrupt code that then corrupts a previously OK area of the
disk.
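
The read/write gating problem can be sketched as follows.  This is
an illustration only: the function names and the two policy labels
are hypothetical, and "checked" is just a dictionary from cylinder
group number to whether fsck has passed that group.  The point it
models is that as long as any group is unchecked, nothing may be
writable, because code loaded from an unchecked group could write
into a group that has already been checked.

```python
def writable_groups(checked, policy):
    """Return the set of cylinder groups that may be written right now.

    `checked` maps a cylinder group number to True once fsck has passed it.
    Both policy names are hypothetical labels for this sketch.
    """
    if policy == "per-group":
        return {cg for cg, ok in checked.items() if ok}
    if policy == "all-or-nothing":
        return set(checked) if all(checked.values()) else set()
    raise ValueError(policy)


def exposed(checked, policy):
    """True if unchecked (possibly corrupt) data coexists with writable groups."""
    unchecked = {cg for cg, ok in checked.items() if not ok}
    return bool(unchecked) and bool(writable_groups(checked, policy))


partway = {0: True, 1: True, 2: False}  # fsck has not reached group 2 yet
done = {0: True, 1: True, 2: True}      # every group has been checked
```

With the per-group policy, exposed(partway, "per-group") is true,
which is the hazard described above; the all-or-nothing policy
keeps everything read-only until the whole disk has been checked.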

The only reasonable fix is a CMOS area that contains a failure
condition code.

Unfortunately, one of FreeBSD's failure modes is a spontaneous
reboot.  That is the normal failure mode for PC hardware on a
triple fault, which may occur as a result of a condition that
should have resulted in a panic (corruption of kernel memory), if
the memory so corrupted is the GDT, or if certain other types of
failures occur.

Thus the only safe way of dealing with this in the soft updates
case is a DC holdup circuit whose sole job is to write a "power
fail" code into NVRAM, which can be read out by the OS.  This
means that the first thing an OS should do on a successful
recovery, after reading that value, is to write a non-"power fail"
code into the CMOS, so that it can differentiate a power failure
from a soft failure.
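
The read-then-rearm sequence can be sketched like this.  Everything
here is hypothetical: ToyNvram, the three code values, and the
mapping from cause to fsck mode are assumptions made for the
example, not an existing FreeBSD or PC firmware interface.  In
particular, treating a "power fail" code as safe for background
fsck assumes the holdup circuit also let the write in progress
finish.

```python
POWER_FAIL = "power-fail"  # written by the (hypothetical) DC holdup circuit
RUNNING = "running"        # written by the OS right after reading the code
CLEAN = "clean"            # written by the OS at orderly shutdown


class ToyNvram:
    """One byte of CMOS/NVRAM, modelled as a plain attribute."""

    def __init__(self, value=CLEAN):
        self.value = value


def boot(nvram):
    """Read the failure code, rearm it, and pick a recovery mode."""
    cause = nvram.value
    # Rearm immediately: if the next boot still sees RUNNING, the machine
    # went down without the holdup circuit firing (panic, triple fault, ...).
    nvram.value = RUNNING
    if cause == CLEAN:
        return "no fsck"
    if cause == POWER_FAIL:
        return "background fsck"
    return "foreground fsck"


nvram = ToyNvram()
first = boot(nvram)            # previous shutdown was orderly
nvram.value = POWER_FAIL       # AC drops; the holdup circuit records it
second = boot(nvram)
third = boot(nvram)            # spontaneous reboot: nobody updated the code
```

The third boot finds the code still set to RUNNING, which is the
only way the OS can tell a spontaneous reboot apart from a power
failure.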

PC hardware has no such assistance for OS's, despite Microsoft
and Intel attempting to accelerate the recovery process (maybe
they simply didn't think about the problem in sufficient detail
to realize hardware help is needed).

A journaling FS has the same vulnerability to corrupt kernel
buffers that were written out, but not the same vulnerability
on recovery, as it does not need to distinguish reboots due to
power failure from reboots due to other causes (because it can
be insensitive to the difference, by being insensitive to single
track failures from write in progress).

The upshot of this is that a journalling FS can recover using
an abbreviated process, with only software CMOS cause notification,
without needing special hardware additions for "power fail"
differentiation.


> That said, there is some limited
> interest in porting a journaling filesystem to FreeBSD.  Several
> people have started, but I don't know if anyone has finished.

Part of the disincentive here is that people keep saying that
Soft Updates is "just as good as journalling" or "solves the same
problem space journalling solves", etc., when it doesn't, and
the technologies are actually complementary.

People should stop claiming this, when it isn't true.

If you want to talk about the overlap, fine; but don't claim that
soft updates or background fsck adequately solves the loss of power
problem, unless you happen to have an IBM drive from 1997.

Personally, I would welcome a journalling FS on FreeBSD.  It
would have saved us the cost of a custom power supply that
provided DC holdup and AC fail notification.  While it was
significantly cheaper than a UPS, and we were able to make the
change because of soft updates, it would have been even cheaper
if we could have avoided the problem entirely.

The biggest problem, to my mind, that adoption of a journalling
filesystem by FreeBSD keeps hitting its head on, is that people
keep wanting to port GPL'ed JFS code to FreeBSD, not understanding
that it's impossible for a GPL'ed FS to ever be the default for
FreeBSD, because the GPL specifically prohibits use of other
licenses in statically linked code, and the boot file system must
be statically linked into the code in order to mount root, and to
load kernel modules.

If you want to write a JFS for FreeBSD: fine; but if you are going
to start with third party code, be sure that code is under the BSD
license, so that your FS can ship in a binary and usable form on
the CDROM.

-- Terry
