Date: Sat, 25 May 2002 08:30:04 -0700 (PDT) From: Salvo Bartolotta <bartequi@neomedia.it> To: freebsd-doc@FreeBSD.org Subject: Re: docs/30008: This document should be translated, commented and added Message-ID: <200205251530.g4PFU4b99187@freefall.freebsd.org>
The following reply was made to PR docs/30008; it has been noted by GNATS.
From: Salvo Bartolotta <bartequi@neomedia.it>
To: freebsd-gnats-submit@FreeBSD.org, 3d@FreeBSD.org
Cc:
Subject: Re: docs/30008: This document should be translated, commented and added
Date: Sat, 25 May 2002 17:29:10 +0200 (CEST)
Dear FreeBSD doc'ers,
I've translated the central part (i.e. part III) of the document. This draft,
which I submit for your review/comments/flames/whatever, will (hopefully) give
you the gist of Pornin's article.
Although I have benefited from a number of effective suggestions from Giorgos
(very kind and helpful, as always), I alone am to blame for
anything wrong/queer/inconsistent. Shame on me (if any :-)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<title>Advanced Fault Tolerant Methods</title>
<meta name="GENERATOR" content=
"Modular DocBook HTML Stylesheet Version 1.71 ">
<link rel="HOME" title=
"Softupdates and Journaling Filesystems" href=
"index.html">
<link rel="PREVIOUS" title=
"Write Caching and reboot" href="x30.html">
<link rel="NEXT" title="Different Questions" href="x95.html">
</head>
<body class="SECT1" bgcolor="#FFFFFF" text="#000000" link=
"#0000FF" vlink="#840084" alink="#0000FF">
<div class="NAVHEADER">
<table summary="Header navigation table" width="100%" border=
"0" cellpadding="0" cellspacing="0">
<tr>
<th colspan="3" align="center">Softupdates and Journaling filesystems</th>
</tr>
<tr>
<td width="10%" align="left" valign="bottom"><a href=
"x30.html" accesskey="P">Previous</a></td>
<td width="80%" align="center" valign="bottom">
</td>
<td width="10%" align="right" valign="bottom"><a href=
"x95.html" accesskey="N">Next</a></td>
</tr>
</table>
<hr align="LEFT" width="100%">
</div>
<div class="SECT1">
<h1 class="SECT1"><a name="AEN47">3. Advanced Fault Tolerant Methods</a></h1>
<p>Let us specify, incidentally, what the hardware can guarantee:
each write of a sector (512 bytes) is atomic, i.e. once it has
been started, it is completed even if the power goes down,
the kernel crashes, or the processor catches fire.</p>
<div class="SECT2">
<h2 class="SECT2"><a name="AEN50">3.1. Deferred ordered
write</a></h2>
<p>First of all, people proposed the "deferred ordered write":
metadata updates are asynchronous, but they are performed in
the correct order. That is, the system quickly returns
control to applications, saying "ok, everything is all right,
writes have been carried out", but it performs the writes in the
background at disk speed, paying attention to their order. For
example, creating numerous files in the same directory actually
involves numerous updates of the same disk region, and the
system can group them together and carry them out in one single
access. Yet metadata updates are ordered, that is, there
are dependencies between various updates: when a file is created
in a directory A and then another file is created in the same
directory, the second operation needs to take place at the same
time as the first one, or after it -- certainly not before.</p>
<p>Deferred ordered writes pose the following problem: it is
easy to create cyclic dependencies, which block the system or
else require a non-atomic update, and so a crash at the "wrong"
moment puts us in a delicate position. This is rare, but Murphy
arranges for it to happen. Typical example: I move a file from
directory A to directory B, and, almost simultaneously, I
move a file from directory B to directory A.</p>
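<p>The two crossing moves can be pictured as a small dependency
graph. The Python below is only an illustrative sketch, not code taken
from any kernel: the block names and the <code>find_cycle</code> helper
are hypothetical, and the rule "write the destination directory before
the source directory" is an assumption about how each move constrains
the order of writes.</p>

```python
def find_cycle(must_precede):
    """Return a list of nodes forming a cycle, or None if there is none.

    must_precede maps a node to the set of nodes that have to be
    written before it.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in sorted(must_precede.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: the loop is closed.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(must_precede):
        if color.get(start, WHITE) == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# Moving a file from A to B: write B's new entry before clearing A's old one.
single_move = {"dir_A_block": {"dir_B_block"}}

# Adding the opposite move (B to A) makes each block wait for the other.
crossing_moves = {
    "dir_A_block": {"dir_B_block"},
    "dir_B_block": {"dir_A_block"},
}
```

<p>With these hypothetical inputs, <code>find_cycle(single_move)</code>
finds nothing, whereas <code>find_cycle(crossing_moves)</code> reports a
loop through both directory blocks: no write order satisfies both
constraints at once.</p>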
</div>
<div class="SECT2">
<h2 class="SECT2"><a name="AEN54">3.2. Softupdates</a></h2>
<p>To solve this problem, people developed "softupdates".
It is derived from a paper by Ganger and Patt (from the
University of Michigan). The *BSD implementation comes from
a certain Mr. McKusick (a key player in the original BSD
project). As far as I understand, the work was reportedly
sponsored by Sun (which is interested in including it in
Solaris), and an agreement was reportedly made: once the code
has been debugged, it will be released under the BSD license, which is
not yet the case at present. Once the change in
license has taken place, FreeBSD will include the code in its
kernel by default; currently, it is necessary to recompile the
kernel in order to get softupdates. I don't know what NetBSD
and OpenBSD will do. Probably the same.</p>
<p>The principle of softupdates consists in maintaining a two-stage
wait queue: updates arrive in a wait buffer first, and then
they pass, one by one, to a second buffer, where dependencies
are checked. If an update completes a dependency loop, it is
sent back to the wait buffer to await better times; the rest
of the cycle passes to a list with higher update priority.
This algorithm is similar to what CVS does in order to merge
various modifications of the same file.</p>
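<p>The two buffers described above can be mimicked in a few lines of
Python. This is a toy model of the paragraph's description only, not
the actual softupdates code: the function names, the
<code>depends_on</code> mapping and the three result lists are all
hypothetical, and promoting every staged update that the offending one
depends on is a simplifying assumption.</p>

```python
from collections import deque


def reachable_staged(update, staged, depends_on):
    """Staged updates that `update` depends on, directly or indirectly."""
    seen = set()
    todo = list(depends_on.get(update, ()))
    while todo:
        current = todo.pop()
        if current in staged and current not in seen:
            seen.add(current)
            todo.extend(depends_on.get(current, ()))
    return seen


def two_stage(arrivals, depends_on):
    wait = deque(arrivals)   # first-level buffer: updates in arrival order
    staged = []              # second-level buffer: dependencies checked here
    urgent = []              # rest of a loop, to be written with priority
    sent_back = []           # updates returned to the wait buffer
    while wait:
        update = wait.popleft()
        upstream = reachable_staged(update, staged, depends_on)
        # A loop closes when something `update` needs also needs `update`.
        if any(update in depends_on.get(other, ()) for other in upstream):
            sent_back.append(update)
            for other in list(staged):
                if other in upstream:
                    staged.remove(other)
                    urgent.append(other)
        else:
            staged.append(update)
    return staged, urgent, sent_back


# Hypothetical updates: "a" and "b" need each other, "c" is independent.
result = two_stage(["a", "b", "c"], {"a": {"b"}, "b": {"a"}, "c": set()})
```

<p>In this run, "b" closes the loop and is sent back, "a" (the rest of
the cycle) moves to the priority list, and "c" stays in the second
buffer.</p>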
<p>In fact, softupdates entails this:</p>
<ul>
<li>
<p>Good filesystem performance, even when decompressing
numerous small files. My benchmarks show that decompressing
the egcs sources takes 10% more time than on ext2, on
the same machine and on the same portion of the disk -- your
mileage may vary, as the 'Mericans say, with your
hardware.</p>
</li>
<li>
<p>Excellent crash tolerance. I would even say that fsck
is guaranteed to recover everything by itself, unless the crash
is due to the disk itself (in which case whatever it does
before stopping is immaterial; however, no filesystem
can tolerate that).</p>
</li>
<li>
<p>Specifically, the FFS implementation ensures upward
compatibility: the filesystem is unmounted, then it is
remounted without softupdates, and this works perfectly.
That's a painless upgrade.</p>
</li>
<li>
<p>Fsck takes a long time, since it has to traverse the entire
filesystem.</p>
</li>
<li>
<p>When a file is deleted, its space is not immediately
freed for reuse; this can take as much as 30 seconds.
This is because the wait buffer is not kept in any order; therefore the
system does not traverse the information contained therein
when seeking free blocks for its files; when a file is
deleted, its blocks can thus be reallocated only when
the related update reaches the second-level buffer.
In practice, it is not a big deal, but you might run into
trouble when you do a "make world" (recompilation and
reinstallation of the base system, on every good BSD
system): since the whole system is reinstalled in a short
time, the binaries in /bin, in particular, are deleted,
and new ones are immediately placed there again, which
produces "frictional occupation" [Cf. "frictional
unemployment": here English parallels French. N.o.T.]. If
the saturation point is reached, it means trouble. I have
run into this case myself; it is not fiction. This is
typical of systems with a small / partition, since it is
separate from /var, /usr, and /tmp.</p>
</li>
</ul>
<p>Let us note that in the case of fsck it is theoretically
possible to accelerate recovery significantly. Essentially,
it would be a matter of performing updates in such a way that,
in case of a crash, the only inconvenience would be missing
blocks, that is, blocks that had not yet been returned to the free
block pool although they are no longer referenced anywhere in the
filesystem. In this case, the filesystem could be used again
immediately, and fsck could be run in the background. That
would fulfil point 3. This possibility has been suggested;
I do not know whether it will be carried out, but it clearly should be
[The feature is already implemented in FreeBSD 5-CURRENT. N.o.T.].</p>
<p>As a whole, softupdates is a fine mechanism, elegant and
effective. Cf. <a href=
"http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/"
target=
"_top">http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/</a></p>
</div>
<div class="SECT2">
<h2 class="SECT2"><a name="AEN73">3.3. Log-structured
filesystems</a></h2>
<p>There are also "log-structured filesystems". The idea is
simple: all writes (data and metadata) are done in an
uninterrupted flow of operations. Effort is shifted onto
reading, since finding a piece of data may be rather complicated
in such a scheme. Actually, it is necessary to "garbage-collect"
the flow of operations (the log) retrospectively in order to
find the requisite information. There are some implementations,
more or less at the prototype stage, for BSD and Linux, named LFS
(cf <a href=
"http://collective.cpoint.net/prof/lfs/" target=
"_top">http://collective.cpoint.net/prof/lfs/</a> for Linux).
In a filesystem, reads are usually more frequent than writes.
This is not the case for what lives in /var/log, where LFS
can be practical. Nevertheless, the use of LFS is marginal.</p>
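<p>The trade-off between cheap writes and costly reads can be seen in
a toy append-only store. The Python below is only a sketch under
simplifying assumptions: it works in memory, the <code>LogStore</code>
class and its method names are hypothetical, and it is not modelled on
any real LFS implementation.</p>

```python
class LogStore:
    """Append-only log of (key, value) records; nothing is updated in place."""

    def __init__(self):
        self.log = []

    def write(self, key, value):
        # A write is always a cheap append to the end of the log.
        self.log.append((key, value))

    def read(self, key):
        # A read has to search the log, newest record first.
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        raise KeyError(key)

    def collect_garbage(self):
        # Keep only the newest record for each key, preserving log order.
        latest = {}
        for position, (key, _) in enumerate(self.log):
            latest[key] = position
        self.log = [record for position, record in enumerate(self.log)
                    if latest[record[0]] == position]


store = LogStore()
store.write("inode 7", "version 1")
store.write("inode 9", "version 1")
store.write("inode 7", "version 2")
```

<p>Here the log holds three records until
<code>store.collect_garbage()</code> drops the superseded one;
<code>store.read("inode 7")</code> returns the newest version either
way, at the cost of scanning the log.</p>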
</div>
<div class="SECT2">
<h2 class="SECT2"><a name="AEN77">3.4.
Journaling</a></h2>
<p>Finally, there is journaling. Journaling is, as it were,
"transactional": when the system wants to make a series of
updates, it builds a new version of the related metadata in a
different place in the filesystem; then, when this new version
(called a "transaction", a concept borrowed from databases) is
ready, it switches to the new version in one "atomic"
operation [atomic relates to "atomos", a Greek word meaning
"indivisible". Here it indicates that the system switches to
the new version (when it is ready) in one single operation,
thereby preventing any possible data corruption or "intermediate"
states. N.o.T.]. Thus the filesystem is always in a consistent
state.</p>
<p>To be more precise [warning: several technical details follow.
N.o.T.]: when metadata updates need to be performed, the new
version is built in a particular region of the disk, namely the
journal.</p>
<p>Incidentally, in ext3, the journal is a file like
any other, referenced by a special superblock field. The
final version of ext3 will automatically create the journal
if it is not present, and will hide it from the filesystem,
which will prevent its accidental deletion.</p>
<p>The preparation of the new version entails the inclusion of
all the requisite items; in particular, if there are any circular
dependencies, the whole cycle is contained in it. Once the new version
is ready, a commit operation is performed: the "good" sector
is modified so as to point to the new version instead of the
old one. Next, the new version is copied over the old one,
and a second commit is performed to free its space in the
journal.</p>
<p>As a side note, you could simply consider marking the journal
modified, but this would fragment it too much; since it is
always used, this is not desirable at all.</p>
<p>After a crash, recovery is necessary;
it consists of traversing the journal in order to:</p>
<ul>
<li>
<p>discard transactions not yet finished.</p>
</li>
<li>
<p>finish copying transactions for which the first commit,
but not the second, has been performed.</p>
</li>
</ul>
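<p>The two recovery rules above can be sketched as a single pass over
the journal. This Python is an illustration under stated assumptions,
not the recovery code of ext3 or of any other filesystem: the
dictionary layout, the flag names "committed" and "released", and the
<code>recover</code> function are all hypothetical.</p>

```python
def recover(journal, disk):
    """Replay a journal after a crash and return the discarded transactions.

    journal: list of dicts, each with
        "blocks": dict mapping a block number to its new contents,
        "committed": True once the first commit has been written,
        "released": True once the second commit has freed the journal space.
    disk: dict mapping a block number to its contents, modified in place.
    """
    discarded = []
    for txn in journal:
        if not txn["committed"]:
            # Never reached the first commit: the old version stays valid.
            discarded.append(txn)
        elif not txn["released"]:
            # First commit only: finish copying the new version over the old.
            disk.update(txn["blocks"])
            txn["released"] = True
    # Every surviving transaction is now on disk, so the journal is empty.
    journal.clear()
    return discarded


disk = {1: "new1", 2: "old2", 3: "old3"}
journal = [
    {"blocks": {1: "new1"}, "committed": True, "released": True},
    {"blocks": {2: "new2"}, "committed": True, "released": False},
    {"blocks": {3: "new3"}, "committed": False, "released": False},
]
discarded = recover(journal, disk)
```

<p>In this made-up example, block 2 is brought up to date because its
transaction had passed the first commit, while block 3 keeps its old
contents because its transaction was never committed.</p>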
<p>Since the journal is typically 100 times smaller than the
filesystem, recovery is very fast (it's the difference between
30 seconds and an hour).</p>
<p>You will notice that, at the end of the process, each piece
of metadata is written twice (and read once, but memory buffers
are nevertheless useful in this instance). In the case of
metadata, that is not a serious issue: metadata is small,
so it is the time to move the disk heads [i.e. seek latency]
that matters. Since everything works asynchronously
between two commits, the kernel optimizes this sort of thing
very well. That's why a journaling filesystem is (nearly) as
fast as an FFS with softupdates. In fact, it can be shown
that softupdates remains faster so long as the system has a
good amount of memory, but the reverse applies when it swaps
heavily. In any case, the difference in speed is very small, smaller
than that between ext2 and ffs/softupdates.</p>
<p>On the other hand, ext3 in its present form (0.0.2d) is
also a journaling filesystem. In this instance, the problem
of double writes is noticeable [ext3 journals both data and
metadata. N.o.T.], and solving it would
halve the time needed to write a file. This problem
will be solved in a later version (0.0.4 in theory -- in fact,
the code already exists, but has not been sufficiently tested
to be activated with reasonable safety). There are various
safety issues to be taken into account. When an uncommitted
transaction is discarded, problems may arise if its
blocks have already begun to fill with data from another file:
it would be rather awkward to find pieces of a privileged file
inside another file. Stephen Tweedie (the developer of ext3) says
that he has thought about this, and that the necessary framework
has already been put in place.</p>
<p>There are other journaling filesystems, apart from ext3.
Linux has ReiserFS, whose latest version includes a journaling
layer handling only metadata. Reiser, its author, has been
heard ranting about a new form of super-journaling which cleans
all this up. This super-journaling will be present in the next
version of ReiserFS. Apart from those, at least two other
operating systems have had journaling filesystems in their
"production" versions for some time: Tru64 (the former OSF,
Digital/Compaq's Unix for Alpha) has advfs, which works rather
well, and Windows NT has ntfs. The latter has been present for at
least five years and is really robust [fortunately, since
NT has a tendency to crash often - N.o.A.]. Furthermore, SGI
is porting its journaling filesystem (XFS) to Linux, and it is
beginning to distribute the code under the GPL; IBM has also joined
the party, with its JFS (which comes from AIX).</p>
<p>Journaling makes it possible to attain points 1 to 4.
Moreover, ext3 remains compatible with ext2: an ext3
filesystem can be unmounted and then remounted as ext2, which
works seamlessly. In my opinion, ext3 will be superior to
softupdates when pure metadata journaling has been implemented,
unless "background fsck" has been set up for softupdates [it
actually is, under FreeBSD 5.0-CURRENT. N.o.T.]. There
might be other factors that can make a difference, though.
For example, ext2/3 is simpler and requires less CPU and code
in order to run; but ffs has a better directory structure
(binary tree instead of a linear list), which speeds up write
access to directories containing a large number of files (e.g.
a traditional news spool). Even in this instance, the OS plays
an important role, Linux having a tendency to smooth over
certain difficulties thanks to dcache [Linux's VFS layer
maintains a cache of currently active and recently used names.
This cache is referred to as the dcache. N.o.T.].</p>
<p>Journaling is also an elegant means of not losing one's
metadata. I am very fond of its transactional nature. Moreover,
I have been using ext3 for all my partitions
(except /tmp) for several months, without any problems. In
this case, too, <i class="EMPHASIS">your mileage may vary</i>.</p>
</div>
</div>
<div class="NAVFOOTER">
<hr align="LEFT" width="100%">
<table summary="Footer navigation table" width="100%" border=
"0" cellpadding="0" cellspacing="0">
<tr>
<td width="33%" align="left" valign="top"><a href=
"x30.html" accesskey="P">Previous</a></td>
<td width="34%" align="center" valign="top"><a href=
"index.html" accesskey="H">Summary</a></td>
<td width="33%" align="right" valign="top"><a href=
"x95.html" accesskey="N">Next</a></td>
</tr>
<tr>
<td width="33%" align="left" valign="top">Write Caching
and reboot</td>
<td width="34%" align="center" valign="top"> </td>
<td width="33%" align="right" valign="top">Other Questions</td>
</tr>
</table>
</div>
</body>
</html>
