From: Salvo Bartolotta
To: freebsd-gnats-submit@FreeBSD.org, 3d@FreeBSD.org
Subject: Re: docs/30008: This document should be translated, commented and added
Date: Sat, 25 May 2002 17:29:10 +0200 (CEST)

The following reply was made to PR docs/30008; it has been noted by GNATS.

Dear FreeBSD doc'ers,

I've translated the central part (i.e. part III) of the document. This draft, which I submit for your review/comments/flames/whatever, will (hopefully) give you the gist of Pornin's article.

Although I have benefited from a number of effective suggestions from Giorgos (very kind and helpful, as always), I am nevertheless fully to blame for anything wrong/queer/inconsistent. Shame on me (if any :-)

3. Advanced Fault Tolerant Methods

Let us specify, incidentally, what the mechanics can ensure: each write of a sector (512 bytes) is atomic, i.e. once it has been started, it is completed even though the power goes down, the kernel crashes and the processor catches fire.

3.1. Deferred ordered write

First of all, people proposed the "deferred ordered write": metadata updates are asynchronous, but they are performed in the correct order. That is, the system quickly returns control to applications, saying "ok, everything is all right, the writes have been carried out", but it performs the writes in the background at disk speed, paying attention to their order. For example, the creation of numerous files in the same directory actually involves numerous updates of the same portion of the disk, and the system can group them together and carry them out in a single access. Yet metadata updates are ordered, that is, there are dependencies between the various updates: when a file is created in a directory A and then another file is created in the same directory, the second operation must take place at the same time as the first one or after it, and certainly not before.

Deferred ordered writes pose the following problem: it is easy to create cyclic dependencies, which block the system or else require a non-atomic update, and so a crash at the "wrong" moment puts us in a delicate position. This is rare, but Murphy arranges for it to happen. Typical example: I move a file from directory A to directory B, and, almost simultaneously, I move a file from directory B to directory A.
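The ordering constraint and the cyclic case can be illustrated with a minimal Python sketch. It is only a model of "must be written after" relations: the update names and the dependency maps are hypothetical, and treating the two crossed moves as waiting on each other is a modelling assumption, not a description of any real kernel. It uses the standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

def write_order(deps):
    """Return an order in which the updates may reach the disk,
    or None if the dependencies are cyclic.

    deps maps each update to the set of updates that must be
    written before it (hypothetical names, for illustration only).
    """
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None

# Two files created one after the other in directory A: a simple chain.
chain = {
    "create A/f1": set(),
    "create A/f2": {"create A/f1"},
}

# Assumption: each of the two crossed moves waits for the other one.
crossed = {
    "move A/x to B": {"move B/y to A"},
    "move B/y to A": {"move A/x to B"},
}

chain_order = write_order(chain)      # a valid order exists
crossed_order = write_order(crossed)  # no valid order: the graph has a cycle
```

In this model the chain can always be flushed in order, whereas the crossed moves have no valid order at all, which is the situation that forces either a block or a non-atomic update.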

3.2. Softupdates

To pull this off, people developed "softupdates". This is derived from a paper by Ganger and Patt (from the University of Michigan). The *BSD implementation comes from a certain Mr. McKusick (a key player in the original BSD project). As far as I have understood, the work was reportedly sponsored by Sun (which is interested in including it in Solaris), and an agreement was reportedly reached: once the code has been debugged, it will pass to the BSD license, which is not yet the case at present. Once the license has changed, FreeBSD will include the code in its kernel by default; currently, it is necessary to recompile the kernel in order to get softupdates. I don't know what NetBSD and OpenBSD will do. Probably the same.

The principle of softupdates consists in maintaining a two-stage wait queue: updates first arrive in a wait buffer, and then they pass, one by one, to a second buffer, where dependencies are checked. If an update completes a dependency loop, it is sent back to the wait buffer to wait for better times; the rest of the cycle passes to a list with higher update priority. This algorithm is similar to what CVS does in order to merge various modifications of the same file.
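As a rough illustration, the two-stage scheme described in the paragraph above can be modelled in a few lines of Python. This is a toy model of that description only, not of the actual softupdates code: the function names, the (name, dependencies) representation and the single-pass behaviour are all assumptions made for the sketch.

```python
from collections import deque

def loop_members(staged, name, deps):
    """Return the staged updates that form a loop with `name`,
    or None if staging `name` closes no dependency loop."""
    def walk(current, path):
        if current == name:
            return path                      # we are back at the new update
        if current in path or current not in staged:
            return None
        for nxt in staged[current]:
            found = walk(nxt, path + [current])
            if found is not None:
                return found
        return None

    for dep in deps:
        found = walk(dep, [])
        if found is not None:
            return found
    return None

def stage(incoming):
    """Pass updates one by one from the wait buffer to the second buffer.

    incoming is a list of (name, deps) pairs (hypothetical format).
    Returns (staged, priority, deferred).
    """
    wait = deque(incoming)
    staged, priority, deferred = {}, [], []
    while wait:
        name, deps = wait.popleft()
        members = loop_members(staged, name, deps)
        if members is None:
            staged[name] = set(deps)         # no loop: accept the update
        else:
            deferred.append(name)            # sent back, to be retried later
            for member in members:           # rest of the cycle gets priority
                staged.pop(member, None)
                priority.append(member)
    return staged, priority, deferred

staged, priority, deferred = stage(
    [("a", {"b"}), ("b", {"a"}), ("c", set())]
)
```

Here update "b" closes a loop with "a", so "b" is deferred and "a" moves to the priority list, while the unrelated update "c" stays in the second buffer.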

In fact, softupdates entails this:

  • Good filesystem performance, even in the decompression of numerous small files. My benchmarks show that decompressing the sources of an egcs takes 10% more time than ext2, on the same machine and on the same portion of the disk -- your mileage may vary, as the 'Mericans say, with your hardware.

  • Excellent crash tolerance. I would even say that fsck is guaranteed to recover everything by itself, unless the crash is due to the disk itself (in which case whatever it does before stopping is immaterial; however, no filesystem can tolerate that).

  • Specifically, the FFS implementation ensures upward compatibility: the filesystem is unmounted, then it is remounted without softupdates, and this works perfectly. That's a painless upgrade.

  • Fsck takes a long time, since it has to traverse the entire filesystem.

  • When a file is deleted, its space is not immediately freed for reuse; this can take as much as 30 seconds. The reason is that the wait buffer is unordered, so the system does not search the information it contains when seeking free blocks for its files; when a file is deleted, its blocks can thus be reallocated only when the [related] update reaches the second-level buffer. In practice, it is not a big deal, but you might run into trouble when you do a "make world" (recompilation and reinstallation of the base system, on every good BSD system): since the whole system is reinstalled in a short time, the binaries in /bin, in particular, are deleted, and new ones are immediately placed there again, which produces "frictional occupation" [Cf. "frictional unemployment": here English parallels French. N.o.T.]. If the saturation point is reached, it means trouble. I myself have run into this case; it is not fiction. This is typical of systems with a small / partition, since it is separate from /var, /usr, and /tmp.
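The delayed release of blocks mentioned in the last point can be illustrated with a small Python sketch. The class, its method names and the use of plain numbers as timestamps are hypothetical; only the 30-second figure comes from the text. It shows how a small partition can report "full" although the space has just been deleted.

```python
class ToyAllocator:
    """Free-block pool in which deleted blocks become reusable
    only after a delay (a toy model, not real filesystem code)."""

    def __init__(self, total_blocks, delay=30):
        self.free = set(range(total_blocks))
        self.pending = []      # (release_time, block) pairs not yet reusable
        self.delay = delay

    def allocate(self, count, now):
        """Take `count` free blocks at time `now`, or fail if too few are free."""
        self._release(now)
        if count > len(self.free):
            raise OSError("no space left on device")
        return [self.free.pop() for _ in range(count)]

    def delete(self, blocks, now):
        """Schedule the blocks for release; they are not free yet."""
        self.pending.extend((now + self.delay, block) for block in blocks)

    def _release(self, now):
        """Return to the free pool every block whose delay has expired."""
        still_pending = []
        for release_time, block in self.pending:
            if release_time <= now:
                self.free.add(block)
            else:
                still_pending.append((release_time, block))
        self.pending = still_pending
```

With a 10-block partition, allocating 8 blocks, deleting them, and immediately asking for 8 new ones fails in this model; the same request succeeds once the delay has elapsed, which mirrors the "make world" trouble on a small / partition.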

Let us note that in the case of fsck it is theoretically possible to accelerate recovery significantly. Essentially, it would be a matter of performing updates in such a way that, in case of crash, the only inconvenience consisted in missing blocks, that is, blocks that had not come back to the free blocks spool yet, albeit not referenced elsewhere in the filesystem. In this case, the filesystem could be reutilized immediately, and fsck could be run in the background. Here is what would fulfil point 3. This possibility has been suggested, I do not know whether it will be carried out, but it clearly should [The feature is already implemented in FreeBSD 5-CURRENT. N.o.T.].
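The "missing blocks" idea can be sketched as follows, under strong simplifying assumptions: the allocation map is a plain set of block numbers and each file is just a list of blocks, both hypothetical representations. A real fsck checks far more than this; the sketch only shows why such blocks are harmless and can be reclaimed while the filesystem is already in use.

```python
def leaked_blocks(allocated, files):
    """Return the blocks marked as allocated that no file references.

    allocated: set of block numbers marked "in use" on disk.
    files: mapping of file name to the list of blocks it references.
    """
    referenced = set()
    for blocks in files.values():
        referenced.update(blocks)
    return allocated - referenced

# After a crash, blocks 3 and 5 are still marked as used but belong to no file.
allocated = {1, 2, 3, 4, 5}
files = {"a": [1, 2], "b": [4]}
leaked = leaked_blocks(allocated, files)
```

Since nothing points to the leaked blocks, leaving them allocated for a while wastes space but cannot corrupt any file, so returning them to the free pool can wait for a background pass.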

As a whole, softupdates is a fine mechanism, elegant and effective. Cf. http://www.ece.cmu.edu/~ganger/papers/CSE-TR-254-95/

3.3. Log-structured filesystems

There are also "log-structured filesystems". The idea is simple: all writes (data and metadata) are done in an uninterrupted flow of operations. Effort is shifted onto reading, since finding a piece of data may be rather complicated in such a scheme. Actually, it is necessary to "garbage-collect" the flow of operations (the log) retrospectively in order to find the requisite information. There exist some more or less prototypal implementations for BSD and Linux, named LFS (cf. http://collective.cpoint.net/prof/lfs/ for Linux). In a filesystem, reads are usually more frequent than writes. This is not the case for what lives in /var/log, where LFS can be practical. Nevertheless, the use of LFS is marginal.
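A minimal Python sketch of the idea, assuming a log of (key, value) records held in memory: writes are plain appends, reads have to search the log backwards, and a compaction pass plays the role of the garbage collection. The class and method names are hypothetical, and real log-structured filesystems keep index structures that this toy omits.

```python
class ToyLog:
    """Append-only store: cheap writes, more expensive reads."""

    def __init__(self):
        self.log = []                    # sequence of (key, value) records

    def write(self, key, value):
        """Every write, whatever its target, is appended to the log."""
        self.log.append((key, value))

    def read(self, key):
        """Scan backwards to find the most recent version of `key`."""
        for k, v in reversed(self.log):
            if k == key:
                return v
        raise KeyError(key)

    def compact(self):
        """Garbage-collect the log: keep only the latest record per key."""
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = list(latest.items())

store = ToyLog()
store.write("inode 7", "v1")
store.write("inode 9", "v1")
store.write("inode 7", "v2")             # supersedes the first record
```

The write path never seeks, which suits write-heavy areas such as /var/log, while the read path pays for it by searching through superseded records until compaction removes them.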

3.4. Journaling

Finally, there is journaling. Journaling is, as it were, "transactional": when the system wants to make a series of updates, it builds a new version of the related metadata in a different place in the filesystem; then, when this new version (called "transaction", a concept connected with databases) is ready, it switches to the new version in one "atomic" operation [atomic relates to "atomos", a Greek word meaning "indivisible". Here it indicates that the system switches to the new version (when it is ready) in one single operation, therefore preventing any possible data corruption or "intermediate" states. N.o.T.]. Thus the filesystem is always in a consistent state.

To be more precise [warning: several technical details follow. N.o.T.]: when metadata updates need to be performed, the new version is built in a particular region of the disk, namely the journal.

Incidentally, in ext3, the journal is a file like any other, referenced by a special superblock field. The final version of ext3 will automatically create the journal if it is not present, and will not show it in the filesystem, which will prevent its accidental deletion.

The preparation of the new version entails the inclusion of all the requisite items; in particular, if there are any circular dependencies, the whole cycle is within. Once the new version is ready, a commit operation is performed: the "good" sector is modified so as to point to the new version instead of the old one. Next, the new version is copied over the old one, and a second commit is performed to free its place in the journal.

As a side note, you could simply consider marking the journal modified, but this would fragment it too much; since it is always used, this is not desirable at all.

In case of accidental crash, recovery is necessary, which consists in traversing the journal in order to:

  • discard transactions not yet finished.

  • finish copying transactions for which the first commit, but not the second, has been performed.

Since the journal is typically 100 times smaller than the filesystem, recovery is very fast (it's the difference between 30 seconds and an hour).
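The two-commit cycle and the recovery pass can be sketched in Python as follows. This is a toy model under stated assumptions: the "disk" is a dictionary, a transaction is a dictionary with a state field, and the names begin, commit, checkpoint and recover are hypothetical labels, not the interface of ext3 or of any other filesystem.

```python
class ToyJournal:
    """Metadata store with a journal and a two-commit protocol."""

    def __init__(self):
        self.home = {}       # metadata at its normal on-disk location
        self.journal = []    # transactions still occupying journal space

    def begin(self, updates):
        """Build the new version of the metadata inside the journal."""
        txn = {"updates": dict(updates), "state": "open"}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """First commit: the new version becomes the valid one."""
        txn["state"] = "committed"

    def checkpoint(self, txn):
        """Copy the new version over the old one, then perform the
        second commit, which frees the space in the journal."""
        self.home.update(txn["updates"])
        self.journal = [t for t in self.journal if t is not txn]

    def recover(self):
        """After a crash: finish committed transactions, discard the rest."""
        for txn in list(self.journal):
            if txn["state"] == "committed":
                self.checkpoint(txn)
            else:
                self.journal = [t for t in self.journal if t is not txn]
```

Recovery only looks at the journal, never at the whole filesystem, which is why its cost depends on the size of the journal rather than on the size of the disk.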

You will notice that, at the end of the process, each piece of metadata is written twice (and read once, but memory buffers are nevertheless useful in this instance). In the case of metadata, that is not a serious issue, since metadata is small, so it is the time to move the disk heads [i.e. seek latency] that is important. Since everything works asynchronously between two commits, the kernel optimizes this sort of thing very well. That's why a journaling filesystem is (nearly) as fast as an FFS with softupdates (in fact, it can be shown that softupdates remains faster so long as the system has a good amount of memory, but the reverse applies when it swaps heavily), the difference in speed being very small, smaller than that between ext2 and ffs/softupdates.

On the other hand, ext3 in its present form (0.0.2d) is also a journaling filesystem. In this instance, the problem of double writes is noticeable [ext3 journals both data and metadata. N.o.T.], and actually its solution means reducing by half the time to write a file. This problem will be solved in a later version (0.0.4 in theory -- in fact, the code already exists, but has not been sufficiently tested to be activated with reasonable safety). There are various safety issues to be taken into account. In rejecting a transaction not yet committed, problems may arise if the blocks have already begun to fill with data from another file. It is rather difficult to recover pieces of a privileged file within another. Stephen Tweedie (the developer of ext3) says that he has thought about this, and that the necessary framework has already been put in place.

There are other journaling filesystems, apart from ext3. Linux has ReiserFS, whose latest version includes a journaling layer handling only metadata. Reiser, its author, has been heard ranting about a new form of super-journaling which cleans all this up. This super-journaling will be present in the next version of ReiserFS. Apart from those, at least two other operating systems have had journaling filesystems in their "production" versions for some time: Tru64 (the former OSF, Digital/Compaq's Unix for Alpha) has advfs, which works rather well, and Windows NT has ntfs. The latter has been present for at least five years and is really robust [fortunately, since NT has a tendency to crash often - N.o.A.]. Furthermore, SGI is porting its journaling filesystem (XFS) to Linux, and it is beginning to distribute the code under GPL; IBM has also joined the party, with its JFS (which comes from AIX).

Journaling makes it possible to attain points 1 to 4. On the other hand, ext3 remains compatible with ext2: an ext3 filesystem can be unmounted and then remounted as ext2; which works seamlessly. In my opinion, ext3 will be superior to softupdates when pure metadata journaling has been implemented, unless "background fsck" has been set up for softupdates [it actually is, under FreeBSD 5.0 -CURRENT. N.o.T.]. There might be other factors that can make a difference, though. For example, ext2/3 is simpler and requires less CPU and code in order to run; but ffs has a better directory structure (binary tree instead of a linear list), which speeds up write access to directories containing a large number of files (e.g. a traditional news spool). Even in this instance, the OS plays an important role, Linux having a tendency to smooth over certain difficulties thanks to dcache [Linux's VFS layer maintains a cache of currently active and recently used names. This cache is referred to as the dcache. N.o.T.].

Journaling is also an elegant means of not losing one's metadata. I am very fond of the transactional features. Moreover, I have been using ext3 for all my partitions (except /tmp) for several months, without any problems. In this case, too, your mileage may vary.
