FreeBSD Mail Archives

Date:      Fri, 27 Sep 2002 11:31:08 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Alexander Leidinger <Alexander@Leidinger.net>
Cc:        freebsd-current@FreeBSD.ORG
Subject:   Re: Journaled filesystem in CURRENT
Message-ID:  <3D94A3EC.160D5D71@mindspring.com>
References:  <200209251319.g8PDJYoD047918@ib.com.ua> <20020925111232.B3686@Odin.AC.HMC.Edu> <20020926111949.5c0da160.Alexander@Leidinger.net> <20020926090325.A24614@zardoc.esmtp.org> <3D93459B.E4405568@mindspring.com> <20020926210947.5d5fdd45.Alexander@Leidinger.net> <3D9362C1.CFA66F90@mindspring.com> <20020927114503.7c839b9b.Alexander@Leidinger.net>

Alexander Leidinger wrote:
> > > Sorry, I don't get it. Can you please be more verbose?
> >
> > This has been discussed to death before, and Kirk McKusick has
> > already posted the definitive post on the topic to FreeBSD-FS.
> 
> Keywords (besides SO and Kirk McKusick)/timeframe/message ID/URL?

"McKusick AND fsck" finds the following in the FreeBSD-arch
archives:

<http://www.FreeBSD.org/cgi/getmsg.cgi?fetch=278083+282035+/usr/local/www/db/text/2001/freebsd-arch/20010401.freebsd-arch>

> > Note that recent disk drives (I *will not* call them "modern")
> > will potentially trash sectors, if a power failure occurs during
> > writes.
> 
> They don't have a power reservoir large enough to write the entire
> content of their cache to disk? Damn. But I shouldn't wonder, the actual
> economy is the result of letting marketing people make decisions.

No, they do not.  At one time Quantum manufactured a 7200 RPM drive
that could do a write/seek/write, using the rotational energy of
the disk, but they quit manufacturing these.  The main reason was that
"multimedia" disk drives valued the ability to store quickly over
the ability to store correctly (e.g. no thermal recalibration, etc.).

> > One way to handle Scott Dodson's problem (for example) is to add
> > a "softcheck started" flag in the superblock, so that if a crash
> > occurs during the abbreviated check, then the full check is done
> 
> I asked Kirk a while ago what happens if we have a power failure while
> we do a bg-fsck. He told me that this isn't harmful, the actual code
> DTRT.

I'm aware of this posting as well.

The issue here is the answer to the question "What happens
when I fail in such a way as to need a full fsck?".

If your automated default is to do a background fsck, your
kernel will potentially panic as a result of you running on
the FS with only a background fsck in progress (the panic
which could occur would occur because of normal FS operations
in progress, not as a result of the background fsck operation
itself).

After the panic, if you are in a "background fsck mode", you
come up and you panic again, for the same reason, because the
underlying condition of the FS is not related to relatively
harmless overallocations in the cylinder group bitmap, which
is the base assumption the background fsck makes.

Thus, to "correct" Scott's problem, you need to mark the start
and end of a background fsck cycle, such that if there is a
failure in the middle of a background fsck, when the system is
rebooted, the failure is dealt with via a full non-background
fsck, if the disk is in the state "background fsck started but
not yet completed".

To deal with intermittent power outages as the source of
failure, you could, as a tunable parameter, set a count of the
number of times a fatal failure must occur before a background
fsck is no longer an option (e.g. 3 failures during a background
fsck, and you go to a foreground fsck).

I have quoted "correct" here, because it's really a workaround,
not a fix, for the underlying problem.
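
The start/end marking and the failure count described above could be
sketched roughly as follows.  This is an illustration only: the
structure, its field names, and both functions are hypothetical and do
not correspond to real UFS superblock fields or to existing fsck code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical on-disk state; these are NOT real UFS superblock fields. */
struct sb_fsck_state {
    bool     fs_clean;        /* filesystem was unmounted cleanly */
    bool     bg_in_progress;  /* set when a background fsck starts,
                                 cleared only when it completes */
    unsigned bg_failures;     /* crashes seen while bg_in_progress was set */
};

enum fsck_mode { FSCK_NONE, FSCK_BACKGROUND, FSCK_FOREGROUND };

/*
 * Decide at boot how to check the filesystem.  max_bg_failures is the
 * tunable: 1 forces a full check after the first interrupted background
 * run, 3 tolerates two interrupted runs first.
 */
enum fsck_mode
choose_fsck_mode(struct sb_fsck_state *sb, unsigned max_bg_failures)
{
    assert(sb != NULL);

    if (sb->fs_clean)
        return FSCK_NONE;

    if (sb->bg_in_progress) {
        /* We went down while a background fsck was still running. */
        sb->bg_failures++;
        if (sb->bg_failures >= max_bg_failures)
            return FSCK_FOREGROUND;
    }

    /* Mark the start of the cycle before normal operation resumes. */
    sb->bg_in_progress = true;
    return FSCK_BACKGROUND;
}

/* Mark the end of the cycle once the background fsck has completed. */
void
bg_fsck_done(struct sb_fsck_state *sb)
{
    assert(sb != NULL);
    sb->bg_in_progress = false;
    sb->bg_failures = 0;
}
```

With max_bg_failures set to 3, the third consecutive boot that finds
the "started but not completed" mark falls back to a foreground fsck.
A real implementation would also have to write the flag to disk
synchronously before continuing, and clear it after a successful
foreground check.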

> > The write occurs, or it does not.  The journal entry timestamp
> > gets updated after the write completes, or it does not.
> >
> > Thus, you can always recover a JFS to a consistent state almost
> > instantaneously, simply by finding the most recent valid journal
> > entry timestamp, and ignoring anything else -- as long as data is
> > journalled, and not just metadata.
> 
> I'm with Matthias Schündehütte here. SO writes the data and then it
> writes the metadata. So either the just written blocks get referenced by
> metadata or it does not. So we can recover to a consistent state almost
> instantaneously too.

Recovering to a consistent state is uninteresting.

Let me explain: you do not want to recover to a consistent state,
per se, you want to recover to *the* consistent state that the FS
*would have been in*, had the failure not occurred, and the
operations which can be rolled forward *had been successful*, and
the operations which can not be rolled forward *had not been
attempted*.
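
As a rough sketch of that rule applied to a journal: entries that were
fully committed, in order, are rolled forward, and the first incomplete
entry and everything after it is treated as never attempted.  The
record layout and the function name below are hypothetical, chosen for
illustration; they are not taken from any real journaling filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical journal record; not the layout of any real filesystem. */
struct jentry {
    unsigned long seq;        /* position in the ordered operation sequence */
    bool          committed;  /* set only after the entry's write completed */
};

/*
 * Return how many entries, counted from the start of the journal, can
 * be rolled forward.  Entries are assumed to be stored in issue order.
 * The scan stops at the first entry that is uncommitted or out of
 * sequence; that entry and all later ones are ignored.
 */
size_t
replayable_prefix(const struct jentry *j, size_t n)
{
    size_t i;

    assert(j != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (!j[i].committed)
            break;
        if (i > 0 && j[i].seq != j[i - 1].seq + 1)
            break;
    }
    return i;
}
```

Because the scan only ever accepts an unbroken, ordered prefix, there
is exactly one state it can recover to, which is the point of the
argument: later entries that happen to be intact are still discarded.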

This is the same argument against recovery of an async mounted FS,
following a crash: the number of outstanding operations minus one
is the number of potential operations that left the disk in its
current end state.  Thus, if operations are not ordered, the number
of potential start states grows exponentially.

For example, say I had N related operations in progress; therefore,
the number of consistent states that could have led to the state of
the disk at the time the recovery is attempted is (2^(N-1)).  For
ordered operations, N is always 1, and the result is always (2^0),
or (1) -- therefore it is always possible to recover to *the*
consistent state, rather than *a* consistent state.
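
To put numbers on the 2^(N-1) figure above, here is a small sketch
(the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of candidate prior states for n related operations in
 * progress, using the 2^(n-1) count from the text.  n must be in
 * the range 1..64 so the result fits in 64 bits.
 */
uint64_t
candidate_states(unsigned n)
{
    assert(n >= 1 && n <= 64);
    return (uint64_t)1 << (n - 1);
}
```

For ordered operations (N = 1) this gives 1.  For 4 unordered
operations it gives 8, and for 11 it gives 1024, which is why
recovery of unordered writes cannot pick out *the* state.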

Only recovering to *a* consistent state loses implied metadata
(e.g. related updates to record and index files in a relational
database, etc.).  This is unacceptable.

> The only problem is: when you delete some files,
> and the metadata (directory entries) is written, but the free blocks
> information isn't updated yet. Then you have to use (bg-)fsck to correct
> the free block information. But if you need to go online as fast as
> possible with a consistent FS SO doesn't hold you back from this.

It's more about having a *correct* FS, than a *consistent* FS.
An *empty* FS is consistent, after all, so the fastest recovery
possible would be to newfs the thing, right?  8-) 8-).

-- Terry
