From owner-freebsd-fs Tue Mar 18 14:43:57 2003
Message-ID: <3E77A0C0.B1BEA2D7@mindspring.com>
Date: Tue, 18 Mar 2003 14:42:08 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Brad Knowles
Cc: Bakul Shah, Julian Elischer, FreeBSD current users, fs@FreeBSD.ORG
Subject: Re: Anyone working on fsck?
References: <200303180459.XAA15021@goliath.cnchost.com> <3E76C15F.D3C7F9A7@mindspring.com>
Sender: owner-freebsd-fs@FreeBSD.ORG

Brad Knowles wrote:
> At 10:49 PM -0800 2003/03/17, Terry Lambert wrote:
> > Yes, I know. I'm aware. He has a lot of data to transfer in
> > from the disk in order to do the reverse lookup, with that much
> > data on the disk.
>
> I'm confused. In situations where you need to do reverse
> lookups, don't you normally tend to create doubly-linked lists, or
> other structures that you can traverse quickly and efficiently?

Yes and no.

"Yes", if that usage is typical. "No", if it's so extraordinary
that it's to the point of "infinitesimally small". This is a case
of optimizing the success code path at the expense of the failure
code path, and that is not only accepted practice, it's good
practice.

The specific issue here is that the CG bitmaps normally exist to
give layout hints to the block allocator, which would otherwise
need a huge amount of traversal in order to find free blocks on
its own. They exist, in other words, to find free blocks fast.

The problem happens when the data in a bitmap is stale; but it can
only be stale in one particular way: when a block is freed back to
the FS, the last operation which occurs, in the ordered update of
the metadata, is marking it free in the CG bitmap itself. So the
failure case from which recovery proceeds is "a failure to mark a
block which is otherwise unallocated as unallocated in the CG
bitmap".

What this boils down to is looking for entries your table
*shouldn't* contain: after recovery from a catastrophic failure,
the only thing that can be out of date is an entry in the table
that shouldn't be there. Meanwhile, when performing allocations,
you are looking for the 0 bits... the entries which are *not*
there.

Because of this, it's safe to go through and use the FS, and make
new allocations, since the table can only have erroneously set
bits... meaning the worst the allocator can do is skip over a
block that is actually free. You will *never* make the mistake of
allocating a block that's *already* allocated.
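(To belabor the point with a toy example: the following is a minimal
C sketch, with made-up names -- it is not the real FFS allocator --
using the convention that a set bit means "allocated". Since the
allocator only ever considers clear bits, a bit that is wrongly left
set after a crash can at worst hide a free block until fsck clears
it; it can never cause an in-use block to be handed out again.)

    #include <stdint.h>
    #include <stdio.h>

    #define NBLOCKS 1024

    static uint8_t cg_bitmap[NBLOCKS / 8];  /* 1 = allocated, 0 = free */

    static int
    bit_isset(int b)
    {
            return (cg_bitmap[b >> 3] & (1 << (b & 7)));
    }

    static void
    bit_set(int b)
    {
            cg_bitmap[b >> 3] |= (1 << (b & 7));
    }

    /*
     * Hand out the first block whose bit is clear, or -1 if none.
     * A stale "allocated" bit is simply skipped: it costs space
     * until fsck clears it, never correctness.
     */
    static int
    alloc_block(void)
    {
            int b;

            for (b = 0; b < NBLOCKS; b++) {
                    if (!bit_isset(b)) {
                            bit_set(b);
                            return (b);
                    }
            }
            return (-1);
    }

    int
    main(void)
    {
            bit_set(0);     /* pretend block 0's free was lost in a crash */
            printf("allocated block %d\n", alloc_block());  /* 1, not 0 */
            return (0);
    }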
So the only thing you care about on recovery, which can, in theory,
take as much time as necessary to do its job, is clearing the bits
that aren't valid. Meanwhile, new allocations are occurring all the
time, since the FS in question is "live".

So what you do is bound the problem set by establishing a
"snapshot" of the FS, which, itself, adds considerable overhead,
and then you allow any allocations you want to continue in the
live FS. This works because the bits you want to clear will never
be validly set by the live FS: you can clear them at any time, and
it will be valid to clear them in both the snapshot and in the live
FS, without risking the integrity of the live FS.

Another way of bounding the problem would be to "read lock" the
CGs whose bitmaps you are currently examining, and redirect any
new allocations which attempt to fall in those locked regions.
There are a couple of downsides to that approach, too, but it can
be made to work.

In any case, whether you are operating on a live FS or a snapshot,
or performing a foreground fsck on a quiescent/RO/unmounted FS,
the problem is the same: you must select a CG, cache a copy of it
in memory, create a "scratch" copy, and then examine all block
allocations, everywhere; wherever they indicate a block allocated
in the cylinder group in hand, clear the bit in the scratch copy.
The set of bits you are left with is the set of bits that are not
actually allocated.

At this point, you lock the live FS version of the CG (if you are
not already operating on a read-locked CG *in* a live FS, of
course), AND the real CG with the NOT of the scratch copy, and use
that to replace the old bitmap... thus clearing the "phantom
allocations". Then you go on to the next CG bitmap.

Make sense now?

> > He could change the code so that everything that wasn't in
> > use was zeroed. That would help, since then he could just create
> > a second CG map in an empty file by traversing all inodes and
> > indirect blocks for non-zero values, not caring if the indirect
> > blocks were referenced by inodes. Assuming indirect blocks and
> > data blocks were distinguishable.
> >
> > Like I said, there are a lot of FS changes that could help.
>
> I'm not sure I understand how this would be a filesystem change.
> Since this second CG map would be used only during fsck and not
> otherwise present on the system, couldn't you just throw it away when
> you were done, resulting in a filesystem that was structurally
> unchanged from before?

Newfs would have to do a heck of a lot more writing than before.
Freeing would have to write zeroed information. Data blocks that
might be used as indirect blocks would have to be zeroed. Etc..

Basically, you would eat the penalty as a small additional overhead
during normal operation, to reduce the cost of recovery by one
order, in case of failure. For a really, really high failure cost
at a given known probability of failure, this cost might amortize
out as cheaper, in terms of overall operational cost.

> > The *reason* it doesn't matter in the CG bitmap case, is that a
> > bitmap can only tell "allocated" vs. "unallocated"; there is no
> > third state that means "I haven't checked this bit yet". [ ... ]
>
> Seems to me that you could easily solve this with a second CG map
> (as you proposed above), that starts off initially zeroed for all
> blocks, and as you go through the "normal" CG maps, you turn on all
> the corresponding bits in the corresponding second map as you touch
> the blocks that are referenced.

No; it's more tricky than that; see my example above, sketched in
code below.
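(To make the example concrete: here is roughly what one pass over a
single cylinder group looks like. This is a sketch with hypothetical
names and flattened data structures, not the actual fsck_ffs code.)

    #include <stdint.h>
    #include <string.h>

    #define CG_NBLOCKS      8192            /* blocks per CG (example) */
    #define MAPBYTES        (CG_NBLOCKS / 8)

    struct cgmap {
            int64_t cg_start;               /* first block covered by this CG */
            uint8_t cg_bits[MAPBYTES];      /* 1 = allocated, 0 = free */
    };

    /*
     * One recovery pass over a single cylinder group.  next_ref() is a
     * stand-in for the traversal of every inode and indirect block in
     * the FS; it yields each referenced block number and returns 0 when
     * the traversal is exhausted.
     */
    static void
    check_one_cg(struct cgmap *cg, int (*next_ref)(int64_t *blkno))
    {
            uint8_t scratch[MAPBYTES];
            int64_t blkno;
            int i;

            /* The scratch copy starts as a copy of the CG map, not zeros. */
            memcpy(scratch, cg->cg_bits, MAPBYTES);

            /* Each block reference landing in this CG clears its scratch bit. */
            while (next_ref(&blkno)) {
                    int64_t b = blkno - cg->cg_start;

                    if (b >= 0 && b < CG_NBLOCKS)
                            scratch[b >> 3] &= ~(1 << (b & 7));
            }

            /*
             * Whatever is still set in scratch was marked allocated but is
             * referenced by nothing: the phantom allocations.  With the CG
             * locked, clear them by ANDing the live map with the complement
             * of the scratch copy.
             */
            for (i = 0; i < MAPBYTES; i++)
                    cg->cg_bits[i] &= ~scratch[i];
    }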
Your scratch copy has to start with all the bits set that are
already set in the CG bitmap, not with all zeros.

The reason for this is complicated, but what it boils down to is
that you need to be able to make additional allocations, and you
need to be able to tell new allocations -- ones which occurred as a
result of operations in files/areas you have already traversed in
the current pass -- apart from the stale bits you are trying to
clear. Otherwise, you can't just read-lock the CG bitmap, you must
write-lock it as well. Using the snapshot approach is one way of
avoiding this issue.

> This should be a single linear pass, and then you'd be done. If
> you had one I/O process and multiple worker processes actually
> scanning the CG maps and updating the second copy, this should just
> absolutely *fly*.
>
> What am I missing here?

The fact that for a 1TB FS, you don't have enough virtual address
space to hold all the CG bitmaps, all the scratch memory, and the
mapped buffers of the FS data you are traversing in your head at
once? 8-) 8-).

Even if you did, you'd end up paging, and you are actually worse
off doing that than you are doing multiple linear passes, after
breaking the CGs into blocks of CGs to check in linear runs.
That's why I suggested an MADV_WIRE for mapping memory to contain
the CG bitmaps you are interested in perusing per pass.

The reason you are worse off comes down to the fact that, on a
linear traversal of disk data which is so much larger than your
available cache, you will *never* have a case where the block you
are interested in looking at is already in cache.

One of the "sneaky" ways of dealing with this is to use an
"elevator" algorithm (see the sketch below): for one CG set, scan
forward through the FS; for the next, scan backward. This means
you will get cache effects, since the data you examine first will
be the data you just examined, until you run out of LRU, and then
you degrade to having to read all the data in.

An even more "sneaky" way would be to *know* your LRU list length,
and start a *forward* scan from there. That would be 2X as
effective, since you would get a 100% cache hit, at least for the
length of the LRU, until you, once again, end up in the degenerate
case. This is one of the reasons VMS had the concept of being able
to know both your "working set quota" and your "current working
set size".

Another "sneaky" approach, which I've already suggested, is to
reduce the size of the working set you care about, by increasing
the locality between "random" FS data and the CG bitmap, so that
any allocations you care about will be in a smaller portion of the
disk. This is automatic in the case of "make smaller partitions";
it requires a change in disk layout policy otherwise, and that
change could damage overall performance, in the same way "growfs"
does, by increasing regional fragmentation. Even so, there's no
way to keep it from breaking down for very large files which span
a (for lack of better terminology) "cylinder group group" boundary.
Even a "sliding group window" would only save you another 50% of
the time (assuming average locality is pinned to the inode
location).

It's a pretty intractable problem with the existing FFS structure;
you can get minor performance wins by holding multiple CGs in hand
to do the job (maybe all of them, if the FS is small enough, which
would make it one pass), but for a very large FS, you are pretty
much toast, unless you are willing to change the FS layout *some*.
I rather expect that UFS2 will end up changing layout, as people
start using it for very large filesystems.
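(For concreteness, the "elevator" scheduling mentioned above is
nothing more than alternating the direction of each batch's
traversal. A trivial sketch, again with hypothetical names; the
per-CG work would be what was sketched earlier:)

    /*
     * "Elevator" scheduling of the per-CG passes: walk the FS data
     * forward for one batch of CGs, backward for the next, so each new
     * pass starts by re-reading the blocks the previous pass just
     * finished with, and gets some cache hits before falling back to
     * cold reads.
     */
    static void
    check_cg_batch(int first_cg, int ncg, int scan_backward)
    {
            /* ... one linear traversal of the FS data, in the given
               direction, doing the per-CG scratch work for the CGs
               [first_cg, first_cg + ncg) ... */
            (void)first_cg; (void)ncg; (void)scan_backward;
    }

    static void
    check_all_cgs(int total_cgs, int cgs_per_pass)
    {
            int cg, backward = 0;

            for (cg = 0; cg < total_cgs; cg += cgs_per_pass) {
                    int n = total_cgs - cg;

                    if (n > cgs_per_pass)
                            n = cgs_per_pass;
                    check_cg_batch(cg, n, backward);
                    backward = !backward;   /* reverse direction each pass */
            }
    }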
Actually, thinking about it, you could bound it to a single non-CG
bitmap metadata traversal in all recovery situations... if you were
willing to change FS layout, in a probably compatible way.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message