Date:      Mon, 17 Mar 2003 22:49:03 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Bakul Shah <bakul@bitblocks.com>
Cc:        Julian Elischer <julian@elischer.org>, FreeBSD current users <current@FreeBSD.ORG>, fs@FreeBSD.ORG
Subject:   Re: Anyone working on fsck?
Message-ID:  <3E76C15F.D3C7F9A7@mindspring.com>
References:  <200303180459.XAA15021@goliath.cnchost.com>

Bakul Shah wrote:
> > Sorry, but the track-to-track seek latency optimizations you
> > are referring to are turned off, given the newfs defaults, and
> > have been for a very long time.
> 
> I was thinking of the basic idea of cylinder groups as good
> for normal load, not so good for fsck when you have too many
> CGs. I wasn't thinking of what fsck does or does not do.

Number of CG's is irrelevant, really.  You can read 10 of the
bitmaps at a time and precache them, or you can read 1000 of
them and precache them; the difference is going to be minimal.
The thing that takes the time is the forward traversal to do
the reverse lookup.

Julian has the right idea about precaching; I suggested it, too,
before reading his message, but after a certain level, there is
a point of diminishing returns.

Getting rid of cylinder groups just pushes the problem up a
level, it doesn't eliminate it.  No cylinder group bitmap to
be the last thing updated?  OK, then whatever becomes the
last thing updated instead becomes the reverse lookup
bottleneck.


> > Basically, the problem is that for a BG fsck, it's not possible
> > to lock access on a cylinder group basis, and then there is a hell
> > of a lot of data on that drive.
> 
> Note that Julian said 6 hours to fsck a TB in the normal
> foreground mode.

Yes, I know.  I'm aware.  He has a lot of data to transfer in
from the disk in order to do the reverse lookup, with that much
data on the disk.

He could change the code so that everything that wasn't in
use was zeroed.  That would help, since then he could just create
a second CG map in an empty file by traversing all inodes and
indirect blocks for non-zero values, not caring if the indirect
blocks were referenced by inodes.  Assuming indirect blocks and
data blocks were distinguishable.
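
A minimal sketch of what that scan might look like, assuming
zero-filled free blocks.  The names and geometry here (shadow_map,
NDADDR, 1 TB of 8 KB blocks) are made up for illustration; they are
not the real UFS structures:

    #include <stdint.h>

    #define NDADDR   12               /* direct block pointers per inode */
    #define NBLOCKS  (1ULL << 27)     /* e.g. 1 TB of 8 KB blocks */

    static uint8_t shadow_map[NBLOCKS / 8];   /* rebuilt "in use" bitmap */

    /*
     * Any non-zero block pointer is taken to mean "allocated"; that is
     * only safe if free blocks are guaranteed to be zeroed.
     */
    static void
    mark_allocated(uint64_t blkno)
    {
        if (blkno != 0 && blkno < NBLOCKS)
            shadow_map[blkno / 8] |= 1 << (blkno % 8);
    }

    /*
     * Walk one set of block pointers (from an inode or an indirect
     * block); it does not matter who, if anyone, owns them.
     */
    static void
    scan_block_pointers(const uint64_t *ptrs, int n)
    {
        for (int i = 0; i < n; i++)
            mark_allocated(ptrs[i]);
    }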

Like I said, there are a lot of FS changes that could help.


> > What Julian needs is a Journalling or Log-structured FS.
> 
> All that Julian wants is a faster fsck without mucking with
> the FS!

And I'd like to regrow all my teeth that have ever had dental
work done on them.

There are some things you can't have.  8-).


> While I agree with you that you do need a full consistency
> check, it is worth thinking about how one can avoid that
> whenever possible.  For example, if you can know where the
> disk head is at the time of crash (based on what blocks were
> being written) it should be possible to avoid a full check.

You can buy hardware that can do this.  Sun sold it in the 1980's,
and people have made NVRAM-backed caching disk controllers ever
since.


> > The easiest way to do this is to provide some sort of domain
> > control, so that you limit the files you have to check to a
> > smaller set of files per CG, so that you don't have to check
> > all the files to check a particular CG -- and then preload the
> > CG's for the sets of files you *do* have to check.
> 
> If you have to visit a CG (during fsck) you have already paid
> the cost of the seek and rotational latency.

No.

The cost of the fsck is *not* the reading of the cylinder group
bitmaps.  You can cache a fixed number of those.  The problem is
in not knowing which inodes and indirect blocks contain direct
and indirect block references to the cylinder groups you happen
to have in cache.

In other words, the cost is in enumerating all the allocated
blocks in all of the files.

The *reason* it doesn't matter in the CG bitmap case is that a
bitmap can only tell "allocated" vs. "unallocated"; there is no
third state that means "I haven't checked this bit yet".

So you can only load up however many bitmaps you are going to
check in a given pass (hopefully they all fit in memory, but if
they don't, fitting in the address space does you no good,
because they still aren't tri-state), and then iterate every
single bit reference in existence, and compare them to the ones
you have, and clear them if they aren't referenced by *anyone*.

In other words, this is *always* a cylinder group bitmap bit-major
operation (as in "row-major" vs. "column-major" arrays in C and
FORTRAN).
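
As a rough illustration only (invented names; real FFS bitmaps track
free fragments, so the polarity is reversed), one bit-major pass
looks something like this:

    #include <stdint.h>

    struct sweep {
        uint64_t lo, hi;      /* block range covered by cached bitmaps */
        uint8_t  *cg_alloc;   /* cached "allocated" state from the CGs */
        uint8_t  *seen;       /* scratch: "referenced by someone" bits */
    };

    /*
     * Called for every block pointer in every inode and indirect block
     * on the disk -- that full traversal is the expensive part.
     */
    static void
    note_reference(struct sweep *sw, uint64_t blkno)
    {
        if (blkno >= sw->lo && blkno < sw->hi) {
            uint64_t bit = blkno - sw->lo;
            sw->seen[bit / 8] |= 1 << (bit % 8);
        }
    }

    /*
     * Only after *all* files have been walked can a bit be cleared;
     * there is no third "not checked yet" state to fall back on.
     */
    static void
    reconcile(struct sweep *sw)
    {
        for (uint64_t bit = 0; bit < sw->hi - sw->lo; bit++) {
            uint8_t mask = 1 << (bit % 8);
            if ((sw->cg_alloc[bit / 8] & mask) &&
                !(sw->seen[bit / 8] & mask))
                sw->cg_alloc[bit / 8] &= ~mask;
        }
    }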


> Journalling wouldn't help here if you still have a zillion CGs.

Yes, it would.  If you had Journalling, you wouldn't have CG
bitmaps to worry about, no matter how many CG's you had.

The *only* thing you would have that you cared about is the
"operation started" and "operation complete" stamps on a journal
entry, and you'd *only* care whether the former was greater or
less than the latter.  And then you'd write a field of "last
valid journal entry flip-flop" into the older of the two flip-flop
fields with a journal reference and the current date stamp, and
you'd be done.

The only thing you'd care about was how many "last valid" flip-flops
you had, which would be totally unrelated to cylinder groups, because
it's a file-major idea (and you have to traverse the files anyway).
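
A sketch of that bookkeeping, with invented field names (this is not
any particular journalling FS's on-disk format):

    #include <stdint.h>

    struct journal_entry {
        uint64_t op_started;    /* stamped before the metadata change */
        uint64_t op_complete;   /* stamped once it is on stable storage */
        /* ... description of the change itself ... */
    };

    struct journal_anchor {
        struct {
            uint64_t entry_ref; /* last entry known to be complete */
            uint64_t stamp;     /* when this slot was written */
        } flipflop[2];          /* always overwrite the older slot */
    };

    /* An entry needs replay or rollback iff it started but never finished. */
    static int
    entry_incomplete(const struct journal_entry *je)
    {
        return (je->op_started > je->op_complete);
    }

    /*
     * Recovery trusts whichever slot has the newer stamp and walks
     * forward from the entry it names -- no CG bitmaps involved.
     */
    static int
    newer_slot(const struct journal_anchor *ja)
    {
        return (ja->flipflop[1].stamp > ja->flipflop[0].stamp);
    }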


> > Another way of saying this is... don't put all your space in a
> > single FS.  8-).
> 
> Or in effect treat each CG (or a group of CGs) as a self
> contained filesystem (for the purpose of physical allocation)
> and maintain explicit import/export lists for files that span
> them.

I already said that.  8-) 8-).


> > The problem with having to (effectively) read every inode and
> > direct block on the disk is really insurmountable, I think.
> 
> That is why I was suggesting putting them in one (or small
> number of) contiguous area(s).  On a modern ATA100 or better
> disk you can read a GB in under a minute.  Once the data is
> in-core you can divide up the checking to multiple
> processors.  This is sort of like distributed garbage
> collection: you only need to worry about graphs that cross a
> node boundary.  Most structures will be contained in one node.

There are a couple of FS's which already do this.  One of them
is Eric H. Herrin II and Raphael A. Finkel's "Viva" out of the
University of Kentucky.  That was a 1993 project.

Another is "Stride".

Actually, the BSD 4.0 FS did a similar thing, in keeping different
data types on different areas of the disk.  If you read the BSD
FFS paper, though, you'll see why they moved away from that.  8-).
One of the reasons is fragility, which modern disks *certainly*
do not address, no matter what other "advances" they have: lose
a single track on an IDE disk due to a power failure during a write,
and you lose all your data.


> Even for UFS it is probably worth dividing fsck in two or
> more processes, one doing IO, one or more doing computation.

?????? So adding more non-locality accesses to a system that is
slow as a result of non-locality accesses helps how?  8-) 8-).


> > ...AND... The problem is still that you must scan everything on
> > the disk (practically) to identify the inode or indirect block
> > that references the block on the cylinder roup in question, and
> > THAT's the problem.  If you knew a small set of CG's, that needed
> > checked, vs. "all of them", it would *still* bust the cache, which
> > is what takes all the time.
> 
> > Assume, on average, each file goes into indirect blocks.
> 
> On my machine the average file size is 21KB (averaged over
> 4,000,000 files).  Even with 8KB blocksize very few will have
> indirect blocks.  I don't know how typical my file size
> distribution is but I suspect the average case is probably
> smaller files (I store lots of datasheets, manuals,
> databases, PDFs, MP3s, cvs repositories, compressed tars of
> old stuff).
> 
> But in any case wouldn't going forward from inodes make more
> sense?  This is like a standard tracing garbage collection
> algorithm.  Blocks that are not reachable are free.  Even for
> a 1 TB system with 8K blocks you need 2^(40-13-3) == 16Mbytes
> bitmap or some multiple if you want more than 1 bit of state.

There are a lot of smart things that can be done... if you are
willing to change the FS.  Most of them have been done already,
at least 10 years ago.  8-) 8-).
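
(The quoted figure does check out, for what it's worth: 1 TB is 2^40
bytes and an 8 KB block is 2^13 bytes, so there are 2^27 blocks; at
one bit per block that is 2^24 bytes, i.e. 16 MB, and two bits of
state would double it to 32 MB.)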


> > The problem is reverse mapping a bit in a CG bitmap to the file
> > that references it... 8^p.
> 
> Why would you want to do that?!

That's exactly what you are doing, when you are trying to decide
to clear a bit in the CG.  The tristate problem.  You have to know
that it's *not* referenced.

For something that's referenced, on average, you have to look at
1/2 of your search set.  For something that's not, you have to
look at 100% of your search set.  That's why negative caching in
a directory name lookup cache is twice as valuable as positive
caching.  8-) 8-).
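
(Concretely: with a million entries searched linearly, a hit costs
about 500,000 comparisons on average, while proving a miss always
costs the full million.)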


> > Multithreading fsck would be an incredibly bad idea.
> 
> It depends on the actual algorithm.

Right.  For example, a "multithreaded algorithm"... ;^).


> > Personally, I recommend using a different FS, if you are going
> > to create a honking big FS as a single partition.  8-(.
> 
> There are other issues with smaller partitions.  I'd rather
> have a single logical file system where all the space can be
> used.  If physically it is implemented as a number of smaller
> systems that is okay.  Also note that now people can create
> big honking files with video streaming at the rate of 2GB per
> hour and even then you are compromising on quality a bit.

Sure.  I expected, given Julian's current line of work and
the size of the FS, that what we were talking about was honking
big files representing image captures of all documents in a large
document set.  This also made me think that the reason for the
huge partition was to be able to handle "single very large file"
instances, more than "a huge number of little files", so that he
couldn't break them up into reasonable sizes.

It was also the reason for the "smiley" on my suggestion that he
use smaller partition sizes.

-- Terry




