Date:      Tue, 12 Nov 2002 10:39:21 -0800
From:      Terry Lambert <tlambert2@mindspring.com>
To:        Tomas Pluskal <plusik@pohoda.cz>
Cc:        freebsd-fs@freebsd.org
Subject:   Re: seeking help to rewrite the msdos filesystem
Message-ID:  <3DD14AD9.DF8D3580@mindspring.com>
References:  <20021112134213.P32524-100000@localhost>

Tomas Pluskal wrote:
> > No; mostly it's non-page aligned accesses, and the fact that
> > serial access requires traversal of the entire file up to that
> > point, and so becomes exponential, the further you get into
> > the file.
> 
> I believe that non-page aligned accesses would not make it 10 times
> slower, so main problem lies in the traversal of the file ?

You can lose a factor of 3 on this; basically, a 1K block whose
last sector starts 512b from the end of a page will end up
faulting the next page in to get its last 512b.  The issue here
is one of doubled latency on every 4th block in the FS.
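
[As a quick sketch of the arithmetic, assuming 1K FS blocks that
 start 512b short of a page boundary, as described above:]

#include <stdio.h>

#define PAGE_SIZE 4096
#define BLK_SIZE  1024
#define BLK_SHIFT 512           /* the 512b misalignment described */

int
main(void)
{
        int k;

        for (k = 0; k < 8; k++) {
                long start = BLK_SHIFT + (long)k * BLK_SIZE;
                long end = start + BLK_SIZE - 1;

                printf("block %d: %5ld..%5ld  %s\n", k, start, end,
                    (start / PAGE_SIZE) != (end / PAGE_SIZE) ?
                    "faults two pages" : "one page");
        }
        return (0);
}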

In almost all optimization situations, after dealing with the
obvious (e.g. repetition of operations that should not need to
be repeated interior to loops, etc.), optimization will boil
down to an issue of reducing latency on data propagation.


> Is it a problem even if I am reading the file from beginning to end ?

Yes.

> Why?

Sequential access has a specific optimization available, which
is to use the last block to find the next block, but this fact
is not cached.  You could specifically attempt to speed up
sequential access by doing cache-behind: when asked for a
specific offset, remember that the last block was at offset N
and that its contents are pointed to at XXX, so that when you
are asked for offset N+1, you can go directly to the in-core
block instead of linearly re-traversing.  It's possible.
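
[A hypothetical sketch of that cache-behind hint; fat_next() and
 all the names here are illustrative, not the current code:]

#include <stdint.h>

/* Reads the FAT entry for cluster 'cl'; assumed given. */
extern uint32_t fat_next(uint32_t cl);

struct fat_hint {
        uint32_t hint_lbn;      /* logical block of the last lookup */
        uint32_t hint_cluster;  /* cluster that block mapped to */
        int      hint_valid;
};

uint32_t
fat_bmap(uint32_t start_cluster, uint32_t lbn, struct fat_hint *h)
{
        uint32_t cl = start_cluster;
        uint32_t n = lbn;

        /* Resume from the hint when it covers a prefix of the walk. */
        if (h->hint_valid && h->hint_lbn <= lbn) {
                cl = h->hint_cluster;
                n = lbn - h->hint_lbn;
        }
        while (n-- > 0)
                cl = fat_next(cl);      /* O(1) resumed, O(lbn) cold */

        h->hint_lbn = lbn;
        h->hint_cluster = cl;
        h->hint_valid = 1;
        return (cl);
}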

The problem with doing this is: (1) it's not there *now*, and (2)
there is no guarantee that XXX is still in core.  Cached data is
hung off the vnode, not off the inode, and the only safe place to
do the caching is in the inode, but there is no invalidation
notification below the level of vnode, which is the caching object.

In other words, the metadata is cached off the vnode of the mount
point, not off the in-core image of the inode of the file you are
reading from, and even if it were pointed to off the inode, it
would be invalidated, LRU, off the vnode of the mount point, so it
could disappear out from under the inode that's pointing to it.

Hence the suggestion that the metadata be explicitly cached, and
thus explicitly referenced: the file allocation table area of
interest is then paged in from an in-core index.
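
[A minimal sketch of such an explicit cache, assuming FAT16 and an
 illustrative read_sectors() helper; real code would size entries
 by FAT type and wire the memory:]

#include <stdint.h>
#include <stdlib.h>

extern int read_sectors(uint32_t first, uint32_t count, void *buf);

static uint16_t *fat_incore;    /* one entry per cluster (FAT16) */

int
fat_cache_init(uint32_t fat_start, uint32_t fat_sectors)
{
        fat_incore = malloc((size_t)fat_sectors * 512);
        if (fat_incore == NULL)
                return (-1);
        /* One linear read at mount; all later chaining is in-core. */
        return (read_sectors(fat_start, fat_sectors, fat_incore));
}

static uint32_t
fat_next(uint32_t cl)
{
        return (fat_incore[cl]);        /* memory, not disk */
}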


> >     http://www.usenix.org/publications/library/proceedings/sf94/forin.html
> 
> I'll read this through.

Be aware that, like I said, the conclusions are a bit farcical and
contrived.  For all that, it presents a number of techniques that
will be useful for speeding up FAT access by file offset.

One of the issues here is that you will need "last close"
notification, and you are probably also interested in having
data caching prefer "open files, largest to smallest, within an
LRU consistency window"; this is really not counter-intuitive,
if you think about a shorter file with 5 opens on it or lots of
random access vs. a longer file with less frequent access, both
of which are better cached than a small file with a lot of
access (effectively, this is a Laffer curve).
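
[A purely illustrative rendering of that preference as a priority
 function; the structure and the ranking rule are hypothetical:]

#include <stdint.h>

struct cfile {
        uint64_t cf_size;       /* file length in bytes */
        uint32_t cf_opens;      /* current open count */
        uint32_t cf_hits;       /* accesses within the LRU window */
};

/*
 * Higher score == keep cached longer.  Open files always win; among
 * them, larger files rank first (their chains cost more to rebuild);
 * closed files fall back to plain LRU by hit count, so a heavily hit
 * small file does not displace them.
 */
uint64_t
cache_score(const struct cfile *cf)
{
        if (cf->cf_opens == 0)
                return (cf->cf_hits);   /* closed: plain LRU */
        return (((uint64_t)1 << 32) + cf->cf_size);
}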

Probably the best approach, if you have a limited cache, is to
virtualize by offset in a btree, and over a certain depth,
virtualize and retraverse with a lower persistence cache for
the remainder of the data.  So, say, if you btree'ed an index
for a file, maybe the hard index would stop at every 4th block,
or every 8th block.  This would be enough to divide the offset
of a new request by 8, then traverse 3 blocks to get to the data,
rather than traversing 2^N+3 blocks for a btree depth of N.

[This would be easier with a whiteboard, but I can draw an "ASCII
 art" picture, if necessary]


> > However, it does have a couple of good suggestions for speeding
> > FAT file access up.  The most useful is that you should cache
> > the metadata information, so that you can traverse the list of
> > FS blocks in memory, instead of on disk, and caching of all the
> > directory information (for the same reasons).  Because the directory
> > entry is also the inode, in FAT, this has the effect of caching all
> > the inodes, as well.
> 
> Is this metadata caching the main difference of FreeBSD from
> linux/windows, that makes it so slow ?

Windows certainly caches metadata.  Windows also has other FS
attributes that are rather odd, and that will improve FS
performance at the expense of other system performance.  One
specific example is that memory under Windows is obtained from
a module called VMM32.VXD; everything which obtains memory from
this module is required to register a "low memory condition"
callback, which the system can call to say "give back what
pages you can, since I am low on memory".

The interface is written such that the system can ask for memory
back, and each subsystem can return multiple pages to the system
on each call.  The VFAT32.VXD IFS module specifically gives
memory back only a single page at a time, unfairly competing
with other subsystems and user programs, even though the
documentation for the system specifically recommends returning
as much memory as possible on each request.  If you are an FS
developer writing code in the IFSMgr under Windows and you don't
know this, you will never be able to write FS code that is
capable of competing, side by side in the same system, against a
VFAT32 FS under benchmark conditions (e.g. as a file server vs.
"NetBench").

I would have to examine the Linux FAT FS code in more detail to
know what they currently do, with regard to caching.  When FS's
were smaller, and Udo Walter was working on the code (say circa
1996), they did some aggressive caching, which they have probably
tuned down, as Windows FS's and disk space have both increased.


> > I don't think the issue is actually clustering.  If the cluster
> > size is set, you really don't have a choice, since your block
> > chain is hung off that, so it's not like FreeBSD goes out of its
> > way to pessimize access.
> 
> I've been doing some experiments with iostat, and I've figured out that on
> FAT filesystem the kernel is generating very small IO requests (cluster
> size), but with ufs the request size is much bigger (128KB).
> I have thought that this was the problem.

This has more to do with sequential access.  Technically, you can
read a FAT cluster at a time instead of an FS block at a time, and
you will achieve some multiplier on sequential access, but you will
find that, under load, the fault rate for blocks will go up.

Also, even if you read 64K at a time, you will end up LRU'ing out
the data that you don't access.

The issue is that UNIX files are accessed by offset, and FAT files
are accessed by offset by chaining clusters from the start to the
cluster of interest, and then reading blocks.

Since you can't cache non-metadata references by offset at the inode
level, and can only do it at the inode level as a reference into the
device cache hung off the device vnode (which can then be faulted),
there's an impedance mismatch.  If you force the caching in separate
memory for as much of the FAT as you can, then that buys you the
cluster offsets, and you can cache the blocks containing the chaining
information, as well.
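
[Under the same assumptions as the sketches above, this is what the
 in-core FAT buys you: a UNIX-style byte offset translated to a
 device sector by chaining clusters in memory instead of on disk;
 the layout parameters are illustrative:]

#include <stdint.h>

extern uint32_t fat_next(uint32_t cl);  /* in-core FAT lookup */

struct fatfs {
        uint32_t fs_spc;        /* sectors per cluster */
        uint32_t fs_dataoff;    /* first sector of the data area */
};

uint32_t
fat_offset_to_sector(const struct fatfs *fs, uint32_t start_cluster,
    uint64_t offset)
{
        uint32_t bytes_per_cl = fs->fs_spc * 512;
        uint64_t n = offset / bytes_per_cl;     /* links to chain */
        uint32_t cl = start_cluster;

        while (n-- > 0)
                cl = fat_next(cl);      /* memory reference, no I/O */

        /* Data clusters are numbered from 2 in FAT. */
        return (fs->fs_dataoff + (cl - 2) * fs->fs_spc +
            (uint32_t)((offset % bytes_per_cl) / 512));
}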

Rather than trying to design this for you on the fly, the best
thing you can do is profile the code, so that you can compare how
things are with how things end up.  This is really needed, if you
plan on using this as a school project, and want to write up
papers.  Also, the actual empirical choices on cache depth and other
tradeoffs are going to be based on intended access patterns.  So
if you want to get a good paper out of it, you'll probably implement
as many types of instrumentation, tunables, and attempts to speed up
what's slow in profile, as you can.  The more you do, the better
your results will be, and my telling you any more about what's
worked in the past would bias your efforts unfairly; I'd rather just
give you a direction to look in, and have you reach whatever
conclusions you reach.

-- Terry
