Date: Fri, 13 Aug 1999 20:50:39 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: zzhang@cs.binghamton.edu (Zhihui Zhang)
Cc: tlambert@primenet.com, phk@critter.freebsd.dk, roberto@keltia.freenix.fr, freebsd-fs@FreeBSD.ORG
Subject: Re: Help with understand file system performance
Message-ID: <199908132050.NAA17293@usr01.primenet.com>
In-Reply-To: <Pine.GSO.3.96.990812202049.1878A-100000@sol.cs.binghamton.edu> from "Zhihui Zhang" at Aug 12, 99 09:02:32 pm
> On Thu, 12 Aug 1999, Terry Lambert wrote:
>
> > The filesystem block allocation table in directories is unique, in
> > that it is generally used as a convenience for locating physical
> > blocks, rather than using the standard filesystem block access
> > mechanisms, when reading or writing directories.
>
> Directory files have the same on-disk structure as regular files.

Yes.  But they are not accessed internally as if they were regular
files.  The only operation which is treated as "regular" is extending
(and, as of 4.4BSD, truncating back) the block allocations in
directories.

The directory manipulation code treats a directory as a series of
blocks, and translates from the "regular file" aspect into BLKATOFF().

> However, they can never have holes, and they can only be grown at the
> end of the file in device block chunks.  No directory entry can cross
> a device block boundary, to guarantee atomic updates.

Right.  There is no such thing as a "sparse block allocation" in a
directory, since BLKATOFF() assumes the existence of a block.

Directory entries are physically prevented from crossing block
boundaries in order to ensure atomic updates.  But this is an
implementation detail, and it is not the only way one could ensure
atomicity, so long as one were willing to reallocate (filesystem, not
physical) blocks or frags in order to do the updates (i.e. you could
arrange for a two stage commit; I did this in my Unicode FFS prototype,
since even though a 256 character name would fit in 512b, there was no
room left over for the metadata).

> However, I do not know why you say the block map (direct and indirect
> blocks) of a directory is only used as a convenience.  I mean there is
> a need to call VOP_BMAP() on a directory file.  The routine
> ffs_blkatoff() calls bread(), which in turn calls VOP_BMAP().  The
> in-core inode does have several fields to facilitate the insertion of
> new directory entries.  But we still need the block map (block
> allocation table).
Directory manipulations access blocks directly.  You've no doubt
noticed that the vast majority of system calls do _not_ require
VOP_BMAP() calls for copyin/out operations on VM objects backed by the
filesystem.  The need to call VOP_BMAP() is an artifact of treating
directories as lists of blocks, rather than treating them as files.

The "convenience" aspect is that they are files, but they are not used
as such; files are used as the underlying abstraction only because it
is convenient.  Directories are not naturally represented as files,
and in fact, trying to make them conform to the normal file behaviour
would break the atomicity guarantee.

> Directory files are also special in that we can not write into them
> with the write() system call as with normal files.  They use a special
> routine to grow, i.e., ufs_direnter().  By the way, we can use the
> read() system call to read directory files as we do with normal files.

The lack of the ability to write was mirrored by a lack of the ability
to read, as well, until this was intentionally changed.  Likewise,
there was no ability to mmap directories (read only, of course), until
that, too, was changed.  These are both optimizations to speed up
certain programs, and are really antithetical to POSIX.

In reality, if you have looked at the "cookie" code for VOP_READDIR()
in NFS, FFS, and at least one other FS, you will see that the need for
cookies is an artifact of the structure of the interface.  An
alternate interface would allow directory block abstraction separate
from the externalization of directory entries.

The structure that is returned by getdents() is actually only
coincidentally (albeit intentionally so) the same as the FFS on-disk
structure.  See the 4.3/4.4 compatibility translation code in
VOP_READDIR() in the FFS implementation.
The upshot of this is that the ability to read or mmap directories for
reading is actually a very bad thing, from an interface perspective,
since it promotes the writing of code that depends on data format
interfaces.  This is similar to the use of the KVM as a data interface.

It is only coincidental, based on implementation (unintentionally so,
this time), that the POSIX access time updates for files and the
access time updates for directories (which POSIX mandates for
getdents() operations) happen to coincide.

If you look at the cookie mess, and the NFS server code wire format
translation mess, I'm sure you will agree.  You only need to ask
yourself "how could NFS handle a VOP_READDIR() that came from an
underlying FS that could pack more entries in a block than could be
represented in a block in the external 'coincidental' format?" to
prove to yourself that this is broken.

> > There are a number of performance penalties for this, especially
> > on large directories, where it is not possible to trigger sequential
> > readahead through use of the getdents() system call sequentially
> > accessing sequential 512b/physical_block_size extents.
>
> I do not understand this.  The read-ahead mechanism should work on any
> file.  I thought the reorganization of directory entries within a
> directory block when you delete an entry is an inefficiency.
>
> Does this issue have anything to do with the VMIO directory issue
> discussed earlier this year?

No.  It has to do with VOP_READDIR() not exhibiting behaviour which
would trigger read-ahead, such as is triggered by READ, WRITE,
GETPAGES, and PUTPAGES.

> > The frag size can be tuned down below this (i.e. 1/4, 1/2, 1).
> >
> > The only case where 1024 bytes of physical disk would be used is at
> > a filesystem block size of 8192 (or greater), which, divided by 8,
> > gives 1024b (or greater).
>
> I did not realize this before.  The maximum ratio is 8.
> So if the filesystem block is 8192, the allocation unit (fragment
> size) can not be 512, because 8192/512 > 8.

Yes.  There are only 8 bits available for representing frag
allocations.

> > This is called an encapsulated two stage commit, in database terms.
> >
> > For inodes, indirect blocks, and directory entry blocks, there is
> > no two stage commit, because there is no indirection of their data
> > contents.
>
> I guess you mean that their data are not managed by any higher level
> metadata which must be updated together.

Yes.  Despite the fact that "higher level" metadata exists, since the
implementation detail is that directories are stored using "files",
the actual implementation does not take advantage of this, either for
triggering read-ahead, or for encapsulated commits of directory
modifications, or for clustering (which could only occur on a restore
from an archive, given the incremental nature of directory entries),
or for any of a dozen other speed enhancements which are applied to
normal files.  This means that directories are, by their nature,
rather slow.

> Thanks for your help.
>
> -Zhihui

Any time.  8-).  It's an interesting discussion to engage in; there
are interesting solutions (not implemented in FreeBSD) to many of the
performance issues that people raise against the FFS.

The last time this issue came up that I remember had to do with
depth-first creation and breadth-first traversal of the ports
directory structure; I actually still maintain that this is a problem
in the creation of the directory (i.e. the organization of the
archive) more than it is a problem with the FS itself (a tool is only
as good as the craftsman using it).  If used properly, there really
aren't a lot of performance problems that you can point to (sort of
like cutting with vs. against the grain in a board).
I am becoming convinced that an intermediate abstraction is really
what is called for, to turn the bottom end into what is, in effect,
nothing more than a flat, numeric namespace on top of a variable
granularity block store.

A nice topic for much research... 8-).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message