Date:      Sun, 15 Sep 1996 11:19:02 +1000
From:      Bruce Evans <bde@zeta.org.au>
To:        proff@suburbia.net, terry@lambert.org
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: attribute/inode caching
Message-ID:  <199609150119.LAA30846@godzilla.zeta.org.au>

>> What is the present status of attribute/inode/directory caching under
>> freebsd? When performing a 'du' of even a relatively small hierarchy,

Little improved since 1992.  The vnode cache is a bit larger, at least
on machines with plenty of memory (the default is up from about 200
on an 8MB system to about 2000 on a 32MB system), and you can tweak
its size using sysctl(8) or the unnecessary EXTRAVNODES option, but
the caching still breaks down when the vnode cache starts to thrash.

The problem used to be that buffers for directories were attached to
vnodes, so thrashing of the vnode cache also thrashed the buffer cache.
Also, (?) inode buffers weren't kept in the buffer cache and the buffer
cache was too small to hold many inodes.  Under Linux, at least in 1992,
the buffer cache isn't so tightly coupled to the vnode cache, so ordinary
LRU caching results in the buffer cache filling up with inode data,
so it can easily cache 6000 128-byte inodes (or 20000 32-byte inodes)
and associated directory entries in only 1MB of buffer cache.  This might
not be the best use for the buffer cache, but it is good for traversing
large hierarchies.  I don't know exactly how the unified vm and buffer
cache has affected this.  Apparently, not much.
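
As a sanity check on those numbers: an 8K buffer holds 64 128-byte
inodes, so a 1MB cache (128 such buffers) tops out around 8000 inodes,
and 6000 plus the associated directory blocks is consistent with that.
To make the device/extent idea concrete, here is a toy LRU cache keyed
by (device, block); none of these names come from any real kernel, and
a real cache would hash rather than scan a list:

    /* Toy buffer cache keyed by device/extent, not inode/extent. */
    #include <stddef.h>

    #define NBUF    128                 /* 128 * 8K = 1MB of buffers */
    #define BSIZE   8192

    struct buf {
        int         b_dev;              /* key: device ... */
        long        b_blkno;            /* ... and block on that device */
        struct buf *b_next;             /* LRU list, most recent first */
        char        b_data[BSIZE];
    };

    static struct buf bufs[NBUF];
    static struct buf *lru_head;

    void
    buf_init(void)
    {
        int i;

        for (i = 0; i < NBUF - 1; i++)
            bufs[i].b_next = &bufs[i + 1];
        for (i = 0; i < NBUF; i++)
            bufs[i].b_dev = -1;         /* match nothing initially */
        lru_head = &bufs[0];
    }

    /*
     * Look up (dev, blkno); on a miss, recycle the least recently
     * used buffer.  Reclaiming a file's vnode invalidates nothing
     * here, so blocks full of inodes stay cached.
     */
    struct buf *
    getblk_dev(int dev, long blkno)
    {
        struct buf **pp, *bp;

        for (pp = &lru_head; (bp = *pp) != NULL; pp = &bp->b_next) {
            if (bp->b_dev == dev && bp->b_blkno == blkno)
                break;                  /* hit */
            if (bp->b_next == NULL)
                break;                  /* bp is the LRU tail */
        }
        if (bp == NULL)
            return (NULL);              /* buf_init() not called */
        if (bp->b_dev != dev || bp->b_blkno != blkno) {
            bp->b_dev = dev;            /* miss: reuse the tail ... */
            bp->b_blkno = blkno;
            /* ... a real cache would read the block from disk here */
        }
        *pp = bp->b_next;               /* unlink ... */
        bp->b_next = lru_head;          /* ... and move to the front */
        lru_head = bp;
        return (bp);
    }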

>> the second 'du' appears no faster than the first and the drive can be
>> heard to thrash around in exactly the same manner.

I notice this mainly when I run `find' on relatively large hierarchies.
The problem is not so much that the second traversal reads everything
again, but that the first traversal thrashes the buffer and/or vm cache.

>POSIX mandates that the access time will be marked for update when you
>read the directory; thus it's written out, and the thrashing is expected.

Wrong.  Neither marking for update nor updating the access times requires
writing anything.  In FreeBSD, writing is a side effect of thrashing the
caches and updating is often a side effect of writing.  First, when the
vnode cache thrashes, the vnodes have to be updated and written to the
buffer cache.  Second, when the buffer cache thrashes, the dirty buffers
containing the vnodes have to be written out.  They are usually written
with delayed writes, so the writes should no more than double the
overhead of the thrashing (in practice the cost is much worse, because
of seeks).
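
The distinction, as a sketch, with flag names modeled loosely on the
4.4BSD ufs inode (illustrative only, not the actual code paths):

    #define IN_ACCESS   0x0001  /* access time needs updating */
    #define IN_MODIFIED 0x0002  /* in-core inode differs from disk */

    struct xinode {
        int  i_flag;
        long i_atime;
    };

    /* POSIX only requires this much on a read: set a bit.  No I/O. */
    void
    mark_atime(struct xinode *ip)
    {
        ip->i_flag |= IN_ACCESS;
    }

    /* The timestamp is folded in lazily ... */
    void
    itimes(struct xinode *ip, long now)
    {
        if (ip->i_flag & IN_ACCESS) {
            ip->i_atime = now;
            ip->i_flag = (ip->i_flag & ~IN_ACCESS) | IN_MODIFIED;
        }
    }

    /*
     * ... and reaches the disk only when the inode is reclaimed or
     * its buffer is flushed, normally as a delayed write (bdwrite()),
     * which is why the writes only show up once the caches thrash.
     */
    void
    inode_reclaim(struct xinode *ip, long now)
    {
        itimes(ip, now);
        if (ip->i_flag & IN_MODIFIED) {
            /* copy the inode into its buffer; bdwrite(bp) here */
            ip->i_flag &= ~IN_MODIFIED;
        }
    }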

>One issue which is a big one in my book is that only data hung off a vnode
>is cached in the buffer cache.  The caching is by inode/extent rather than
>by device/extent.

Yes, this is the main problem.

>The net result of this will be that the inode data itself will not be
>cached.

It could be hung off the vnode for the mounted device.  I'm not sure if
it isn't already.  This problem is secondary.  Repeated tree traversals
aren't all that common, and you don't really want them to eat the buffer
cache (you probably want to buffer precisely the inodes and directories
that will be hit again a long time later in the same search, e.g.,
intermediate directories for a depth-first search).
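
For concreteness, "hung off the vnode for the mounted device" would
mean fetching inode blocks through the device key, much as ffs_vget()
reads them via the mounted device's vnode (from memory; don't take the
details as gospel).  Continuing the toy cache from above, with made-up
geometry (8K blocks, 128-byte inodes):

    #define INOSIZE 128
    #define INOPB   (BSIZE / INOSIZE)   /* 64 inodes per block */

    /*
     * Fetch an on-disk inode through the *device* key rather than
     * a per-file vnode; the whole block of 64 inodes then survives
     * reclamation of any individual file's vnode.
     */
    char *
    read_dinode(int dev, long inostart, long ino)
    {
        struct buf *bp;

        bp = getblk_dev(dev, inostart + ino / INOPB);
        return (bp->b_data + (ino % INOPB) * INOSIZE);
    }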

>There is a "second chance" ihash cache in FFS; other FS's are not so
>lucky; thus your performance will depend on number of elements before
>the hash overflows and whether or not you are testing FFS or some other
>FS.  For instance, expect EXT2FS to have significantly worse performance
>under BSD.

Actually, ext2fs uses the ufs ihash.  Hmm, EXTRAVNODES is necessary after
all, since the ihash table isn't affected by the sysctl to change
`desiredvnodes'.  It's fishy that the ufs table size is the same as the
vfs table size.
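
For reference, the shape of that hash, from memory of the ufs sources
(names and details approximate): lookups key on (device, inode number),
and the table is sized once at init time from `desiredvnodes', which is
why the sysctl can't grow it afterwards.

    #include <stdlib.h>

    struct inode {
        struct inode *i_next;           /* hash chain */
        int           i_dev;
        long          i_number;
    };

    static struct inode **ihashtbl;
    static unsigned long  ihash;        /* table size - 1, power of 2 */

    #define INOHASH(dev, ino)   (((dev) + (ino)) & ihash)

    /* Sized once from desiredvnodes; later sysctl changes don't help. */
    void
    ufs_ihashinit(int desiredvnodes)
    {
        unsigned long n;

        for (n = 1; n < (unsigned long)desiredvnodes; n <<= 1)
            continue;
        ihash = n - 1;
        ihashtbl = calloc(n, sizeof(*ihashtbl));
    }

    struct inode *
    ufs_ihashget(int dev, long ino)
    {
        struct inode *ip;

        for (ip = ihashtbl[INOHASH(dev, ino)]; ip != NULL; ip = ip->i_next)
            if (ip->i_dev == dev && ip->i_number == ino)
                return (ip);
        return (NULL);
    }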

Bruce


