Date:      Tue, 3 Nov 2015 21:17:15 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Kirk McKusick <mckusick@mckusick.com>
Cc:        fs@freebsd.org
Subject:   Re: an easy (?) question on namecache sizing
Message-ID:  <20151103173042.K1103@besplex.bde.org>
In-Reply-To: <201511030447.tA34lo5O090332@chez.mckusick.com>
References:  <201511030447.tA34lo5O090332@chez.mckusick.com>

On Mon, 2 Nov 2015, Kirk McKusick wrote:

> You seem to be proposing several approaches. One is to make
> wantfreevnodes bigger (half or three-quarters of the maximum).
> Another seems to be reverting to the previous (freevnodes >= wantfreevnodes
> && numvnodes >= minvnodes). So what is your proposed change?

For a quick fix, I will try:

     wantfreevnodes = current value (perhaps too large)
     minvnodes = maxvnodes = desiredvnodes

with the old code that creates new vnodes up to maxvnodes instead of
attempting to recycle old vnodes well below maxvnodes.  Only one variable
is needed for this, and the very old name desiredvnodes is best for this,
but the separate variables are useful for trying variations.  More
dynamic configuration when the variables are changed is needed.  Old
versions of FreeBSD already have the separate variables, but changing them
using SYSCTL_INT() doesn't work.  E.g., reducing maxvnodes below
numvnodes doesn't eventually reduce numvnodes, but eventually causes
deadlock.
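
To be concrete about what "the old code" decided (a minimal userland
sketch of the policy, not the kernel code; the function name and the
sample numbers are mine, the condition is the one quoted above):

    #include <stdio.h>

    /*
     * Old-style policy: recycle a vnode from the free list only when
     * there are already "enough" free vnodes and we have grown past
     * minvnodes; otherwise create a new vnode so the caches keep
     * growing.  The quick fix sets minvnodes = maxvnodes =
     * desiredvnodes, so the recycling branch is not taken until the
     * cache is actually full.
     */
    static int
    recycle_instead_of_create(long numvnodes, long freevnodes,
        long minvnodes, long wantfreevnodes)
    {
            return (freevnodes >= wantfreevnodes && numvnodes >= minvnodes);
    }

    int
    main(void)
    {
            long desired = 621596;

            /* Old defaults: minvnodes = desired / 4, wantfreevnodes = 25. */
            printf("old defaults: %d\n",
                recycle_instead_of_create(200000, 26, desired / 4, 25));
            /* Quick fix: minvnodes = maxvnodes = desired. */
            printf("quick fix:    %d\n",
                recycle_instead_of_create(200000, 26, desired, desired / 4));
            return (0);
    }

With the old defaults this prints 1 (thrash through the 26 free vnodes);
with the quick fix it prints 0 (create a new vnode and keep caching).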

The old version in its default configuration doesn't really work for du -s
either.  It appears to work initially, but it basically asks for thrashing
through just 25 vnodes when only stat()s are done, so even ls -l /bin
thrashes once the system has created 1/4 of its "desired" number of
vnodes.  The caches actually work initially because the silly limits are
inactive initially.

To see the silly behavior in an old version of FreeBSD:
- read more than minvnodes files.  This enters the silly region where
   the silly wantfreevnodes limit starts being applied
- do a few du's and ls's in loops and watch them using systat -v.  Verify
   that numvnodes > minvnodes is fairly stable and
   freevnodes <= wantfreevnodes = 25 (default)
- choose any directory with > 25 files in it that has not been looked at
   before.  Repeat the previous test with du or ls -l on this directory.
   Observe that at least the namecache is broken (I see 11% hits for
   namecache and 51% for dircache for a directory with 31 entries
   (counting "." but not "..")).  A toy simulation of this thrashing
   follows the list.
- increase wantfreevnodes to the number of entries in the directory
   (possibly counting both "." and "..") and repeat the previous test.
   Observe that this unbreaks at least the namecache.
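
Here is the toy simulation mentioned above (my own model, not the
kernel's actual list handling; CACHE, NFILES and PASSES are made-up
names): a strict LRU "free list" of 25 entries walked cyclically over
31 files never gets a hit after the first pass, which is the mechanism
behind the thrashing -- the measured 11% is only better because the
real system does more than the walk.

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy model of the thrashing: a strict LRU cache of CACHE entries,
     * accessed cyclically by a walk over NFILES distinct files.  With
     * CACHE = 25 and NFILES = 31 every lookup after the first pass
     * misses, because each miss evicts the entry that will be wanted
     * again 25 lookups later.
     */
    #define CACHE   25
    #define NFILES  31
    #define PASSES  10

    int
    main(void)
    {
            int lru[CACHE];         /* lru[0] is the least recently used */
            int used = 0, hits = 0, lookups = 0;

            for (int pass = 0; pass < PASSES; pass++) {
                    for (int f = 0; f < NFILES; f++) {
                            int i;

                            lookups++;
                            for (i = 0; i < used; i++)
                                    if (lru[i] == f)
                                            break;
                            if (i < used) {
                                    /* Hit: move to most recently used. */
                                    memmove(lru + i, lru + i + 1,
                                        (used - 1 - i) * sizeof(lru[0]));
                                    lru[used - 1] = f;
                                    hits++;
                            } else if (used < CACHE) {
                                    lru[used++] = f;
                            } else {
                                    /* Miss with a full cache: evict LRU. */
                                    memmove(lru, lru + 1,
                                        (CACHE - 1) * sizeof(lru[0]));
                                    lru[CACHE - 1] = f;
                            }
                    }
            }
            printf("%d hits in %d lookups\n", hits, lookups);
            return (0);
    }

This prints "0 hits in 310 lookups".  Raising CACHE to 31 (the last step
above) turns every lookup after the first pass into a hit.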

The old minvnodes limit (default desiredvnodes / 4) limited the silly
behaviour to above that limit, but since numvnodes was (still is?)
never reduced (even by unmount), the silly region is reached in normal
operation (after reading lots of files) and is fairly sticky after
that (unmount does help by creating lots of free vnodes, and then it
takes reading lots of files to reach the silly region again).

Now there is no minvnodes limit, and the wantfreevnodes defaults to the
old minvnodes default.  This gives slightly worse behaviour than
the old version with the default for wantfreevnodes changed from 25 to
desiredvnodes / 4.  25 was far too small and desiredvnodes / 4 is
probably too large for most purposes.  However, desiredvnodes can be
enlarged to leave space for a larger than necessary wantfreevnodes,
and the larger than necessary wantfreevnodes is sometimes useful.

Many problems remain, especially for initialization.  The old defaults
worked perfectly for initialization up to the minvnodes limit.  vnodes
were never recycled below that.  Now the default for wantfreevnodes
gives an identical limit with different semantics.   Silly caching
for stat()ed files sometimes occurs below this limit instead of always
occurring above this limit.  E.g., soon after booting, numvnodes is
about 200 and freevnodes is about 100.  In the old version, all stat()s
of new files increase both numvnodes and freevnodes until the limit is
reached -- the caching works.  In the current version, stat()s of new
files cycle through the old free vnodes if possible -- the caching
doesn't work, but instead thrashes especially well when freevnodes is
small.  It takes non-stat() accesses to files to increase numvnodes.
Eventually numvnodes becomes large enough for freevnodes to also
become large on average, so the cases with perfect thrashing become
rare.  But the special caching for stat() still gives a deterministic
thrashing case.  That is when, although wantfreevnodes is larger than
necessary for most cases (and freevnodes is almost as large), it is
not large enough to hold the current working set of stat()s, as can
easily happen for tree walks.  Sometimes you want to walk the tree
more than once and know that it all should fit in caches, but the
recycling makes the caches ineffective.  I think there is a worse
subscase of this, like the one for initialization.  Suppose that
freevnodes is small at the start of a tree walk.  Then I think
the vnode caching prefers to recycle through this small number rather
than create new vnodes.  Reading directories makes freevnodes even
smaller.

The quick fix is supposed to make the free vnode management almost
null.  A non-quick fix would only turn it off when numvnodes <
maxvnodes.  When numvnodes >= maxvnodes, recycling is still bad
if it is mostly through free vnodes and freevnodes is small.  Here
"small" is relative.  Anything smaller than the working set is
too small, but if the working set is too large to fit then it is
better to let it thrash in a small part of the cache than in a
large part.  I would try letting freevnodes grow to 3/4 of desiredvnodes
for tree walks but try to keep it lower than 1/4 of desiredvnodes in
normal use.
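
As a sketch of that non-quick fix (again userland pseudocode of mine and
my reading of the above, not a patch; freetarget stands for whichever
fraction of desiredvnodes is in force):

    #include <stdio.h>

    /*
     * Non-quick fix sketch: below maxvnodes never recycle, just create
     * new vnodes so the caches can grow; at or above maxvnodes recycle
     * from the free list only when it is a reasonable fraction of the
     * cache (freetarget ~ 1/4 of desiredvnodes in normal use, up to 3/4
     * for tree walks), otherwise prefer some other source so a tiny
     * free list is not thrashed.
     */
    static int
    recycle_free_vnode(long numvnodes, long freevnodes, long maxvnodes,
        long freetarget)
    {
            if (numvnodes < maxvnodes)
                    return (0);             /* keep growing instead */
            return (freevnodes >= freetarget);
    }

    int
    main(void)
    {
            long maxvnodes = 621596;

            /* Below the limit (numbers from the ref11-amd64 dump below). */
            printf("below the limit: %d\n",
                recycle_free_vnode(537910, 154789, maxvnodes, maxvnodes / 4));
            /* At the limit with a healthy free list (made-up figure). */
            printf("at the limit:    %d\n",
                recycle_free_vnode(621596, 160000, maxvnodes, maxvnodes / 4));
            return (0);
    }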

The current limit of 1/4 of desiredvnodes works better on larger systems.
Such systems might never reach the limits, especially with the slow
ramp-up of numvnodes.  E.g., ref11-amd64 is not nearly as large as
freefall, but now has 11 users and has been up for 29 days; it still
hasn't reached the maxvnodes limit:

   kern.maxvnodes: 621596
   kern.minvnodes: 155399
   vfs.freevnodes: 154789
   vfs.wantfreevnodes: 155399
   vfs.vnodes_created: 117384264
   vfs.numvnodes: 537910
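
(Note that kern.minvnodes and vfs.wantfreevnodes are both exactly
maxvnodes / 4 here: 621596 / 4 = 155399, and vfs.freevnodes has crept
up to just below that limit.)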

Most FreeBSD systems run big tree walks every night.  This one has about
6M inodes in / and 900M inodes elsewhere.  It would soon reach maxvnodes
if it cached all of these.  Limiting it to 537910 instead of 621596 is
not useful, but this type of automatic big tree walk where the results
are rarely used shouldn't be allowed to thrash through more than a small
fraction of maxvnodes.  So going from a fraction of 1/4 to 3/4 can't
be automatic.

Bruce


