Date:      Mon, 2 Nov 2015 22:29:56 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        fs@freebsd.org
Subject:   Re: an easy (?) question on namecache sizing
Message-ID:  <20151102210750.S1908@besplex.bde.org>
In-Reply-To: <20151102193756.L1475@besplex.bde.org>
References:  <20151102193756.L1475@besplex.bde.org>

On Mon, 2 Nov 2015, Bruce Evans wrote:

> At least in old versions before cache_changesize() (should be nc_chsize())
> existed, the name cache is supposed to have size about 2 * desiredvnodes,
> but its effective size seems to be only about desiredvnodes / 4?  Why is
> this?
>
> This shows up in du -s on a large directory like /usr.  Whenever the
> directory has more than about desiredvnodes / 4 entries under it, the
> namecache thrashes.  The number of cached vnodes is also limited to
> about desiredvnodes / 4.
>
> The problem might actually be in vnode caching.  ...

This was easy to answer.  The problem is in vnode caching.  Its only
relationship with the namecache is that if you increase the bogus
vnode cache limit, then cache_changesize() now adjusts the associated
namecache limit to match, but doesn't increase the associated non-bogus
vnode cache limits to match.
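
For example, a sketch (illustrative only; the sysctl names and types come
from the declarations quoted below, and the 1000000 value is just a
hypothetical new limit) of compensating by hand with sysctlbyname(3),
since the kernel no longer does it when kern.maxvnodes is raised:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>

int
main(void)
{
	int maxvnodes = 1000000;	/* hypothetical new limit */
	u_long wantfree;

	/* Raise the (bogus) limit; this resizes the namecache too. */
	if (sysctlbyname("kern.maxvnodes", NULL, NULL, &maxvnodes,
	    sizeof(maxvnodes)) == -1)
		err(1, "kern.maxvnodes");
	/* Restore the 1/4 ratio by hand; the kernel won't do it for you. */
	wantfree = (u_long)maxvnodes / 4;
	if (sysctlbyname("vfs.wantfreevnodes", NULL, NULL, &wantfree,
	    sizeof(wantfree)) == -1)
		err(1, "vfs.wantfreevnodes");
	return (0);
}

Whether the 1/4 ratio is a good target is a separate question (see below).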

From vfs_subr.c:

X /*
X  * Number of vnodes we want to exist at any one time.  This is mostly used
X  * to size hash tables in vnode-related code.  It is normally not used in
X  * getnewvnode(), as wantfreevnodes is normally nonzero.)
X  *
X  * XXX desiredvnodes is historical cruft and should not exist.
X  */
X int desiredvnodes;

I probably helped eivind write the XXX comment in 2000.  I only just
noticed the error in the main part of the comment.  This is not the
number of vnodes that we want to exist, but about 4 times that number.

X ....
X SYSCTL_PROC(_kern, KERN_MAXVNODES, maxvnodes,
X     CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, &desiredvnodes, 0,
X     sysctl_update_desiredvnodes, "I", "Maximum number of vnodes");

The maximum is not bogus.  Only "desired" in the name is bogus.  Note
that the sysctl name doesn't say "desired".  But systat -v still uses
the raw variable name (abbreviated to "desvn").  It is important for
understanding systat -v output and the source code to know that this
variable is actually the maximum and not the desired number, unlike
what its name suggests.

X SYSCTL_ULONG(_kern, OID_AUTO, minvnodes, CTLFLAG_RW,
X     &wantfreevnodes, 0, "Minimum number of vnodes (legacy)");

Further obfuscations.  The desiredvnodes / 4 number comes from here.
This value really is the "wanted" or "desired" number of vnodes.
The sysctl obfuscates it by renaming it to "minvnodes"; it is not
a minimum except in the sense that when the current number exceeds
it, the vnlru daemon tries (not very hard) to reduce the count to
this limit.
The description of this sysctl as legacy is confusing.  Perhaps the
name of this sysctl is legacy, but its value is less legacy than
that of desiredvnodes by any name.  This is further obfuscated by
exporting this variable twice (once here and once with its correct
name under vfs).  See below for an example of sysctl output.

X ...
X 	wantfreevnodes = desiredvnodes / 4;

This is where the length used below is initialized.

X ...
X /*
X  * Attempt to keep the free list at wantfreevnodes length.
X  */
X static void
X vnlru_free(int count)

Misplaced comment.  This function actually attempts to keep the list
at a certain length that is decided elsewhere.

X ...
X static void
X vnlru_proc(void)
X {
X ...
X 	for (;;) {
X 		kproc_suspend_check(p);
X 		mtx_lock(&vnode_free_list_mtx);
X 		if (freevnodes > wantfreevnodes)
X 			vnlru_free(freevnodes - wantfreevnodes);

This is where the length used above is passed.  This length is defaulted
by the above initialization and may be changed by either of the 2 sysctls
for it.

But the attempt usually fails, and then the vnode cache works better
by growing nearly 4 times as large as is "wanted", up to nearly its
"desired" size which is actually the limit on its size.  In my du -s
test, the attempt succeeds and breaks the caching almost perfectly
when the number of files is slightly larger than wantfreevnodes =
desiredvnodes / 4, but when the files are read the attempt fails and
the caching works when the number of files is smaller than
desiredvnodes.
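
A small userland check of those thresholds (illustrative only; it just
reads the two limits with sysctlbyname(3) and prints the approximate
break-even points described above):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
	int maxvnodes;
	u_long wantfree;
	size_t len;

	len = sizeof(maxvnodes);
	if (sysctlbyname("kern.maxvnodes", &maxvnodes, &len, NULL, 0) == -1)
		err(1, "kern.maxvnodes");
	len = sizeof(wantfree);
	if (sysctlbyname("vfs.wantfreevnodes", &wantfree, &len, NULL, 0) == -1)
		err(1, "vfs.wantfreevnodes");
	/* du -s (stat only): thrashing starts near wantfreevnodes files. */
	printf("stat-only threshold: ~%lu files\n", wantfree);
	/* Reading the files: caching works up to near maxvnodes files. */
	printf("read threshold:      ~%d files\n", maxvnodes);
	return (0);
}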

Reading the files is probably closer to normal operation.  On freefall
now: "sysctl -a | grep vnode" gives:

    kern.maxvnodes: 485993
    kern.minvnodes: 121498
    vfs.freevnodes: 121453
    vfs.wantfreevnodes: 121498
    vfs.vnodes_created: 360808607
    vfs.numvnodes: 408313

Note that the current number is about 3.5 times as large as the "wanted"
number.  This shows that the attempts to reduce to the "wanted" number
usually fail, so the cache is almost 4 times as large as is "wanted".

The kern values are limits, with a hard maximum and a soft minimum.  The
kern.minvnodes number is duplicated under its better name
vfs.wantfreevnodes.

The sysctl for desiredvnodes is now the SYSCTL_PROC() shown above.  The
function for this now updates vfs_hash and the namecache, but not
wantfreevnodes.
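
A sketch of what that handler does, based on the description above (the
helper names are from memory and may differ slightly from the tree):

static int
sysctl_update_desiredvnodes(SYSCTL_HANDLER_ARGS)
{
	int error, old_desiredvnodes;

	old_desiredvnodes = desiredvnodes;
	error = sysctl_handle_int(oidp, arg1, arg2, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	if (old_desiredvnodes != desiredvnodes) {
		/* Resize the vfs hash and the namecache to match ... */
		vfs_hash_changesize(desiredvnodes);
		cache_changesize(desiredvnodes);
		/* ... but wantfreevnodes = desiredvnodes / 4 is not redone. */
	}
	return (0);
}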

Code earlier in vfs_subr.c shows that it is wantfreevnodes that is primary
and its duplication for minvnodes really is legacy.  It is vfs.wantfreevnodes
that should be the SYSCTL_PROC():

X /*
X  * Free vnode target.  Free vnodes may simply be files which have been stat'd
X  * but not read.  This is somewhat common, and a small cache of such files
X  * should be kept to avoid recreation costs.
X  */
X static u_long wantfreevnodes;
X SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, "");

According to this, keeping vnodes as free but inactive for files that have
been stat'ed but not read is intentional.  The du -s example shows that
this works almost perfectly as foot-shooting, except VMIO stops the foot
being blown very far away.
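
If vfs.wantfreevnodes were the SYSCTL_PROC() as suggested above, a minimal
handler might look like this (purely hypothetical, not in the tree; it only
validates the new value against desiredvnodes):

static int
sysctl_update_wantfreevnodes(SYSCTL_HANDLER_ARGS)
{
	u_long val;
	int error;

	val = wantfreevnodes;
	error = sysctl_handle_long(oidp, &val, 0, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	if (val > (u_long)desiredvnodes)
		return (EINVAL);
	wantfreevnodes = val;
	return (0);
}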

X /* Number of vnodes in the free list. */
X static u_long freevnodes;
X SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0,
X     "Number of vnodes in the free list");

Data for the foot-shooting:

    kern.maxvnodes: 70000
    kern.minvnodes: 30774
    vfs.numvnodes: 38157
    vfs.vnodes_created: 461556
    vfs.wantfreevnodes: 30774
    vfs.freevnodes: 30775

Here maxvnodes started at 4*30774 but I reduced it to 70000 for comparison
with another system.  I didn't reduce minvnodes from 30774 since neither
I nor the sysctl knew about it.  Then du -s on a directory tree with
49683 files gave the (now even more misconfigured) "wanted" number of free
vnodes almost perfectly.  70000 - 30774 = 39226 is less than the number
of files, so this asks for thrashing of the vnode and name caches.  38157
instead of 39226 vnodes were left cached.

The default misconfiguration gives more mysterious numbers:

    kern.maxvnodes: 123096
    kern.minvnodes: 30774
    vfs.numvnodes: 38143
    vfs.vnodes_created: 50136
    vfs.wantfreevnodes: 30774
    vfs.freevnodes: 30774

Now, 1/4 of maxvnodes being forced to be free shouldn't limit the
number of non-free ones too much -- 3/4 can remain in use; 3/4 of maxvnodes
is about 90000 and that is plenty for caching about 49000 files.
Apparently, the freeing is too active, so when there aren't many vnodes
in use it reaches the target by discarding useful vnodes.  For the du -s
access pattern with lots of stat'ed files, the total number of vnodes
in use never grows large enough to justify freeing any.  But on freefall
or any system that has been up for a while doing a variety of tasks, the
number of vnodes in use is large, so discarding some earlier than necessary
works right.  A large du -s then discards lots of old vnodes but not the
ones that it looks at unless there are just too many.

Bruce


