Date: Thu, 5 Nov 2015 06:49:15 +1100 (EST)
From: Bruce Evans
To: Konstantin Belousov
cc: Kirk McKusick, Bruce Evans, fs@freebsd.org
Subject: Re: an easy (?) question on namecache sizing
In-Reply-To: <20151103090448.GC2257@kib.kiev.ua>
Message-ID: <20151105043607.K3175@besplex.bde.org>
References: <20151102224910.E2203@besplex.bde.org> <201511030447.tA34lo5O090332@chez.mckusick.com> <20151103090448.GC2257@kib.kiev.ua>

On Tue, 3 Nov 2015, Konstantin Belousov wrote:

> On Mon, Nov 02, 2015 at 08:47:50PM -0800, Kirk McKusick wrote:
>> You seem to be proposing several approaches.  One is to make
>> wantfreevnodes bigger (half or three-quarters of the maximum).
>> Another seems to be reverting to the previous (freevnodes >= wantfreevnodes
>> && numvnodes >= minvnodes).  So what is your proposed change?
>
> Free vnodes could be freed in a soft fashion by the vnlru daemon, or in
> a hard manner, by getnewvnode(), when the max for the vnode count is
> reached.  The 'soft' way skips vnodes which are directories, to make it
> more probable that vn_fullpath() would succeed, and also has a threshold
> for the count of cached pages.  The 'hard' way waits up to 1 sec for
> the vnlru daemon to succeed, before forcing a recycle of any vnode,
> regardless of the 'soft' stoppers.  This causes the ticking behaviour of
> the system, where only one vnode operation in a single thread succeeds
> per second.  This ticking behaviour seems to be mostly wrong.

numvnodes is allowed to grow to its nominal limit (desiredvnodes) in most
places (except that an off-by-1 error allows it to grow 1 larger), but
vnlru_proc() uses a limit of 9/10 of that (without the off-by-1 error).
So the usual behaviour is to grow a little larger than the 9/10 limit,
then churn by reclaiming too much, to significantly below the 9/10 limit,
then build back up to above 9/10 again...

I didn't see the 1-vnode-per-second behaviour, except that when I set
desiredvnodes to values like 1, 10 or 100 I saw very interesting
behaviours.  A little larger than 100 works almost normally, but 10
should give 1 vnode per second, since 1 is 1/10 of 10.  I think it
actually gives 2 vnodes per second due to the off-by-1 error.

I need to know whether the churn is intentional before removing it
completely.  See the comment below.

> A large wantfreevnodes value is a safety measure to prevent the tick
> steps in practice.  My initial reaction to the complaint was just to
> suggest increasing desiredvnodes; at least this is what I do on
> machines where there is a lot of both KVA and memory and intensive
> file loads are expected.

Yes, desiredvnodes should be sized large enough to allow space for plenty
of "free" vnodes, but there are some problems with this:

- non-problem: dynamic resizing of all the limits now works quite well,
  better than I said earlier.  -current ramps up both the total and free
  vnodes better than I said earlier.
- not many people know about this magic.
- i386s don't have enough KVA for much enlargement.  The defaults here
  give 70K for desiredvnodes with 1GB of RAM and 123K with 3GB of usable
  RAM.  Even 70K is really too many.  Memory use for each vnode seems to
  be unreasonably large (more than 1K on average, even when most vnodes
  are free).  That gives about 100 MB of wired memory.
- there are some bugs with large numbers of vnodes.  300K of them almost
  fit and almost work on i386.  However:
  - in old versions of FreeBSD, perhaps with VM_BCACHE_SIZE_MAX larger
    than the default, null pointer panics occur.
  - at least in -current, uma is reluctant to free vnode allocations.
    Reducing desiredvnodes now works (on the next tick) to reduce
    numvnodes, but uma doesn't free anything, since we don't ask it to.
    I think it will free later if memory becomes short.
  - in -current, there is a bug freeing memory when memory becomes short.
    I saw this for cached data, not vnodes.  Most memory was "Inact" for
    VMIO, with a very large number of non-free but non-referenced vnodes
    referencing the cached data.  Then the tar programs used to set this
    up get killed by an out-of-memory error (IIRC, the kernel one, not a
    malloc failure).  I don't use swap, but I haven't seen this error for
    10-15 years except in much larger programs.

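To watch these counters while reproducing any of the above, something
like the following userland sketch can be used.  It is only a sketch
(untested as posted): it assumes the vfs.numvnodes, vfs.freevnodes,
vfs.wantfreevnodes and kern.maxvnodes sysctls, the last three of which
appear in the patch below, and it just prints how far numvnodes is from
the 9/10 threshold that vnlru_proc() currently uses.

/*
 * vnwatch.c: poll the vnode counters once per second.  Build with
 * "cc -o vnwatch vnwatch.c".  Error handling is minimal.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

static u_long
get_ulong(const char *name)
{
        u_long val;
        size_t len;

        len = sizeof(val);
        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "%s", name);
        return (val);
}

int
main(void)
{
        u_long numvnodes, freevnodes, wantfreevnodes;
        size_t len;
        int maxvnodes;

        len = sizeof(maxvnodes);
        if (sysctlbyname("kern.maxvnodes", &maxvnodes, &len, NULL, 0) == -1)
                err(1, "kern.maxvnodes");
        for (;;) {
                numvnodes = get_ulong("vfs.numvnodes");
                freevnodes = get_ulong("vfs.freevnodes");
                wantfreevnodes = get_ulong("vfs.wantfreevnodes");
                printf("num %lu free %lu want %lu 9/10-limit %d over-by %ld\n",
                    numvnodes, freevnodes, wantfreevnodes, maxvnodes * 9 / 10,
                    (long)numvnodes - maxvnodes * 9 / 10);
                sleep(1);
        }
}
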
Here is my work in progress:

X diff -u2 vfs_subr.c~ vfs_subr.c
X --- vfs_subr.c~ 2015-09-28 06:29:43.000000000 +0000
X +++ vfs_subr.c  2015-11-04 14:25:42.476369000 +0000
X @@ -147,14 +147,15 @@
X
X /*
X - * Free vnode target.  Free vnodes may simply be files which have been stat'd
X - * but not read.  This is somewhat common, and a small cache of such files
X - * should be kept to avoid recreation costs.
X + * "Free" vnode target.  Free vnodes are rarely actually free.  Usually
X + * they are for files which have been stat'd but not read; these usually
X + * have inode and namecache data attached to them.  A large cache of such
X + * vnodes should be kept to minimise recreation costs.
X  */
X static u_long wantfreevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, "");
X -/* Number of vnodes in the free list. */
X +SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW,
X +    &wantfreevnodes, 0, "Target for minimum number of \"free\" vnodes");
X static u_long freevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0,
X -    "Number of vnodes in the free list");
X +SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD,
X +    &freevnodes, 0, "Number of \"free\" vnodes");
X
X static int vlru_allow_cache_src;

The comments were very out of date: the cache needs to be either large
(thousands) or small (25).  Also improve the descriptions and fix style
bugs in the sysctls.

X @@ -275,9 +276,9 @@
X
X /*
X - * Number of vnodes we want to exist at any one time.  This is mostly used
X - * to size hash tables in vnode-related code.  It is normally not used in
X - * getnewvnode(), as wantfreevnodes is normally nonzero.)
X + * Target for maximum number of vnodes.
X  *
X - * XXX desiredvnodes is historical cruft and should not exist.
X + * XXX "desired" is a not very good historical name for this.  We only
X + * desire to use the full maximum under loads where caching vnodes is
X + * the best use of memory resources.
X  */
X int desiredvnodes;

This one was so out of date that it was almost correct again, except for
the parts saying that it was out of date.

X @@ -292,4 +293,5 @@
X                 return (error);
X         if (old_desiredvnodes != desiredvnodes) {
X +               wantfreevnodes = desiredvnodes / 4;
X                 vfs_hash_changesize(desiredvnodes);
X                 cache_changesize(desiredvnodes);

Keep wantfreevnodes in sync when desiredvnodes changes.  I considered
telling uma the new allocation size here (also in init), but that
shouldn't be needed, and it would need to be done elsewhere if done at
all.  desiredvnodes is a soft limit, so uma should not be limited to
that, and this is too early to reduce.  It would be OK for vnlru_proc()
to tell uma to reduce to the current active allocation.

X @@ -300,7 +302,7 @@
X SYSCTL_PROC(_kern, KERN_MAXVNODES, maxvnodes,
X     CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, &desiredvnodes, 0,
X -    sysctl_update_desiredvnodes, "I", "Maximum number of vnodes");
X +    sysctl_update_desiredvnodes, "I", "Target for maximum number of vnodes");
X SYSCTL_ULONG(_kern, OID_AUTO, minvnodes, CTLFLAG_RW,
X -    &wantfreevnodes, 0, "Minimum number of vnodes (legacy)");
X +    &wantfreevnodes, 0, "Old name for vfs.wantfreevnodes (legacy)");
X static int vnlru_nowhere;
X SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,

X @@ -844,5 +846,5 @@
X
X /*
X - * Attempt to keep the free list at wantfreevnodes length.
X + * Attempt to reduce the free list by the requested amount.
X  */
X static void

X @@ -922,7 +924,31 @@
X                 kproc_suspend_check(p);
X                 mtx_lock(&vnode_free_list_mtx);
X -               if (freevnodes > wantfreevnodes)
X -                       vnlru_free(freevnodes - wantfreevnodes);
X -               if (numvnodes <= desiredvnodes * 9 / 10) {
X +               if (numvnodes > desiredvnodes &&
X +                   freevnodes > wantfreevnodes)
X +                       vnlru_free(ulmin(numvnodes - desiredvnodes,
X +                           freevnodes - wantfreevnodes));

Magic related to the 9/10 factor is not removed by this, but moved below
and changed.  This just lets freevnodes grow larger than wantfreevnodes
if it safely can.

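To put numbers on that, here is a tiny standalone illustration of the
old and new policies (plain userland C with made-up counter values, just
to show the difference in how much gets trimmed; it is not kernel code):

/*
 * Compare the old and new vnlru_proc() trimming policies for one set
 * of made-up counter values.
 */
#include <stdio.h>

static unsigned long
ulmin(unsigned long a, unsigned long b)
{
        return (a < b ? a : b);
}

int
main(void)
{
        unsigned long desiredvnodes = 100000, wantfreevnodes = 25000;
        unsigned long numvnodes = 100050, freevnodes = 30000;

        /* Old policy: always trim the free list back to wantfreevnodes. */
        if (freevnodes > wantfreevnodes)
                printf("old: free %lu vnodes\n", freevnodes - wantfreevnodes);

        /*
         * New policy: trim only when over the total target, and only by
         * enough to get back under it; here min(50, 5000) = 50.
         */
        if (numvnodes > desiredvnodes && freevnodes > wantfreevnodes)
                printf("new: free %lu vnodes\n",
                    ulmin(numvnodes - desiredvnodes,
                    freevnodes - wantfreevnodes));
        return (0);
}

This prints "old: free 5000 vnodes" and "new: free 50 vnodes": the old
code would have trimmed the free list all the way back to wantfreevnodes,
while the new code only trims enough to get numvnodes back under
desiredvnodes.
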
X +               /*
X +                * Sleep if there are not too many allocated vnodes or
X +                * exactly the right number of allocated vnodes but too
X +                * few free vnodes.  When below the target maximum, we
X +                * will prefer to allocate a new vnode to reusing a "free"
X +                * vnode.  When at the target maximum, we prefer to discard
X +                * some non-free vnodes now to cycling through a too-small
X +                * number of free vnodes later.  When above the target
X +                * maximum, we must discard some vnodes now to reduce to
X +                * the maximum.  The discarding here is not smart and tends
X +                * to discard too many vnodes with wrong choices, but it is
X +                * good enough since it is rarely done.  Occasional excessive
X +                * random discarding may even be a feature for evolution.
X +                *
X +                * Old versions did even more discarding here, by allowing
X +                * the target maximum to be reached elsewhere but reducing
X +                * to 9/10 of that here.  I think this was intended to
X +                * reduce activity here, but actually gave the bug of the
X +                * target maximum being 9/10 of its requested value, and
X +                * the evolutionary feature.
X +                */
X +               if (numvnodes < desiredvnodes || (numvnodes == desiredvnodes &&
X +                   freevnodes >= wantfreevnodes)) {
X                         vnlruproc_sig = 0;
X                         wakeup(&vnlruproc_sig);

The comment describes this too verbosely.  I plan to remove the
numvnodes == desiredvnodes subcase here, but to limit numvnodes elsewhere
if necessary to allow room for expansion of freevnodes up to
wantfreevnodes without numvnodes growing too large.

The above gives churning much like before, but only above the full limit
instead of above 9/10 of that, or in the subcase.  The 9/10 limit is
often exceeded, but the new ones rarely are.  The planned version will
make the subcase unreachable.

X @@ -1031,5 +1057,5 @@
X
X /*
X - * Wait for available vnodes.
X + * Wait if necessary for a new vnode to become available.
X  */
X static int

X @@ -1038,12 +1064,11 @@
X
X         mtx_assert(&vnode_free_list_mtx, MA_OWNED);
X -       if (numvnodes > desiredvnodes) {
X +       if (numvnodes >= desiredvnodes) {

Fix off-by-1 error.  This allowed creation of 1 more than the max.  This
bug was masked by the 9/10 magic reducing the actual limit to a smaller,
fuzzier value.  E.g., a limit of 100000 allows 100001 here, but the
actual limit is 90000 in vnlru_proc().  It is still 100001 here, but it
is hard to reach much above 90000 after reducing to less than that.

X                 if (suspended) {
X                         /*
X -                        * File system is beeing suspended, we cannot risk a
X -                        * deadlock here, so allocate new vnode anyway.
X +                        * The file system is being suspended.  We cannot
X +                        * risk a deadlock here, so allow allocation of
X +                        * another vnode even if this would give too many.
X                          */
X -                       if (freevnodes > wantfreevnodes)
X -                               vnlru_free(freevnodes - wantfreevnodes);

Fix spelling and improve wording in the comment.  Remove some code: the
caller can do it, and now does.

X                         return (0);
X                 }
X @@ -1055,5 +1080,5 @@
X                     "vlruwk", hz);
X         }
X -       return (numvnodes > desiredvnodes ? ENFILE : 0);
X +       return (numvnodes >= desiredvnodes ? ENFILE : 0);

Fix off-by-1 error.

X }
X
X @@ -1113,8 +1138,10 @@
X         mtx_lock(&vnode_free_list_mtx);
X         /*
X -        * Lend our context to reclaim vnodes if they've exceeded the max.
X +        * Lend our context to reclaim vnodes if one more would be too many
X +        * and there are plenty to reclaim.
X          */
X -       if (freevnodes > wantfreevnodes)
X -               vnlru_free(1);
X +       if (numvnodes >= desiredvnodes && freevnodes > wantfreevnodes)
X +               vnlru_free(ulmin(numvnodes + 1 - desiredvnodes,
X +                   freevnodes - wantfreevnodes));

Policy change.  This is supposed to be the same as above except for
adjustments to add 1 for the new vnode, but it is buggy.  It should do
the following (sketched in code below):

- if numvnodes < desiredvnodes, don't use the free list
- if numvnodes > desiredvnodes (rare), always use the free list if it is
  nonempty
- if numvnodes == desiredvnodes (usual), use the free list if it is large
  enough (perhaps just nonempty, or the old threshold of 25)

Then, if the free list ends up below wantfreevnodes, depend on the churn
in vnlru_proc() to grow it.  Since the condition for waiting (with the
old off-by-1 error or the above bugs) rarely occurs, the churn normally
only occurs once per second.  That limits its damage but also limits the
rate of growing the free list.

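In code shape, that policy looks something like the following.  This is
only a standalone sketch (plain userland C so it can be compiled and
poked at, not a patch; the function name is made up, and the 25 is the
old historical threshold mentioned above):

/*
 * Sketch of the proposed getnewvnode() decision: return nonzero when a
 * vnode should be recycled from the free list instead of allocating a
 * new one.
 */
#include <stdio.h>

static int
should_use_free_list(unsigned long numvnodes, unsigned long freevnodes,
    unsigned long desiredvnodes)
{
        if (numvnodes < desiredvnodes)
                return (0);             /* room to grow: allocate a new vnode */
        if (numvnodes > desiredvnodes)
                return (freevnodes > 0);        /* rare: recycle if possible */
        return (freevnodes > 25);       /* usual: recycle only if worthwhile */
}

int
main(void)
{
        /* A few made-up cases showing the three branches. */
        printf("%d\n", should_use_free_list(90000, 30000, 100000));  /* 0 */
        printf("%d\n", should_use_free_list(100001, 10, 100000));    /* 1 */
        printf("%d\n", should_use_free_list(100000, 10, 100000));    /* 0 */
        printf("%d\n", should_use_free_list(100000, 30000, 100000)); /* 1 */
        return (0);
}
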
Planned version:

- if numvnodes - freevnodes >= desiredvnodes - wantfreevnodes, then the
  cache is becoming unbalanced with low chances of recovery (it has too
  many non-free vnodes and not enough space to add free ones).  E.g.,
  with desiredvnodes = 100000 and wantfreevnodes = 25000, this triggers
  once there are 75000 non-free vnodes, however large the free list is.
  Then wait more often here, as in the above numvnodes == desiredvnodes
  case.

Whenever we wait, we wake up vnlru_proc().  We must do this more often
than before to grow the free list, but not too often.  It does the right
things, except for also shrinking the free list.

X         error = getnewvnode_wait(mp != NULL && (mp->mnt_kern_flag &
X             MNTK_SUSPEND));

This doesn't really fix the original problem.  du -s now works better
starting with a cold cache, by allowing freevnodes to grow much larger
than wantfreevnodes while numvnodes is not near its limit.  But when the
cache is warm, numvnodes is near its limit and freevnodes may grow small.
Then the modifications attempt to grow freevnodes back up to
wantfreevnodes, but it would be unreasonable to discard many non-free
vnodes to grow freevnodes larger than that.  The churning might do that
accidentally, after perhaps too long, if the usage pattern changes from
mostly reads to mostly stats.

Bruce