From owner-freebsd-fs@freebsd.org  Sun Nov 15 21:27:01 2015
Date: Mon, 16 Nov 2015 07:59:48 +1100 (EST)
From: Bruce Evans
To: Bruce Evans
cc: Konstantin Belousov, Kirk McKusick, fs@freebsd.org
Subject: fixing the vnode cache (was: Re: an easy (?) question on namecache sizing)
In-Reply-To: <20151105043607.K3175@besplex.bde.org>
Message-ID: <20151116035721.H1540@besplex.bde.org>
References: <20151102224910.E2203@besplex.bde.org> <201511030447.tA34lo5O090332@chez.mckusick.com> <20151103090448.GC2257@kib.kiev.ua> <20151105043607.K3175@besplex.bde.org>

On Thu, 5 Nov 2015, Bruce Evans wrote:

> ...
> Here is my work in progress:
> ...

Here is my work sort of finished.  It was a lot of work, and fixes many
bugs, but many fundamental bugs remain to be fixed -- the LRU ordering
has been broken since about FreeBSD-3, and new and old fixes are mostly
complicated messes to work around that.

The main changes here are:
- fix watermarks
- update comments
- add comments

X diff -u2 ../../kern/vfs_subr.c~ ../../kern/vfs_subr.c
X --- ../../kern/vfs_subr.c~	2015-09-28 06:29:43.000000000 +0000
X +++ ../../kern/vfs_subr.c	2015-11-15 02:12:51.170795000 +0000
X @@ -98,4 +98,7 @@
X  #endif
X 
X +volatile static int vlru_verbose = 1;
X +#define	DPRINTF(...) do { if (vlru_verbose) printf(__VA_ARGS__); } while (0)
X +
X  static void	delmntque(struct vnode *vp);
X  static int	flushbuflist(struct bufv *bufv, int flags, struct bufobj *bo,

This patch has too much debugging code for production use or commit, but
if you try it then keep the debugging messages on to observe how
infrequently reclamation is done and perhaps to see what it does.
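The vlru_verbose toggle can only be flipped from ddb as written.  A
minimal sketch of exposing it as a sysctl instead (the knob name
vfs.vlru_verbose is my invention, not part of the patch; __DEVOLATILE()
is needed because the variable is volatile):

	SYSCTL_INT(_vfs, OID_AUTO, vlru_verbose, CTLFLAG_RW,
	    __DEVOLATILE(int *, &vlru_verbose), 0,
	    "Enable vnlru reclamation debug printfs");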
X @@ -147,22 +150,36 @@
X 
X  /*
X - * Free vnode target.  Free vnodes may simply be files which have been stat'd
X - * but not read.  This is somewhat common, and a small cache of such files
X - * should be kept to avoid recreation costs.
X + * "Free" vnode target.  Free vnodes are rarely completely free, but are
X + * just ones that are cheap to recycle.  Usually they are for files which
X + * have been stat'd but not read; these usually have inode and namecache
X + * data attached to them.  This target is the preferred minimum size of a
X + * sub-cache consisting mostly of such files.  The system balances the size
X + * of this sub-cache with its complement to try to prevent either from
X + * thrashing while the other is relatively inactive.  The targets express
X + * a preference for the best balance.
X + *
X + * "Above" this target there are 2 further targets (watermarks) related
X + * to recycling of free vnodes.  In the best-operating case, the cache is
X + * exactly full, the free list has size between vlowat and vhiwat above the
X + * free target, and recycling from it and normal use maintains this state.
X + * Sometimes the free list is below vlowat or even empty, but this state
X + * is even better for immediate use provided the cache is not full.
X + * Otherwise, vnlru_proc() runs to reclaim enough vnodes (usually non-free
X + * ones) to reach one of these states.  The watermarks are currently hard-
X + * coded as 4% and 9% of the available space higher.  These and the default
X + * of 25% for wantfreevnodes are too large if the memory size is large.
X + * E.g., 9% of 75% of MAXVNODES is more than 566000 vnodes to reclaim
X + * whenever vnlru_proc() becomes active.
X  */

The comment explains lots of what this patch does, but not much of the
history that led to this mess.

In 4.4BSD-Lite*, the free list contained all inactive vnodes and I think
also some really free ones (not pointing to anything, but holding unused
malloced memory).  getnewvnode() reclaimed directly.  desiredvnodes
existed and wantfreevnodes didn't exist, but the cache size was allowed
to grow in emergency up to size 2*desiredvnodes and without reclamation
up to size desiredvnodes.  So it sort of had actual size 2*desiredvnodes,
and wantfreevnodes was spelled desiredvnodes and gave its actual size.
LRU worked perfectly except in cases where the reclamation had to skip
a locked vnode.

In FreeBSD-3, wantfreevnodes existed and defaulted to 25.  It could be
set to 0 to get the old behaviour for freevnodes <= numvnodes, but growth
above desiredvnodes up to 2*desiredvnodes was more strongly restricted
(perhaps disallowed).  I don't see how the small default of 25 could have
been right even in 1993 before it was added, when memory and cache sizes
were smaller.  Reclamation was still done directly in getnewvnode(), so
LRU order was still usually used when reclamation was done.  However,
FreeBSD-3 hangs on to vnodes more strongly than 4.4BSD, so the free list
tends to fill up with old unused garbage.  Collection of this garbage is
still inadequate.

wantfreevnodes had no comment on it in FreeBSD-3.  The comment on it that
was fixed above was added later, and it was wrong even for FreeBSD-3.  In
all versions, the reason for "want"ing this number of vnodes seems to be
to avoid thrashing through a cache of even smaller size, and has nothing
to do with reclamation costs like the comment says.  The FreeBSD-3 default
of 25 looks too small even for a watermark delta, but really was used for
the minimum cache size.  And it really asked for a negative usable size,
since the cache usually soon fills up with more than 25 unreclaimable
garbage vnodes.
In FreeBSD-4, reclamation is sometimes done (hopefully in advance) by
vnlru_proc(), much like in -current but with differently broken
watermarks.  Held vnodes are no longer kept on the free list or reclaimed
directly by getnewvnode().  Held vnodes are not kept in LRU order.  This
breaks LRU order for held vnodes.  Held vnodes are reclaimed by freeing
everything attached and putting them on the free list.  LRU order for
them is honored only after that.  wantfreevnodes is still 25, but it is
not as broken as before since the unreclaimable garbage vnodes are not
counted in freevnodes.

In later versions ending in -current, wantfreevnodes is increased to a
reasonably large value.  Held vnodes are reclaimed by freeing everything
including themselves and not putting them on the free list.  The
brokenness in the watermarks was moved by misadjusting for the
differences in counts caused by this.

The default free list levels in this patch are min = 25%, low = 28%,
high = 32% and max = 100% (growing above high is uninhibited and growing
below low is inhibited).  The corresponding levels in -current are sort
of: min = 25%, low1 = 90% (in vnlru_proc()), low2 = 100% (in
getnewvnode()), high = 100% and max = 25% (growing in either direction
from min = max is inhibited, and low* and high are nonsense).  low* and
high are actually watermarks on the full cache and only affect the free
list and easily reclaimable space indirectly.  They are wrong for that
too -- 100% for getnewvnode() delays the reclaiming until it has no
chance of completing without waiting, and 90% for vnlru_proc() is
inconsistent with this.  The inconsistency is necessary for things to
mostly work OK by reclaiming enough in advance, at a cost of wasted
space and time.

X  static u_long wantfreevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, "");
X -/* Number of vnodes in the free list. */
X +SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW,
X +    &wantfreevnodes, 0, "Target for minimum number of \"free\" vnodes");
X  static u_long freevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0,
X -    "Number of vnodes in the free list");
X -
X -static int vlru_allow_cache_src;
X -SYSCTL_INT(_vfs, OID_AUTO, vlru_allow_cache_src, CTLFLAG_RW,
X -    &vlru_allow_cache_src, 0, "Allow vlru to reclaim source vnode");
X +SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD,
X +    &freevnodes, 0, "Number of \"free\" vnodes");
X 
X  static u_long recycles_count;
X  SYSCTL_ULONG(_vfs, OID_AUTO, recycles, CTLFLAG_RD, &recycles_count, 0,
X -    "Number of vnodes recycled to avoid exceding kern.maxvnodes");
X +    "Number of vnodes recycled to meet vnode cache targets");
X 
X  /*

Clean up names and descriptions in sysctls a bit too.

X @@ -274,12 +291,11 @@
X  	syncer_state;
X 
X -/*
X - * Number of vnodes we want to exist at any one time.  This is mostly used
X - * to size hash tables in vnode-related code.  It is normally not used in
X - * getnewvnode(), as wantfreevnodes is normally nonzero.)
X - *
X - * XXX desiredvnodes is historical cruft and should not exist.
X - */
X +/* Target for maximum number of vnodes. */
X  int desiredvnodes;

The code has been churned so much that this variable was already almost
correct again, but its comments were wrong.
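To make the new levels concrete, here is a minimal userland sketch of
the watermark arithmetic, using the formulas from vspace() further down
in the patch (the standalone program and example numbers are mine):

	#include <stdio.h>

	int
	main(void)
	{
		/* The new MAXVNODES_MAX cap from below: 8M vnodes. */
		long desiredvnodes = 8 * 1024 * 1024;
		long wantfreevnodes = desiredvnodes / 4;	 /* 25% */
		long gapvnodes = desiredvnodes - wantfreevnodes; /* 75% */
		long vhiwat = gapvnodes / 11;			 /* ~9% */
		long vlowat = vhiwat / 2;			 /* ~4% */

		/* Prints gap 6291456 hiwat 571950 lowat 285975, i.e.,
		 * the "more than 566000 vnodes to reclaim" above. */
		printf("gap %ld hiwat %ld lowat %ld\n",
		    gapvnodes, vhiwat, vlowat);
		return (0);
	}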
X +static int gapvnodes;		/* gap between wanted and desired */
X +static int vhiwat;		/* enough extras after expansion */
X +static int vlowat;		/* minimal extras before expansion */
X +static int vstir;		/* nonzero to stir non-free vnodes */
X +static volatile int vsmalltrigger = 8;	/* pref to keep if > this many pages */
X 
X  static int
X @@ -292,4 +308,6 @@
X  		return (error);
X  	if (old_desiredvnodes != desiredvnodes) {
X +		wantfreevnodes = desiredvnodes / 4;
X +		/* XXX locking seems to be incomplete. */
X  		vfs_hash_changesize(desiredvnodes);
X  		cache_changesize(desiredvnodes);

There seems to be only Giant locking for the sysctl.  This is more than
enough for *vnodes since we allow for them being garbage, but vfs_hash
and vfs_cache need more.

X @@ -300,7 +318,7 @@
X  SYSCTL_PROC(_kern, KERN_MAXVNODES, maxvnodes,
X      CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, &desiredvnodes, 0,
X -    sysctl_update_desiredvnodes, "I", "Maximum number of vnodes");
X +    sysctl_update_desiredvnodes, "I", "Target for maximum number of vnodes");
X  SYSCTL_ULONG(_kern, OID_AUTO, minvnodes, CTLFLAG_RW,
X -    &wantfreevnodes, 0, "Minimum number of vnodes (legacy)");
X +    &wantfreevnodes, 0, "Old name for vfs.wantfreevnodes (legacy)");
X  static int vnlru_nowhere;
X  SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,
X @@ -310,4 +328,16 @@
X  static int vnsz2log;
X 
X +/* Statistics/debugging. */
X +static volatile int vdry;
X +static int vexamined;
X +static int vdir;
X +static int vobj;
X +static int vinuse;
X +static int vcache;
X +static int vfree;
X +static int vdoom;
X +static int vbig;
X +static int vskip;
X +
X  /*
X   * Support for the bufobj clean & dirty pctrie.

vdry can be set using ddb to 1, 2, or these values ORed with 4 or 8, to
do a dry reclaim run.

X @@ -333,8 +363,8 @@
X   * Reevaluate the following cap on the number of vnodes after the physical
X   * memory size exceeds 512GB.  In the limit, as the physical memory size
X - * grows, the ratio of physical pages to vnodes approaches sixteen to one.
X + * grows, the ratio of the memory size in KB to vnodes approaches 64:1.
X   */
X  #ifndef	MAXVNODES_MAX
X -#define	MAXVNODES_MAX	(512 * (1024 * 1024 * 1024 / (int)PAGE_SIZE / 16))
X +#define	MAXVNODES_MAX	(512 * 1024 * 1024 / 64)	/* 8M */
X  #endif
X  static void
X @@ -347,13 +377,14 @@
X  	 * Desiredvnodes is a function of the physical memory size and the
X  	 * kernel's heap size.  Generally speaking, it scales with the
X -	 * physical memory size.  The ratio of desiredvnodes to physical pages
X -	 * is one to four until desiredvnodes exceeds 98,304.  Thereafter, the
X -	 * marginal ratio of desiredvnodes to physical pages is one to
X -	 * sixteen.  However, desiredvnodes is limited by the kernel's heap
X +	 * physical memory size.  The ratio of desiredvnodes to the physical
X +	 * memory size is 1:16 until desiredvnodes exceeds 98,304.  Thereafter,
X +	 * the marginal ratio of desiredvnodes to the physical memory size is
X +	 * 1:64.  However, desiredvnodes is limited by the kernel's heap
X  	 * size.  The memory required by desiredvnodes vnodes and vm objects
X -	 * may not exceed one seventh of the kernel's heap size.
X +	 * must not exceed 1/7th of the kernel's heap size.
X  	 */
X -	physvnodes = maxproc + vm_cnt.v_page_count / 16 + 3 * min(98304 * 4,
X -	    vm_cnt.v_page_count) / 16;
X +	physvnodes = maxproc + pgtok(vm_cnt.v_page_count) / 64 +
X +	    3 * min(98304 * 16, pgtok(vm_cnt.v_page_count)) / 64;
X  	virtvnodes = vm_kmem_size / (7 * (sizeof(struct vm_object) +
X  	    sizeof(struct vnode)));

Start fixing defaults.  The comments are still too verbose.

The main change here is from sizes in pages to sizes in K.  Pages don't
scale correctly.  E.g., 8K-pages gave half as many vnodes as 4K-pages,
but the best number of vnodes is unrelated to the page size unless the
memory size is very small.  I only looked at this because the trigger
point depends magically on the scaling here.
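As a sanity check of the new formula, the arithmetic for a hypothetical
16GB machine (the memory size and maxproc = 10000 are my assumptions):

	/*
	 * pgtok(v_page_count) = 16777216 KB, so:
	 *
	 * physvnodes = 10000 + 16777216 / 64
	 *     + 3 * min(98304 * 16, 16777216) / 64
	 *            = 10000 + 262144 + 73728 = 345872
	 *
	 * With 4K pages the old page-based formula gives exactly the
	 * same numbers; the point of the KB-based formula is that 8K
	 * pages no longer halve the result.
	 */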
X @@ -744,28 +775,16 @@
X   */
X  static int
X -vlrureclaim(struct mount *mp)
X +vlrureclaim(struct mount *mp, int reclaim_nc_src, int trigger)
X  {
X  	struct vnode *vp;
X -	int done;
X -	int trigger;
X -	int usevnodes;
X -	int count;
X +	int count, done, target;
X 
X -	/*
X -	 * Calculate the trigger point, don't allow user
X -	 * screwups to blow us up.  This prevents us from
X -	 * recycling vnodes with lots of resident pages.  We
X -	 * aren't trying to free memory, we are trying to
X -	 * free vnodes.
X -	 */
X -	usevnodes = desiredvnodes;
X -	if (usevnodes <= 0)
X -		usevnodes = 1;
X -	trigger = vm_cnt.v_page_count * 2 / usevnodes;

Calculations mostly moved to caller and fixed.

X  	done = 0;
X  	vn_start_write(NULL, &mp, V_WAIT);
X  	MNT_ILOCK(mp);
X -	count = mp->mnt_nvnodelistsize / 10 + 1;
X -	while (count != 0) {
X +	count = mp->mnt_nvnodelistsize;
X +	target = count * (int64_t)gapvnodes / imax(desiredvnodes, 1);
X +	target = target / 10 + 1;
X +	while (count != 0 && done < target) {
X  		vp = TAILQ_FIRST(&mp->mnt_nvnodelist);
X  		while (vp != NULL && vp->v_type == VMARKER)

'count' is now scaled by gapvnodes/desiredvnodes.  This ratio is normally
not much below 1, but configuring wantfreevnodes much higher is supposed
to work well now, and this gives smaller ratios (down to about 1/10 is
OK).  When most vnodes are free, there is no chance of getting 10% of
numvnodes from non-free ones here.  Getting free ones here was harmful
and is no longer done.

The most important fix here is to not stop after looking at just 'count'
vnodes.  Often the cache consists of mostly unreclaimable vnodes, so 10%
can't be reclaimed even if you search to the end, or none can be
reclaimed if you hit a block of unreclaimable ones and stop early.  This
caused the "tick" behaviour of 1 vnode creation per 1 or 3 seconds more
often than necessary.  So the search now goes approximately to the end.
This may take a lot of CPU, but only when necessary.  In normal use, the
search length is extended by 10-100%, rarely by 1000%, and more rarely
to the end.

X @@ -773,4 +792,22 @@
X  		if (vp == NULL)
X  			break;
X +		/*
X +		 * XXX LRU is completely broken for non-free vnodes.  First
X +		 * by calling here in mountpoint order, then by moving
X +		 * unselected vnodes to the end here, and most grossly by
X +		 * removing the vlruvp() function that was supposed to
X +		 * maintain the order.  (This function was born broken
X +		 * since syncer problems prevented it doing anything.)  The
X +		 * order is closer to LRC (C = Created).
X +		 *
X +		 * LRU reclaiming of vnodes seems to have last worked in
X +		 * FreeBSD-3 where LRU wasn't mentioned under any spelling.
X +		 * Then there was no hold count, and inactive vnodes were
X +		 * simply put on the free list in LRU order.  The separate
X +		 * lists also break LRU.  We prefer to reclaim from the
X +		 * free list for technical reasons.  This tends to thrash
X +		 * the free list to keep very unrecently used held vnodes.
X +		 * The problem is mitigated by keeping the free list large.
X +		 */
X  		TAILQ_REMOVE(&mp->mnt_nvnodelist, vp, v_nmntvnodes);
X  		TAILQ_INSERT_TAIL(&mp->mnt_nvnodelist, vp, v_nmntvnodes);

This is wrong about there being no hold count in FreeBSD-3.

I don't see any reason why this can't be or wasn't fixed by simply adding
another list of all vnodes in LRU order.  This function can use that list
and the syncer can keep using the old list.  This would fix the LRU
problem, leaving only the problem of old garbage clogging up the non-free
part and being almost unlimited in time and space.
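A minimal sketch of that fix (the list, its lock usage and the
v_lrulist linkage are my inventions, not part of the patch):

	/* A second list of all vnodes, kept in true LRU order. */
	static TAILQ_HEAD(, vnode) vnode_lru_list =
	    TAILQ_HEAD_INITIALIZER(vnode_lru_list);

	/*
	 * Called on each use (e.g., from vget()).  Vnodes would be
	 * inserted at creation in getnewvnode() and removed in
	 * delmntque().  v_lrulist is a new TAILQ_ENTRY(vnode) field.
	 */
	static void
	vlru_touch(struct vnode *vp)
	{

		mtx_lock(&vnode_free_list_mtx);
		TAILQ_REMOVE(&vnode_lru_list, vp, v_lrulist);
		TAILQ_INSERT_TAIL(&vnode_lru_list, vp, v_lrulist);
		mtx_unlock(&vnode_free_list_mtx);
	}

vlrureclaim() would then walk vnode_lru_list oldest-first instead of the
per-mount lists, and the syncer would keep using those.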
X @@ -781,10 +818,35 @@
X  		 * If it's been deconstructed already, it's still
X  		 * referenced, or it exceeds the trigger, skip it.
X +		 * XXX reword and update above and merge with the following.
X +		 * Skip free vnodes.  We are trying to make space to expand
X +		 * the free list, not reduce it.
X  		 */

Reclaiming from the free list here didn't create any useful space, but
broke the LRU order for the free list part where it still works.

X +		if (vdry & 1 && (vp->v_iflag & VI_FREE) == 0)
X +			goto out;
X +		if (vdry & 2 && (vp->v_iflag & VI_FREE) != 0)
X +			goto out;
X +		vexamined++;
X +		if (vp->v_type == VDIR)
X +			vdir++;
X +		if (vp->v_object != NULL)
X +			vobj++;
X +		if (vp->v_usecount != 0)
X +			vinuse++;
X +		else if (!reclaim_nc_src && !LIST_EMPTY(&vp->v_cache_src))
X +			vcache++;
X +		else if ((vp->v_iflag & VI_FREE) != 0)
X +			vfree++;
X +		else if ((vp->v_iflag & VI_DOOMED) != 0)
X +			vdoom++;
X +		else if (vp->v_object != NULL &&
X +		    vp->v_object->resident_page_count > trigger)
X +			vbig++;
X +out:

Lots of debugging code.

X  		if (vp->v_usecount ||
X -		    (!vlru_allow_cache_src &&
X -		    !LIST_EMPTY(&(vp)->v_cache_src)) ||
X +		    (!reclaim_nc_src && !LIST_EMPTY(&vp->v_cache_src)) ||
X +		    ((vp->v_iflag & VI_FREE) != 0) ||
X  		    (vp->v_iflag & VI_DOOMED) != 0 || (vp->v_object != NULL &&
X -		    vp->v_object->resident_page_count > trigger)) {
X +		    vp->v_object->resident_page_count > trigger) ||
X +		    vdry) {
X  			VI_UNLOCK(vp);
X  			goto next_iter;

The only other changes in this function are:
- vlru_allow_cache_src is now a parameter (usually 0, and its sysctl is
  no longer supported)
- never reclaim free vnodes.

X @@ -810,8 +872,9 @@
X  		 */
X  		if (vp->v_usecount ||
X -		    (!vlru_allow_cache_src &&
X -		    !LIST_EMPTY(&(vp)->v_cache_src)) ||
X +		    (!reclaim_nc_src && !LIST_EMPTY(&vp->v_cache_src)) ||
X +		    (vp->v_iflag & VI_FREE) != 0 ||
X  		    (vp->v_object != NULL &&
X  		    vp->v_object->resident_page_count > trigger)) {
X +			vskip++;
X  			VOP_UNLOCK(vp, LK_INTERLOCK);
X  			vdrop(vp);
X @@ -844,5 +907,5 @@
X 
X  /*
X - * Attempt to keep the free list at wantfreevnodes length.
X + * Attempt to reduce the free list by the requested amount.
X  */
X  static void
X @@ -901,4 +964,22 @@
X  	}
X  }
X +
X +/* XXX some names and initialization are bad for limits and watermarks. */
X +static int
X +vspace(void)
X +{
X +	int space;
X +
X +	gapvnodes = imax(desiredvnodes - wantfreevnodes, 100);
X +	vhiwat = gapvnodes / 11; /* 9% -- just under the 10% in vlrureclaim() */
X +	vlowat = vhiwat / 2;
X +	if (numvnodes > desiredvnodes)
X +		return (0);
X +	space = desiredvnodes - numvnodes;
X +	if (freevnodes > wantfreevnodes)
X +		space += freevnodes - wantfreevnodes;
X +	return (space);
X +}
X +

The easily reclaimable space is:
- too many vnodes -- no space
- between numvnodes and desiredvnodes -- create a new vnode
- just enough vnodes and plenty of free ones -- reclaim a free one
- else no space.

X  /*
X   * Attempt to recycle vnodes in a context that is always safe to block.
X @@ -913,16 +994,35 @@
X  {
X  	struct mount *mp, *nmp;
X -	int done;
X  	struct proc *p = vnlruproc;
X +	unsigned long ofreevnodes, onumvnodes;
X +	int done, force, reclaim_nc_src, trigger, usevnodes;

Start fixing bogus unsigned/long types.  int counters never worked with
vnode counts that actually needed to be unsigned/long.  The type of
desiredvnodes was always correct (int).
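A worked example of vspace() with the defaults (the numbers are mine):
with desiredvnodes = 100000 and wantfreevnodes = 25000, vspace() sets
gapvnodes = 75000, vhiwat = 6818 (~9%) and vlowat = 3409 (~4%).  If
numvnodes = 98000 and freevnodes = 27000, then

	space = (100000 - 98000) + (27000 - 25000) = 4000

which is above vlowat, so vnlru_proc() keeps sleeping; only when the
easily reclaimable space drops below 3409 is it woken, and it then works
until the space is back above vhiwat.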
X 
X  	EVENTHANDLER_REGISTER(shutdown_pre_sync, kproc_shutdown, p,
X  	    SHUTDOWN_PRI_FIRST);
X 
X +	force = 0;
X  	for (;;) {
X  		kproc_suspend_check(p);
X  		mtx_lock(&vnode_free_list_mtx);
X -		if (freevnodes > wantfreevnodes)
X -			vnlru_free(freevnodes - wantfreevnodes);
X -		if (numvnodes <= desiredvnodes * 9 / 10) {
X +		/*
X +		 * If numvnodes is too large (due to desiredvnodes being
X +		 * adjusted using its sysctl, or emergency growth), first
X +		 * try to reduce it by discarding from the free list.
X +		 */
X +		if (numvnodes > desiredvnodes && freevnodes > 0)
X +			vnlru_free(ulmin(numvnodes - desiredvnodes,
X +			    freevnodes));
X +		/*
X +		 * Sleep if the vnode cache is in a good state.  This is
X +		 * when it is not over-full and has space for about a 4%
X +		 * or 9% expansion (by growing its size or inexcessively
X +		 * reducing its free list).  Otherwise, try to reclaim
X +		 * space for a 10% expansion.
X +		 */
X +		if (vstir && force == 0) {
X +			force = 1;
X +			vstir = 0;
X +		}
X +		if (vspace() >= vlowat && force == 0 && vdry == 0) {
X  			vnlruproc_sig = 0;
X  			wakeup(&vnlruproc_sig);

Hopefully the comments are verbose enough to explain this.  It is
standard watermark stuff, done non-magically.

X @@ -933,4 +1033,40 @@
X  		mtx_unlock(&vnode_free_list_mtx);
X  		done = 0;
X +		vexamined = 0;
X +		vdir = 0;
X +		vobj = 0;
X +		vinuse = 0;
X +		vcache = 0;
X +		vfree = 0;
X +		vdoom = 0;
X +		vbig = 0;
X +		vskip = 0;
X +		ofreevnodes = freevnodes;
X +		onumvnodes = numvnodes;

Debugging.

X +		/*
X +		 * Calculate parameters for recycling.  These are the same
X +		 * throughout the loop to give some semblance of fairness.
X +		 * The trigger point is to avoid recycling vnodes with lots
X +		 * of resident pages.  We aren't trying to free memory; we
X +		 * are trying to recycle or at least free vnodes.
X +		 */
X +		if (numvnodes <= desiredvnodes)
X +			usevnodes = numvnodes - freevnodes;
X +		else
X +			usevnodes = numvnodes;
X +		if (usevnodes <= 0)
X +			usevnodes = 1;
X +		/*
X +		 * The trigger value is chosen to give a conservatively
X +		 * large value to ensure that it alone doesn't prevent
X +		 * making progress.  The value can easily be so large that
X +		 * it is effectively infinite in some congested and
X +		 * misconfigured cases, and this is necessary.  Normally
X +		 * it is about 8 to 100 (pages), which is quite large.
X +		 */
X +		trigger = vm_cnt.v_page_count * 2 / usevnodes;
X +		if (force < 2 && (vdry & 4) == 0)
X +			trigger = vsmalltrigger;

In the old version, the same basic formula for 'trigger' ends up being a
fancy spelling of between 8 and 32 (pages) with the default desiredvnodes.
This is because the default desiredvnodes is v_page_count scaled by
between 1/4 and 1/16.  The formula here inverts this and multiplies by 2.
8 4K-pages was about right, but rarely happens now that memory sizes are
larger.  This 8 is hard-coded in the default for vsmalltrigger.

The old scaling makes some sense.  If all vnodes were used and had
'trigger' pages, then they would have twice as many pages as possible.
So the limit prevents more than half of the vnode cache being clogged up
with vnodes too big to be reclaimed.  But this barely works even in
configurations close to the default.  1/4 of vnodes are now normally
free, and reclaiming now starts at 9/10 full instead of full.  So where
the safety margin was about 1/2, it is now 1/2 - 1/4 - 1/10 = 3/20.  My
change fixes the scaling so that it gives 1/2 again.

Configurations not close to the default were more broken.  Suppose
desiredvnodes is 100000 and someone changes it to 1000.  This explodes
'trigger' by a factor of 100.  And this huge factor is needed to ensure
that complete clogging by large vnodes is not possible.  Workarounds are
described below.
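A worked example of the trigger scaling (the numbers are mine): on the
16GB machine from above, v_page_count = 4194304 and desiredvnodes ~=
345872, so with the cache full and 25% of the vnodes free,

	usevnodes = 345872 - 86468 = 259404
	trigger   = 4194304 * 2 / 259404 ~= 32 pages

Non-emergency passes clamp this to vsmalltrigger (8 pages); only a
forced pass (force >= 2) uses the calculated value.  After setting
desiredvnodes to 1000, the calculated value approaches
4194304 * 2 / 1000 ~= 8389 pages -- effectively infinite, which is the
explosion described above.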
X +		reclaim_nc_src = (force >= 3 || (vdry != 0 && (vdry & 8) == 0));
X  		mtx_lock(&mountlist_mtx);
X  		for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) {
X @@ -939,11 +1075,33 @@
X  				continue;
X  			}
X -			done += vlrureclaim(mp);
X +			done += vlrureclaim(mp, reclaim_nc_src, trigger);
X  			mtx_lock(&mountlist_mtx);
X  			nmp = TAILQ_NEXT(mp, mnt_list);
X  			vfs_unbusy(mp);
X  		}
X +		DPRINTF("targ %d done %d num %lu -> %lu free %lu -> %lu %s\n",
X +		    gapvnodes / 10, done,
X +		    onumvnodes, numvnodes, ofreevnodes, freevnodes,
X +		    vspace() >= vhiwat ? "complete" :
X +		    done != 0 ? "incomplete" : "fail");
X +		DPRINTF(
X +    "exam %d dir %d obj %d inuse %d cache %d free %d doom %d big %d skip %d\n",
X +		    vexamined, vdir, vobj, vinuse, vcache, vfree, vdoom, vbig,
X +		    vskip);

Debugging.

X  		mtx_unlock(&mountlist_mtx);
X -		if (done == 0) {
X +		if (onumvnodes > desiredvnodes && numvnodes <= desiredvnodes)
X +			uma_reclaim();

This hack works well for reclaiming from uma after reducing desiredvnodes.

X +		if (done == 0 && vdry == 0) {
X +			if (force == 0 || force == 1) {
X +				force = 2;
X +				printf("vnlru forcing trigger\n");
X +				continue;
X +			}
X +			if (force == 2) {
X +				force = 3;
X +				printf("vnlru forcing namecache src\n");
X +				continue;
X +			}
X +			force = 0;

This makes several passes: only use the big calculated value for trigger
in emergency.  Similarly, force reclaiming of all cache_src cloggage in
emergency.  There should be another pass with an infinite trigger.

X  #if 0
X  			/* These messages are temporary debugging aids */
X @@ -953,8 +1111,17 @@
X  				printf("vnlru process messages stopped.\n");
X  #endif
X +			printf("vnlru process getting nowhere\n");

This is supposed to be unreachable unless usecount > 0 for all vnodes.
Set desiredvnodes to 1 to see this.  BTW, setting desiredvnodes to 0
used to be robust (one of the imax's above is to convert this 0 to 1),
but now it panics in namecache reinitialization.

X  			vnlru_nowhere++;
X  			tsleep(vnlruproc, PPAUSE, "vlrup", hz * 3);
X  		} else
X  			kern_yield(PRI_USER);
X +		/*
X +		 * After becoming active to expand above low water, keep
X +		 * active until above high water.
X +		 */
X +		force = (vspace() < vhiwat && vdry == 0 ? 1 : 0);
X +		if (force != 0)
X +			printf("vnlru process retrying\n");
X +		vdry = 0;

force = 1 is a normal retry case (when we grew the space a little but
not enough).  In weird configurations/loads, it is possible to grow by
a lot in tiny steps, with each pass scanning hundreds of thousands of
vnodes.  The defaults are supposed to limit the number of steps to 2.

X  	}
X  }
X @@ -1030,6 +1197,16 @@
X  }
X 
X +static void
X +vcheckspace(void)
X +{
X +
X +	if (vspace() < vlowat && vnlruproc_sig == 0) {
X +		vnlruproc_sig = 1;
X +		wakeup(vnlruproc);
X +	}
X +}
X +
X  /*
X - * Wait for available vnodes.
X + * Wait if necessary for space for a new vnode.
X   */
X  static int
X @@ -1038,12 +1215,11 @@
X 
X  	mtx_assert(&vnode_free_list_mtx, MA_OWNED);
X -	if (numvnodes > desiredvnodes) {
X +	if (numvnodes >= desiredvnodes) {

This is getnewvnode_wait().  The changes in it are mostly to fix the
watermark.  Actually waiting is rare before and after.  The 9/10
watermark elsewhere was not as broken as the 10/10 watermark here.  The
comparison in the above was not exactly an off-by-1 error: it allowed
numvnodes to grow 1 too large, but the algorithm depended on this.
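To summarize the division of labour after the patch (my summary, not
text from it):

	/*
	 * getnewvnode():      grow the cache if numvnodes < desiredvnodes,
	 *                     else reclaim 1 free vnode, else wait; always
	 *                     call vcheckspace().
	 * vcheckspace():      wake vnlru_proc() when vspace() < vlowat.
	 * vnlru_proc():       reclaim (usually non-free) vnodes until
	 *                     vspace() >= vhiwat, escalating 'force' on
	 *                     passes that make no progress.
	 * getnewvnode_wait(): last resort; sleep until there is space or
	 *                     return ENFILE at numvnodes >= desiredvnodes.
	 */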
X  		if (suspended) {
X  			/*
X -			 * File system is beeing suspended, we cannot risk a
X -			 * deadlock here, so allocate new vnode anyway.
X +			 * The file system is being suspended.  We cannot
X +			 * risk a deadlock here, so allow allocation of
X +			 * another vnode even if this would give too many.
X  			 */
X -			if (freevnodes > wantfreevnodes)
X -				vnlru_free(freevnodes - wantfreevnodes);

Freeing like this (but more correct) is now done in callers.

X  			return (0);
X  		}
X @@ -1052,10 +1228,18 @@
X  			wakeup(vnlruproc);
X  		}
X +		DPRINTF("getnewvnode_wait() actually failed\n");
X  		msleep(&vnlruproc_sig, &vnode_free_list_mtx, PVFS,
X  		    "vlruwk", hz);
X  	}
X -	return (numvnodes > desiredvnodes ? ENFILE : 0);
X +	/* Post-adjust like the pre-adjust in getnewvnode(). */
X +	if (numvnodes + 1 > desiredvnodes && freevnodes > 1)
X +		vnlru_free(1);
X +	return (numvnodes >= desiredvnodes ? ENFILE : 0);

This freeing should be in callers too.  I put it here to try to keep
getnewvnode_reserve() working without changing it more.  The whole
function is only needed to handle messes from the existence of
getnewvnode_reserve().

X  }
X 
X +/*
X + * This hack is fragile, and probably not needed any more now that the
X + * watermark handling works.
X + */
X  void
X  getnewvnode_reserve(u_int count)

Please delete this bad API.  It seems to be only used by zfs, and I
didn't test its fixes.

X @@ -1063,8 +1247,17 @@
X  	struct thread *td;
X 
X +	/* Pre-adjust like the pre-adjust in getnewvnode(), with any count. */
X +	/* XXX no longer so quick, but this part is not racy. */
X +	mtx_lock(&vnode_free_list_mtx);
X +	if (numvnodes + count > desiredvnodes && freevnodes > wantfreevnodes)
X +		vnlru_free(ulmin(numvnodes + count - desiredvnodes,
X +		    freevnodes - wantfreevnodes));
X +	mtx_unlock(&vnode_free_list_mtx);
X +
X  	td = curthread;
X  	/* First try to be quick and racy. */
X  	if (atomic_fetchadd_long(&numvnodes, count) + count <= desiredvnodes) {
X  		td->td_vp_reserv += count;
X +		vcheckspace();	/* XXX no longer so quick, but more racy */
X  		return;
X  	} else
X @@ -1079,7 +1272,16 @@
X  		}
X  	}
X +	vcheckspace();
X  	mtx_unlock(&vnode_free_list_mtx);
X  }
X 
X +/*
X + * This hack is fragile, especially if desiredvnodes or wantfreevnodes are
X + * misconfigured or changed significantly.  Reducing desiredvnodes below
X + * the reserved amount should cause bizarre behaviour like reducing it
X + * below the number of active vnodes -- the system will try to reduce
X + * numvnodes to match, but should fail, so the subtraction below should
X + * not overflow.
X + */
X  void
X  getnewvnode_drop_reserve(void)
X @@ -1102,4 +1304,5 @@
X  	struct bufobj *bo;
X  	struct thread *td;
X +	static int cyclecount;
X  	int error;
X 
X @@ -1112,17 +1315,35 @@
X  	}
X  	mtx_lock(&vnode_free_list_mtx);
X -	/*
X -	 * Lend our context to reclaim vnodes if they've exceeded the max.
X -	 */
X -	if (freevnodes > wantfreevnodes)
X +	if (numvnodes < desiredvnodes)
X +		cyclecount = 0;
X +	else if (cyclecount++ >= freevnodes) {
X +		cyclecount = 0;
X +		vstir = 1;
X +	}

This uncommented stirring is a hack to try to grow the free list when it
is cycling.  I tried many ways to do this and only this one even sort of
works.  It can happen that the vnodes looked at by a large tree walk can
easily fit in the cache, but they don't since old garbage is preferred.
After one iteration, the cache state might end up as:
- 32% free (at high watermark), all from the tree walk
- 10% non-free for directories from the tree walk
- 58% non-free (the rest) unrelated to the tree walk
and the uncached state might end up as:
- 1% of the cache size for non-directories from the tree walk

Then repeating the walk any number of times will thrash the 32% free
part in preference to growing it by 1% to hold the uncached part.  The
hack detects this cycling and discards from the non-free list to make
space that the free list can grow into (if there is no other load).

-current does stirring with similar effects (but more bad ones) by
discarding from the whole cache every second if it grows larger than
9/10 of what it should grow to.  This trashes more than it stirs.  It
doesn't guarantee growth of the free list since it trashes the free
list too, perhaps more than it can grow back.  But any sort of stirring
gives the cache a chance of coming out of a bad stable state.  Random
trashing of the clogged part might be a good way of stirring it.

X +	/*
X +	 * Grow the vnode cache if it will not be above its target max
X +	 * after growing.  Otherwise, if the free list is nonempty, try
X +	 * to reclaim 1 item from it before growing the cache (possibly
X +	 * above its target max if the reclamation failed or is delayed).
X +	 * Otherwise, wait for some space.  In all cases, schedule
X +	 * vnlru_proc() if we are getting short of space.  The watermarks
X +	 * should be chosen so that we never wait or even reclaim from
X +	 * the free list to below its target minimum.
X +	 */
X +	if (numvnodes + 1 <= desiredvnodes)
X +		;
X +	else if (freevnodes > 0)
X  		vnlru_free(1);
X -	error = getnewvnode_wait(mp != NULL && (mp->mnt_kern_flag &
X -	    MNTK_SUSPEND));
X +	else {
X +		error = getnewvnode_wait(mp != NULL && (mp->mnt_kern_flag &
X +		    MNTK_SUSPEND));
X  #if 0	/* XXX Not all VFS_VGET/ffs_vget callers check returns. */
X -	if (error != 0) {
X -		mtx_unlock(&vnode_free_list_mtx);
X -		return (error);
X -	}
X +		if (error != 0) {
X +			mtx_unlock(&vnode_free_list_mtx);
X +			return (error);
X +		}
X  #endif
X +	}
X +	vcheckspace();
X  	atomic_add_long(&numvnodes, 1);
X  	mtx_unlock(&vnode_free_list_mtx);

Fairly standard watermark stuff.  We are now happy to let the free list
grow below or above its "wanted" size, but expect it to remain between
the watermarks, which are slightly higher.  Large tree walks should cause
the free list to grow much larger by discarding old non-free garbage, but
that rarely happens.

X @@ -2524,4 +2745,5 @@
X  		mp->mnt_activevnodelistsize--;
X  	}
X +	/* XXX V*AGE hasn't been set since 1997. */
X  	if (vp->v_iflag & VI_AGE) {
X  		TAILQ_INSERT_HEAD(&vnode_free_list, vp,

Please remove VI_AGE.  I think it was nonexistent before FreeBSD-3 and
never set after FreeBSD-3.

Bruce