Date: Thu, 5 Nov 2015 06:49:15 +1100 (EST)
From: Bruce Evans
To: Konstantin Belousov
cc: Kirk McKusick, Bruce Evans, fs@freebsd.org
Subject: Re: an easy (?) question on namecache sizing
In-Reply-To: <20151103090448.GC2257@kib.kiev.ua>
Message-ID: <20151105043607.K3175@besplex.bde.org>
References: <20151102224910.E2203@besplex.bde.org> <201511030447.tA34lo5O090332@chez.mckusick.com> <20151103090448.GC2257@kib.kiev.ua>

On Tue, 3 Nov 2015, Konstantin Belousov wrote:

> On Mon, Nov 02, 2015 at 08:47:50PM -0800, Kirk McKusick wrote:
>> You seem to be proposing several approaches.  One is to make
>> wantfreevnodes bigger (half or three-quarters of the maximum).
>> Another seems to be reverting to the previous (freevnodes >= wantfreevnodes
>> && numvnodes >= minvnodes).  So what is your proposed change?
>
> Free vnodes could be freed in a soft fashion by the vnlru daemon, or in
> a hard manner, by getnewvnode(), when the max for the vnode count is
> reached.  The 'soft' way skips vnodes which are directories, to make it
> more probable that vn_fullpath() would succeed, and also has a threshold
> for the count of cached pages.  The 'hard' way waits up to 1 sec for
> the vnlru daemon to succeed, before forcing a recycle of any vnode,
> regardless of the 'soft' stoppers.  This causes the ticking behaviour of
> the system, where only one vnode operation in a single thread succeeds
> per second.  This ticking behaviour seems to be mostly wrong.

numvnodes is allowed to grow to its nominal limit (desiredvnodes) in most
places (except that an off-by-1 error allows it to grow 1 larger), but
vnlru_proc() uses a limit of 9/10 of that (without the off-by-1 error).
So the usual behaviour is to grow a little larger than the 9/10 limit,
then churn by reclaiming too much, to significantly below the 9/10 limit,
then build back up to above 9/10 again...

I didn't see the 1-vnode-per-second behaviour, except that when I set
desiredvnodes to values like 1, 10 or 100 I saw very interesting
behaviours.  A little larger than 100 works almost normally, but 10
should give 1 vnode per second, since 1 is 1/10 of 10.  I think it
actually gives 2 vnodes per second due to the off-by-1 error.

I need to know whether the churn is intentional before removing it
completely.  See the comment below.

> A large wantfreevnodes value is a safety measure to prevent the tick
> steps in practice.  My initial reaction to the complaint was just to
> suggest increasing desiredvnodes; at least this is what I do on
> machines where there is a lot of both KVA and memory and intensive
> file loads are expected.

Yes, desiredvnodes should be sized large enough to allow space for plenty
of "free" vnodes, but there are some problems with this:

- non-problem: dynamic resizing of all the limits now works quite well,
  better than I said earlier.  -current ramps up both the total and free
  vnodes better than I said earlier.
- not many people know about this magic.
- i386s don't have enough KVA for much enlargement.  The defaults here
  give 70K for desiredvnodes with 1GB of RAM and 123K with 3GB of usable
  RAM.  Even 70K is really too many.  Memory use for each vnode seems to
  be unreasonably large (more than 1K on average, even when most vnodes
  are free).  That gives about 100 MB of wired memory.
- there are some bugs with large numbers of vnodes.  300K of them almost
  fit and almost work on i386.  However:
  - in old versions of FreeBSD, perhaps with VM_BCACHE_SIZE_MAX larger
    than the default, null pointer panics occur.
  - at least in -current, uma is reluctant to free vnode allocations.
    Reducing desiredvnodes now works (on the next tick) to reduce
    numvnodes, but uma doesn't free anything, since we don't ask it to.
    I think it will free later if memory becomes short.
  - in -current, there is a bug freeing memory when memory becomes short.
    I saw this for cached data, not vnodes.  Most memory was "Inact" for
    VMIO, with a very large number of non-free but non-referenced vnodes
    referencing the cached data.  Then the tar programs used to set this
    up get killed by an out-of-memory error (IIRC, the kernel one, not a
    malloc failure).  I don't use swap, but I haven't seen this error for
    10-15 years except in much larger programs.

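To watch these counters while reproducing any of the above, something
like the following userland sketch can be used.  It is only a sketch
(untested as posted): it assumes the vfs.numvnodes, vfs.freevnodes,
vfs.wantfreevnodes and kern.maxvnodes sysctls, the last three of which
appear in the patch below, and it just prints how far numvnodes is from
the 9/10 threshold that vnlru_proc() currently uses.

/*
 * vnwatch.c: poll the vnode counters once per second.  Build with
 * "cc -o vnwatch vnwatch.c".  Error handling is minimal.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

static u_long
get_ulong(const char *name)
{
        u_long val;
        size_t len;

        len = sizeof(val);
        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "%s", name);
        return (val);
}

int
main(void)
{
        u_long numvnodes, freevnodes, wantfreevnodes;
        size_t len;
        int maxvnodes;

        len = sizeof(maxvnodes);
        if (sysctlbyname("kern.maxvnodes", &maxvnodes, &len, NULL, 0) == -1)
                err(1, "kern.maxvnodes");
        for (;;) {
                numvnodes = get_ulong("vfs.numvnodes");
                freevnodes = get_ulong("vfs.freevnodes");
                wantfreevnodes = get_ulong("vfs.wantfreevnodes");
                printf("num %lu free %lu want %lu 9/10-limit %d over-by %ld\n",
                    numvnodes, freevnodes, wantfreevnodes, maxvnodes * 9 / 10,
                    (long)numvnodes - maxvnodes * 9 / 10);
                sleep(1);
        }
}
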
Here is my work in progress:

X diff -u2 vfs_subr.c~ vfs_subr.c
X --- vfs_subr.c~ 2015-09-28 06:29:43.000000000 +0000
X +++ vfs_subr.c  2015-11-04 14:25:42.476369000 +0000
X @@ -147,14 +147,15 @@
X
X /*
X - * Free vnode target.  Free vnodes may simply be files which have been stat'd
X - * but not read.  This is somewhat common, and a small cache of such files
X - * should be kept to avoid recreation costs.
X + * "Free" vnode target.  Free vnodes are rarely actually free.  Usually
X + * they are for files which have been stat'd but not read; these usually
X + * have inode and namecache data attached to them.  A large cache of such
X + * vnodes should be kept to minimise recreation costs.
X  */
X static u_long wantfreevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, "");
X -/* Number of vnodes in the free list. */
X +SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW,
X +    &wantfreevnodes, 0, "Target for minimum number of \"free\" vnodes");
X static u_long freevnodes;
X -SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0,
X -    "Number of vnodes in the free list");
X +SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD,
X +    &freevnodes, 0, "Number of \"free\" vnodes");
X
X static int vlru_allow_cache_src;

The comments were very out of date: the cache needs to be either large
(thousands) or small (25).  Also improve the descriptions and fix style
bugs in the sysctls.

X @@ -275,9 +276,9 @@
X
X /*
X - * Number of vnodes we want to exist at any one time.  This is mostly used
X - * to size hash tables in vnode-related code.  It is normally not used in
X - * getnewvnode(), as wantfreevnodes is normally nonzero.)
X + * Target for maximum number of vnodes.
X  *
X - * XXX desiredvnodes is historical cruft and should not exist.
X + * XXX "desired" is a not very good historical name for this.  We only
X + * desire to use the full maximum under loads where caching vnodes is
X + * the best use of memory resources.
X  */
X int desiredvnodes;

This one was so out of date that it was almost correct again, except for
the parts saying that it was out of date.

X @@ -292,4 +293,5 @@
X                 return (error);
X         if (old_desiredvnodes != desiredvnodes) {
X +               wantfreevnodes = desiredvnodes / 4;
X                 vfs_hash_changesize(desiredvnodes);
X                 cache_changesize(desiredvnodes);

Keep wantfreevnodes in sync when desiredvnodes changes.  I considered
telling uma the new allocation size here (also in init), but that
shouldn't be needed, and it would need to be done elsewhere if done at
all.  desiredvnodes is a soft limit, so uma should not be limited to
that, and this is too early to reduce.  It would be OK for vnlru_proc()
to tell uma to reduce to the current active allocation.

X @@ -300,7 +302,7 @@
X SYSCTL_PROC(_kern, KERN_MAXVNODES, maxvnodes,
X     CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, &desiredvnodes, 0,
X -    sysctl_update_desiredvnodes, "I", "Maximum number of vnodes");
X +    sysctl_update_desiredvnodes, "I", "Target for maximum number of vnodes");
X SYSCTL_ULONG(_kern, OID_AUTO, minvnodes, CTLFLAG_RW,
X -    &wantfreevnodes, 0, "Minimum number of vnodes (legacy)");
X +    &wantfreevnodes, 0, "Old name for vfs.wantfreevnodes (legacy)");
X static int vnlru_nowhere;
X SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW,

X @@ -844,5 +846,5 @@
X
X /*
X - * Attempt to keep the free list at wantfreevnodes length.
X + * Attempt to reduce the free list by the requested amount.
X  */
X static void

X @@ -922,7 +924,31 @@
X                 kproc_suspend_check(p);
X                 mtx_lock(&vnode_free_list_mtx);
X -               if (freevnodes > wantfreevnodes)
X -                       vnlru_free(freevnodes - wantfreevnodes);
X -               if (numvnodes <= desiredvnodes * 9 / 10) {
X +               if (numvnodes > desiredvnodes &&
X +                   freevnodes > wantfreevnodes)
X +                       vnlru_free(ulmin(numvnodes - desiredvnodes,
X +                           freevnodes - wantfreevnodes));

Magic related to the 9/10 factor is not removed by this, but moved below
and changed.  This just lets freevnodes grow larger than wantfreevnodes
if it safely can.

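To put numbers on that, here is a tiny standalone illustration of the
old and new policies (plain userland C with made-up counter values, just
to show the difference in how much gets trimmed; it is not kernel code):

/*
 * Compare the old and new vnlru_proc() trimming policies for one set
 * of made-up counter values.
 */
#include <stdio.h>

static unsigned long
ulmin(unsigned long a, unsigned long b)
{
        return (a < b ? a : b);
}

int
main(void)
{
        unsigned long desiredvnodes = 100000, wantfreevnodes = 25000;
        unsigned long numvnodes = 100050, freevnodes = 30000;

        /* Old policy: always trim the free list back to wantfreevnodes. */
        if (freevnodes > wantfreevnodes)
                printf("old: free %lu vnodes\n", freevnodes - wantfreevnodes);

        /*
         * New policy: trim only when over the total target, and only by
         * enough to get back under it; here min(50, 5000) = 50.
         */
        if (numvnodes > desiredvnodes && freevnodes > wantfreevnodes)
                printf("new: free %lu vnodes\n",
                    ulmin(numvnodes - desiredvnodes,
                    freevnodes - wantfreevnodes));
        return (0);
}

This prints "old: free 5000 vnodes" and "new: free 50 vnodes": the old
code would have trimmed the free list all the way back to wantfreevnodes,
while the new code only trims enough to get numvnodes back under
desiredvnodes.
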
X +               /*
X +                * Sleep if there are not too many allocated vnodes or
X +                * exactly the right number of allocated vnodes but too
X +                * few free vnodes.  When below the target maximum, we
X +                * will prefer to allocate a new vnode to reusing a "free"
X +                * vnode.  When at the target maximum, we prefer to discard
X +                * some non-free vnodes now to cycling through a too-small
X +                * number of free vnodes later.  When above the target
X +                * maximum, we must discard some vnodes now to reduce to
X +                * the maximum.  The discarding here is not smart and tends
X +                * to discard too many vnodes with wrong choices, but it is
X +                * good enough since it is rarely done.  Occasional excessive
X +                * random discarding may even be a feature for evolution.
X +                *
X +                * Old versions did even more discarding here, by allowing
X +                * the target maximum to be reached elsewhere but reducing
X +                * to 9/10 of that here.  I think this was intended to
X +                * reduce activity here, but actually gave the bug of the
X +                * target maximum being 9/10 of its requested value, and
X +                * the evolutionary feature.
X +                */
X +               if (numvnodes < desiredvnodes || (numvnodes == desiredvnodes &&
X +                   freevnodes >= wantfreevnodes)) {
X                         vnlruproc_sig = 0;
X                         wakeup(&vnlruproc_sig);

The comment describes this too verbosely.  I plan to remove the
numvnodes == desiredvnodes subcase here, but to limit numvnodes elsewhere
if necessary to allow room for expansion of freevnodes up to
wantfreevnodes without numvnodes growing too large.

The above gives churning much like before, but only above the full limit
instead of above 9/10 of that, or in the subcase.  The 9/10 limit is
often exceeded, but the new ones rarely are.  The planned version will
make the subcase unreachable.

X @@ -1031,5 +1057,5 @@
X
X /*
X - * Wait for available vnodes.
X + * Wait if necessary for a new vnode to become available.
X  */
X static int

X @@ -1038,12 +1064,11 @@
X
X         mtx_assert(&vnode_free_list_mtx, MA_OWNED);
X -       if (numvnodes > desiredvnodes) {
X +       if (numvnodes >= desiredvnodes) {

Fix off-by-1 error.  This allowed creation of 1 more than the max.  This
bug was masked by the 9/10 magic reducing the actual limit to a smaller,
fuzzier value.  E.g., a limit of 100000 allows 100001 here, but the
actual limit is 90000 in vnlru_proc().  It is still 100001 here, but it
is hard to reach much above 90000 after reducing to less than that.

X                 if (suspended) {
X                         /*
X -                        * File system is beeing suspended, we cannot risk a
X -                        * deadlock here, so allocate new vnode anyway.
X +                        * The file system is being suspended.  We cannot
X +                        * risk a deadlock here, so allow allocation of
X +                        * another vnode even if this would give too many.
X                          */
X -                       if (freevnodes > wantfreevnodes)
X -                               vnlru_free(freevnodes - wantfreevnodes);

Fix spelling and improve wording in the comment.  Remove some code: the
caller can do it, and now does.

X                         return (0);
X                 }
X @@ -1055,5 +1080,5 @@
X                     "vlruwk", hz);
X         }
X -       return (numvnodes > desiredvnodes ? ENFILE : 0);
X +       return (numvnodes >= desiredvnodes ? ENFILE : 0);

Fix off-by-1 error.

X }
X
X @@ -1113,8 +1138,10 @@
X         mtx_lock(&vnode_free_list_mtx);
X         /*
X -        * Lend our context to reclaim vnodes if they've exceeded the max.
X +        * Lend our context to reclaim vnodes if one more would be too many
X +        * and there are plenty to reclaim.
X          */
X -       if (freevnodes > wantfreevnodes)
X -               vnlru_free(1);
X +       if (numvnodes >= desiredvnodes && freevnodes > wantfreevnodes)
X +               vnlru_free(ulmin(numvnodes + 1 - desiredvnodes,
X +                   freevnodes - wantfreevnodes));

Policy change.  This is supposed to be the same as above except for
adjustments to add 1 for the new vnode, but it is buggy.  It should do
the following (sketched in code below):

- if numvnodes < desiredvnodes, don't use the free list
- if numvnodes > desiredvnodes (rare), always use the free list if it is
  nonempty
- if numvnodes == desiredvnodes (usual), use the free list if it is large
  enough (perhaps just nonempty, or the old threshold of 25)

Then, if the free list ends up below wantfreevnodes, depend on the churn
in vnlru_proc() to grow it.  Since the condition for waiting (with the
old off-by-1 error or the above bugs) rarely occurs, the churn normally
only occurs once per second.  That limits its damage but also limits the
rate of growing the free list.

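In code shape, that policy looks something like the following.  This is
only a standalone sketch (plain userland C so it can be compiled and
poked at, not a patch; the function name is made up, and the 25 is the
old historical threshold mentioned above):

/*
 * Sketch of the proposed getnewvnode() decision: return nonzero when a
 * vnode should be recycled from the free list instead of allocating a
 * new one.
 */
#include <stdio.h>

static int
should_use_free_list(unsigned long numvnodes, unsigned long freevnodes,
    unsigned long desiredvnodes)
{
        if (numvnodes < desiredvnodes)
                return (0);             /* room to grow: allocate a new vnode */
        if (numvnodes > desiredvnodes)
                return (freevnodes > 0);        /* rare: recycle if possible */
        return (freevnodes > 25);       /* usual: recycle only if worthwhile */
}

int
main(void)
{
        /* A few made-up cases showing the three branches. */
        printf("%d\n", should_use_free_list(90000, 30000, 100000));  /* 0 */
        printf("%d\n", should_use_free_list(100001, 10, 100000));    /* 1 */
        printf("%d\n", should_use_free_list(100000, 10, 100000));    /* 0 */
        printf("%d\n", should_use_free_list(100000, 30000, 100000)); /* 1 */
        return (0);
}
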
Planned version:

- if numvnodes - freevnodes >= desiredvnodes - wantfreevnodes, then the
  cache is becoming unbalanced with low chances of recovery (it has too
  many non-free vnodes and not enough space to add free ones).  E.g.,
  with desiredvnodes = 100000 and wantfreevnodes = 25000, this triggers
  once there are 75000 non-free vnodes, however large the free list is.
  Then wait more often here, as in the above numvnodes == desiredvnodes
  case.

Whenever we wait, we wake up vnlru_proc().  We must do this more often
than before to grow the free list, but not too often.  It does the right
things, except for also shrinking the free list.

X         error = getnewvnode_wait(mp != NULL && (mp->mnt_kern_flag &
X             MNTK_SUSPEND));

This doesn't really fix the original problem.  du -s now works better
starting with a cold cache, by allowing freevnodes to grow much larger
than wantfreevnodes while numvnodes is not near its limit.  But when the
cache is warm, numvnodes is near its limit and freevnodes may grow small.
Then the modifications attempt to grow freevnodes back up to
wantfreevnodes, but it would be unreasonable to discard many non-free
vnodes to grow freevnodes larger than that.  The churning might do that
accidentally, after perhaps too long, if the usage pattern changes from
mostly reads to mostly stats.

Bruce