From owner-freebsd-fs@freebsd.org  Mon Nov 2 11:30:12 2015
Date: Mon, 2 Nov 2015 22:29:56 +1100 (EST)
From: Bruce Evans
To: Bruce Evans
cc: fs@freebsd.org
Subject: Re: an easy (?) question on namecache sizing
In-Reply-To: <20151102193756.L1475@besplex.bde.org>
Message-ID: <20151102210750.S1908@besplex.bde.org>
References: <20151102193756.L1475@besplex.bde.org>

On Mon, 2 Nov 2015, Bruce Evans wrote:

> At least in old versions before cache_changesize() (should be nc_chsize())
> existed, the name cache is supposed to have size about 2 * desiredvnodes,
> but its effective size seems to be only about desiredvnodes / 4?  Why is
> this?
>
> This shows up in du -s on a large directory like /usr.  Whenever the
> directory has more than about desiredvnodes / 4 entries under it, the
> namecache thrashes.  The number of cached vnodes is also limited to
> about desiredvnodes / 4.
>
> The problem might actually be in vnode caching. ...

This was easy to answer.  The problem is in vnode caching.  Its only
relationship with the namecache is that if you increase the bogus vnode
cache limit, then cache_changesize() now adjusts the associated namecache
limit to match, but doesn't increase the associated non-bogus vnode cache
limits to match.

From vfs_subr.c:

X /*
X  * Number of vnodes we want to exist at any one time.  This is mostly used
X  * to size hash tables in vnode-related code.  It is normally not used in
X  * getnewvnode(), as wantfreevnodes is normally nonzero.)
X  *
X  * XXX desiredvnodes is historical cruft and should not exist.
X  */
X int desiredvnodes;

I probably helped Eivind write the XXX comment in 2000.
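The mismatch is easy to model outside the kernel.  The following standalone
sketch is only a rough model, not the kernel code: set_maxvnodes() is a
hypothetical stand-in for the sysctl_update_desiredvnodes() path, and the
namecache target is approximated as 2 * desiredvnodes.  It shows what
happens to the wantfreevnodes fraction when the limit is changed after boot
(the values are the ones from the misconfigured test further down):

#include <stdio.h>

static long desiredvnodes;	/* kern.maxvnodes: really a hard limit */
static long wantfreevnodes;	/* kern.minvnodes / vfs.wantfreevnodes */
static long namecache_target;	/* approximated here as 2 * desiredvnodes */

static void
boot_time_init(long maxvnodes)
{

	desiredvnodes = maxvnodes;
	wantfreevnodes = desiredvnodes / 4;	/* computed once, at boot */
	namecache_target = 2 * desiredvnodes;
}

/* Hypothetical stand-in for the sysctl_update_desiredvnodes() path. */
static void
set_maxvnodes(long maxvnodes)
{

	desiredvnodes = maxvnodes;
	namecache_target = 2 * desiredvnodes;	/* cache_changesize() analogue */
	/* wantfreevnodes is not recomputed. */
}

int
main(void)
{

	boot_time_init(4 * 30774);	/* the default on the test machine below */
	set_maxvnodes(70000);		/* the tuned value used below */
	printf("maxvnodes %ld namecache ~%ld wantfree %ld (%.0f%% of max)\n",
	    desiredvnodes, namecache_target, wantfreevnodes,
	    100.0 * wantfreevnodes / desiredvnodes);
	return (0);
}

With the default of 123096 the free fraction is the intended 25%; after the
change to 70000 it is about 44%, and raising the limit instead would push
it below 25%.  Either way the two limits no longer track each other.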
I only just noticed the error in the main part of the comment.  This is not
the number of vnodes that we want to exist, but about 4 times that number.

X ...
X SYSCTL_PROC(_kern, KERN_MAXVNODES, maxvnodes,
X     CTLTYPE_INT | CTLFLAG_MPSAFE | CTLFLAG_RW, &desiredvnodes, 0,
X     sysctl_update_desiredvnodes, "I", "Maximum number of vnodes");

The maximum is not bogus.  Only "desired" in the name is bogus.  Note that
the sysctl name doesn't say "desired".  But systat -v still uses the raw
variable name (abbreviated to "desvn").  It is important for understanding
systat -v output and the source code to know that this variable is actually
the maximum and not the desired number, unlike what its name suggests.

X SYSCTL_ULONG(_kern, OID_AUTO, minvnodes, CTLFLAG_RW,
X     &wantfreevnodes, 0, "Minimum number of vnodes (legacy)");

Further obfuscations.  The desiredvnodes / 4 number comes from here.  This
value really is the "wanted" or "desired" number of vnodes.  The sysctl
obfuscates it by renaming it to "minvnodes"; it is not a minimum except in
the sense that when the current number of free vnodes exceeds it, the vnlru
daemon tries (not very hard) to reduce the free count to it.  The
description of this sysctl as legacy is confusing.  Perhaps the name of
this sysctl is legacy, but its value is less legacy than that of
desiredvnodes by any name.  This is further obfuscated by exporting this
variable twice (once here and once with its correct name under vfs).  See
below for an example of sysctl output.

X ...
X 	wantfreevnodes = desiredvnodes / 4;

This is where the length used below is initialized.

X ...
X /*
X  * Attempt to keep the free list at wantfreevnodes length.
X  */
X static void
X vnlru_free(int count)

Misplaced comment.  This function actually attempts to keep the list at a
certain length that is decided elsewhere.

X ...
X static void
X vnlru_proc(void)
X {
X ...
X 	for (;;) {
X 		kproc_suspend_check(p);
X 		mtx_lock(&vnode_free_list_mtx);
X 		if (freevnodes > wantfreevnodes)
X 			vnlru_free(freevnodes - wantfreevnodes);

This is where the length used above is passed.  This length is defaulted by
the above initialization and may be changed by either of the two sysctls
for it.  But the attempt usually fails, and then the vnode cache works
better by growing nearly 4 times as large as is "wanted", up to nearly its
"desired" size, which is actually the limit on its size.  In my du -s test,
the attempt succeeds and breaks the caching almost perfectly when the
number of files is slightly larger than wantfreevnodes = desiredvnodes / 4,
but when the files are read the attempt fails and the caching works when
the number of files is smaller than desiredvnodes.  Reading the files is
probably closer to normal operation.

On freefall now, "sysctl -a | grep vnode" gives:

kern.maxvnodes: 485993
kern.minvnodes: 121498
vfs.freevnodes: 121453
vfs.wantfreevnodes: 121498
vfs.vnodes_created: 360808607
vfs.numvnodes: 408313

Note that the current number is about 3.5 times as large as the "wanted"
number.  This shows that the attempts to reduce to the "wanted" number
usually fail, so the cache is almost 4 times as large as is "wanted".  The
kern values are limits, with a hard maximum and a soft minimum.  The
kern.minvnodes number is duplicated under its better name
vfs.wantfreevnodes.  The sysctl for desiredvnodes is now the SYSCTL_PROC()
shown above.  The function for this now updates vfs_hash and the namecache,
but not wantfreevnodes.  Code earlier in vfs_subr.c shows that it is
wantfreevnodes that is primary and its duplication for minvnodes really is
legacy.
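Returning to the trimming loop above for a moment: it is simple enough to
model with a toy userland loop.  The sketch below is only a crude model
under an assumed worst case -- every stat'd file immediately leaves one
unreferenced vnode on the free list, and the trimming runs after every
file -- with the numbers taken from the foot-shooting data further down:

#include <stdio.h>

int
main(void)
{
	long wantfreevnodes = 30774;	/* kern.minvnodes on the test machine */
	long nfiles = 49683;		/* files under the du -s tree */
	long numvnodes = 0, freevnodes = 0, recycled = 0;
	long excess, i;

	for (i = 0; i < nfiles; i++) {
		/* stat() one file: allocate a vnode, then drop the reference. */
		numvnodes++;
		freevnodes++;
		/* vnlru-style trimming, as in the loop quoted above. */
		if (freevnodes > wantfreevnodes) {
			excess = freevnodes - wantfreevnodes;
			freevnodes -= excess;
			numvnodes -= excess;
			recycled += excess;
		}
	}
	printf("%ld of %ld vnodes still cached, %ld recycled\n",
	    numvnodes, nfiles, recycled);
	return (0);
}

In this toy run the maxvnodes limit of 70000 never comes into play; the
trim back to 30774 does, so only about wantfreevnodes of the tree's vnodes
survive the pass.  That is roughly the 38157 reported below, once the
vnodes that the rest of the system holds in use are added.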
It is vfs.wantfreevnodes that should be the SYSCTL_PROC():

X /*
X  * Free vnode target.  Free vnodes may simply be files which have been stat'd
X  * but not read.  This is somewhat common, and a small cache of such files
X  * should be kept to avoid recreation costs.
X  */
X static u_long wantfreevnodes;
X SYSCTL_ULONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, "");

According to this, keeping vnodes as free but inactive for files that have
been stat'ed but not read is intentional.  The du -s example shows that
this works almost perfectly as foot-shooting, except VMIO stops the foot
being blown very far away.

X /* Number of vnodes in the free list. */
X static u_long freevnodes;
X SYSCTL_ULONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0,
X     "Number of vnodes in the free list");

Data for the foot-shooting:

kern.maxvnodes: 70000
kern.minvnodes: 30774
vfs.numvnodes: 38157
vfs.vnodes_created: 461556
vfs.wantfreevnodes: 30774
vfs.freevnodes: 30775

Here maxvnodes started at 4*30774 but I reduced it to 70000 for comparison
with another system.  I didn't reduce minvnodes from 30774 since neither I
nor the sysctl knew about it.  Then du -s on a directory tree with 49683
files gave the (now even more misconfigured) "wanted" number of free vnodes
almost perfectly.  70000 - 30774 = 39226 is less than the number of files,
so this asks for thrashing of the vnode and name caches.  38157 instead of
39226 vnodes were left cached.

The default misconfiguration gives more mysterious numbers:

kern.maxvnodes: 123096
kern.minvnodes: 30774
vfs.numvnodes: 38143
vfs.vnodes_created: 50136
vfs.wantfreevnodes: 30774
vfs.freevnodes: 30774

Now 1/4 of maxvnodes can be forced to be free without limiting the number
of non-free ones too much -- 3/4 can remain in use; 3/4 of maxvnodes is
about 90000, and that is plenty for caching about 49000 files.  Apparently,
the freeing is too active, so when there aren't many vnodes in use it
reaches the target by discarding useful vnodes.  For the du -s access
pattern with lots of stat'ed files, the total number of vnodes in use never
grows large enough to justify freeing any.  But on freefall, or on any
system that has been up for a while doing a variety of tasks, the number of
vnodes in use is large, so discarding some earlier than necessary works
right.  A large du -s then discards lots of old vnodes but not the ones
that it looks at unless there are just too many.

Bruce