Date: Tue, 3 Nov 2015 21:17:15 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Kirk McKusick <mckusick@mckusick.com>
Cc: fs@freebsd.org
Subject: Re: an easy (?) question on namecache sizing
Message-ID: <20151103173042.K1103@besplex.bde.org>
In-Reply-To: <201511030447.tA34lo5O090332@chez.mckusick.com>
References: <201511030447.tA34lo5O090332@chez.mckusick.com>
On Mon, 2 Nov 2015, Kirk McKusick wrote:

> You seem to be proposing several approaches. One is to make
> wantfreevnodes bigger (half or three-quarters of the maximum).
> Another seems to be reverting to the previous (freevnodes >= wantfreevnodes
> && numvnodes >= minvnodes). So what is your proposed change?

For a quick fix, I will try:

    wantfreevnodes = current value (perhaps too large)
    minvnodes = maxvnodes = desiredvnodes

with the old code that creates new vnodes up to maxvnodes instead of
attempting to recycle old vnodes well below maxvnodes. Only one variable
is needed for this, and the very old name desiredvnodes is best for it,
but the separate variables are useful for trying variations.

More dynamic configuration when the variables are changed is needed. Old
versions of FreeBSD already have the separate variables, but changing them
using SYSCTL_INT() doesn't work. E.g., reducing maxvnodes below numvnodes
doesn't eventually reduce numvnodes, but eventually causes deadlock.

The old version in its default configuration didn't really work for du -s
either. It appears to work initially, but it basically asks for thrashing
through just 25 vnodes when only stat()s are done, so even ls -l /bin
thrashes once the system has created 1/4 of its "desired" number of
vnodes. The caches actually work initially because the silly limits are
inactive initially.

To see the silly behaviour in an old version of FreeBSD:

- read more than minvnodes files. This enters the silly region where the
  silly wantfreevnodes limit starts being applied.
- do a few du's and ls's in loops and watch them using systat -v. Verify
  that numvnodes > minvnodes is fairly stable and freevnodes <=
  wantfreevnodes = 25 (default).
- choose any directory with > 25 files in it that has not been looked at
  before. Repeat the previous test with du or ls -l on this directory.
- Observe that at least the namecache is broken (I see 11% hits for the
  namecache and 51% for the dircache for a directory with 31 entries
  (counting "." but not "..")).
- increase wantfreevnodes to the number of entries in the directory
  (possibly counting both "." and "..") and repeat the previous test.
  Observe that this unbreaks at least the namecache.

The old minvnodes limit (default desiredvnodes / 4) limited the silly
behaviour to above that limit, but since numvnodes was (still is?) never
reduced (even by unmount), the silly region is reached in normal operation
(after reading lots of files) and is fairly sticky after that (unmount
does help by creating lots of free vnodes, and then it takes reading lots
of files to reach the silly region again).

Now there is no minvnodes limit, and wantfreevnodes defaults to the old
minvnodes default. This gives slightly worse behaviour than the old
version with the default for wantfreevnodes changed from 25 to
desiredvnodes / 4. 25 was far too small and desiredvnodes / 4 is probably
too large for most purposes. However, desiredvnodes can be enlarged to
leave space for a larger than necessary wantfreevnodes, and the larger
than necessary wantfreevnodes is sometimes useful.

Many problems remain, especially for initialization. The old defaults
worked perfectly for initialization up to the minvnodes limit: vnodes were
never recycled below that. Now the default for wantfreevnodes gives an
identical limit with different semantics. Silly caching for stat()ed files
sometimes occurs below this limit instead of always occurring above it.
E.g., soon after booting, numvnodes is about 200 and freevnodes is about
100. In the old version, all stat()s of new files increase both numvnodes
and freevnodes until the limit is reached -- the caching works. In the
current version, stat()s of new files cycle through the old free vnodes if
possible -- the caching doesn't work, but instead thrashes especially well
when freevnodes is small.
It takes non-stat() accesses to files to increase numvnodes. Eventually
numvnodes becomes large enough for freevnodes to also become large on
average, so the cases with perfect thrashing become rare. But the special
caching for stat() still gives a deterministic thrashing case. That is
when, although wantfreevnodes is larger than necessary for most cases (and
freevnodes is almost as large), it is not large enough to hold the current
working set of stat()s, as can easily happen for tree walks. Sometimes you
want to walk the tree more than once and know that it all should fit in
caches, but the recycling makes the caches ineffective.

I think there is a worse subcase of this, like the one for
initialization. Suppose that freevnodes is small at the start of a tree
walk. Then I think the vnode caching prefers to recycle with this small
number rather than create new vnodes. Reading directories makes freevnodes
even smaller.

The quick fix is supposed to make the free vnodes management almost null.
A non-quick fix would only turn it off when numvnodes < maxvnodes. When
numvnodes >= maxvnodes, recycling is still bad if it is mostly through
free vnodes and freevnodes is small. Here "small" is relative. Anything
smaller than the working set is too small, but if the working set is too
large to fit then it is better to let it thrash in a small part of the
cache than in a large part.

I would try letting freevnodes grow to 3/4 of desiredvnodes for tree walks
but try to keep it lower than 1/4 of desiredvnodes in normal use. The
current limit of 1/4 of desiredvnodes works better on larger systems. Such
systems might never reach the limits, especially with the slow ramp-up of
numvnodes.
E.g., ref11-amd64 is not nearly as large as freefall, but now has 11 users
and has been up for 29 days; it still hasn't reached the maxvnodes limit:

    kern.maxvnodes: 621596
    kern.minvnodes: 155399
    vfs.freevnodes: 154789
    vfs.wantfreevnodes: 155399
    vfs.vnodes_created: 117384264
    vfs.numvnodes: 537910

Most FreeBSD systems run big tree walks every night. This one has about 6M
inodes in / and 900M inodes elsewhere. It would soon reach maxvnodes if it
cached all of these. Limiting it to 537910 instead of 621596 is not
useful, but this type of automatic big tree walk, where the results are
rarely used, shouldn't be allowed to thrash through more than a small
fraction of maxvnodes. So it can't be automatic to go from a fraction of
1/4 to 3/4.

Bruce