Date: Sun, 17 Feb 2019 02:58:43 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gleb Smirnoff <glebius@freebsd.org>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r344188 - in head: lib/libc/sys sys/vm
Message-ID: <20190217011341.S833@besplex.bde.org>
In-Reply-To: <201902152336.x1FNaNUo039321@repo.freebsd.org>
References: <201902152336.x1FNaNUo039321@repo.freebsd.org>
On Fri, 15 Feb 2019, Gleb Smirnoff wrote:

> Log:
>   For 32-bit machines rollback the default number of vnode pager pbufs
>   back to the level before r343030. For 64-bit machines reduce it slightly,
>   too. Together with r343030 I bumped the limit up to the value we use at
>   Netflix to serve 100 Gbit/s of sendfile traffic, and it probably isn't a
>   good default.

This is only a rollback for the vnode pager pbufs sub-pool. Total resource usage (preallocated kva and the maximum RAM that can be mapped into this kva) is still about 5/2 times higher than before in my configuration. It would be 7/2 times higher if I configured fuse and smbfs.

r343030 changed the allocation methods in all subsystems except out-of-tree modules, and broke at least the KBI for these modules (*), so it is easy to find the full expansion, except for these modules, by looking at the diffs (I found the use in fuse and smbfs by grepping some subtrees). Also, the user's and vfs_bio's resource limit is still broken by expanding it by this factor of 5/2 or more.

In the old allocation method, there was a single pool of pbufs of size nswbuf, which normally has its limiting value of 256. This magic 256 is hard-coded in vfs_bio.c, but if the user somehow knows about it and the tunable kern.nswbuf, it can be overridden. The limit of 256 was documented in pbuf(9), but the tunable was never documented AFAIK. The variable nswbuf for this was documented in pbuf(9).

The 256 entries are shared between any number of subsystems. Most subsystems limited themselves to nswbuf/2 entries, and the man page recommended this. This gave overcommit by a factor of about 5/2 in my configuration (there are 7 subsystems, but some of these have a smaller limit).

Now each subsystem has a separate pool. The size of the sub-pool is still usually nswbuf / 2. This gives overallocation by a factor of about 5/2 in my configuration.

The overcommit only causes minor performance problems. 2 subsystems might use all of the buffers, and then all the other subsystems have to wait, but it is rare for even 2 subsystems to be under load at the same time. It is more of a problem that the limit is too small for a single subsystem. The overallocation gives worse problems, such as crashing at boot time or a little later when the user or auto-tuning has maxed out nswbuf.

> Provide a loader tunable to change vnode pager pbufs count. Document it.

This only controls 1 of the subsystems. It is too much to have a sysctl for each of the subsystems. Some users don't even know about the global sysctl kern.nswbuf that was enough for sendfile on larger (mostly 64-bit) systems. Just increase nswbuf a lot. This wastes kva for most of the subsystems, but kva is cheap if the address space is large.

Now the user has to know even more arcane details to limit the kva, and it is impossible to recover the old behaviour. To get the old limit, kern.nswbuf must be set to (256 * 2 / 5) in my configuration, but that significantly decreases the number of buffers for each subsystem.

Users might already have set kern.nswbuf to a large value. Since most subsystems used to use half of that many buffers, the wastage from setting it large for the benefit of 1 subsystem was at most a factor of 2. Now the limit can't be increased as much without running out of kva, and the safe increase is more arcane and machine-dependent (starting with the undocumented default being 8 times higher for 64-bit systems, but only for 1 of the subsystems).
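To make the arithmetic concrete, a loader.conf sketch (the 5/2 factor and the resulting value 102 come from my configuration above; treat the numbers as illustrative, not a recommendation):

    # Roughly recover the old total preallocation: the old shared pool had
    # 256 pbufs; the separate per-subsystem pools now add up to about 5/2
    # times that, so scale nswbuf down by 2/5.  Note this also shrinks each
    # per-subsystem pool to about nswbuf / 2 = 51 pbufs.
    kern.nswbuf="102"          # 256 * 2 / 5, rounded down

    # Going the other way, raise the limit for the benefit of 1 busy
    # subsystem; every separate pool now preallocates about nswbuf / 2
    # pbufs of kva, so this costs kva in all of them:
    #kern.nswbuf="1024"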
(*) The KBI was getpbuf(), trypbuf() and relpbuf(), and this was very easy to (ab)use (a rough sketch of the old interface follows at the end of this message). Any number of subsystems can try to use the shared pool. This is abused because a small fixed-size pool can't support an unbounded number of subsystems. Now getpbuf() doesn't exist (but is still referred to in swap_pager.c), and there is no man page for the new allocation method. The boot-time preallocation can't work for modules loaded later, and leaves unusable allocations for modules unloaded later. Modules apparently have to do their own preallocation. They should probably not use pbufs at all, and do their own allocations too.

It is now clear that there has always been a problem with the default limits. The magic number of 256 hasn't been changed since before FreeBSD-1. There were no pbufs in FreeBSD-1, but there was nswbuf, and it was dynamically tuned but limited to 256. I think 256 meant "infinity" in 1992, but it wasn't large enough even then. Before r343030 it was effectively even smaller, since there were more subsystems by then than in FreeBSD-1.

nswbuf needs to be very large to support slow devices. By the very nature of slow devices, the i/o queue tends to fill up with buffers for the slowest device, and if there is a buffer shortage then everything else has to wait for this device to free the buffers.

Slowness is relative. In FreeBSD-1, floppy disk devices were still in use and were especially slow. Now hard disks are slow relative to fast SSDs. But the number of buffers was unchanged. It is still essentially unchanged except for vn pager pbufs. The hard disks can complete 128 i/o's for a full queue much faster than a floppy disk, so the relative slowness might be similar, but now there are more subsystems and some systems have many more disks.

I have seen this queueing problem before, mainly for DVD disks, but thought it was more in the buffer cache than in pbufs. Testing this by increasing and decreasing kern.nswbuf didn't show much change in makeworld benchmarks. They still have idle time with large variance, as if something waits for buffers and doesn't get woken up promptly. Only clpbufs are used much. The counts now available in uma statistics show the strange behaviour that the free count rarely reaches the limit, but with larger limits the free count goes above smaller limits.

Bruce
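P.S. For reference, the old shared-pool interface looked roughly like this (a sketch from memory of the pre-r343030 KBI; the subsystem-side names below are illustrative, not from any real subsystem):

    /*
     * The old KBI, as declared in <vm/vm_pager.h> before r343030 (from
     * memory; check the old header for the authoritative prototypes).
     */
    struct buf	*getpbuf(int *pfreecnt);	/* sleeps for a free pbuf */
    struct buf	*trypbuf(int *pfreecnt);	/* non-sleeping variant */
    void		 relpbuf(struct buf *bp, int *pfreecnt);

    /*
     * Typical (ab)use by a subsystem: keep a local free count capped at
     * half of the shared pool, as pbuf(9) recommended.
     */
    static int example_pbuf_freecnt;		/* illustrative name */

    static void
    example_init(void)
    {

    	example_pbuf_freecnt = nswbuf / 2;
    }

    static int
    example_io(void *arg)
    {
    	struct buf *bp;

    	bp = getpbuf(&example_pbuf_freecnt);	/* may sleep for a buffer */
    	/* ... map pages into bp, issue the i/o, wait for completion ... */
    	relpbuf(bp, &example_pbuf_freecnt);
    	return (0);
    }

Every subsystem passing its own pfreecnt initialized to nswbuf / 2 is exactly how the shared 256 got overcommitted by the 5/2 factor discussed above.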