Date: Wed, 23 Oct 2002 15:22:48 +0900
From: Seigo Tanimura <tanimura@axe-inc.co.jp>
To: arch@FreeBSD.org
Cc: tanimura@axe-inc.co.jp, tanimura@FreeBSD.org
Subject: Review Request (Re: Dynamic growth of the buffer and buffer page reclaim)
Message-ID: <200210230622.g9N6MmoK065433@shojaku.t.axe-inc.co.jp>
In-Reply-To: <200210220949.g9M9nroK026750@shojaku.t.axe-inc.co.jp>
References: <200210220949.g9M9nroK026750@shojaku.t.axe-inc.co.jp>
Could anyone interested please review my patch before I either commit it or proceed further?

Possible Issues:

1. Fragmentation of the KVA space

If fragmentation becomes too severe, we have to reclaim the KVA space of the buffers in the clean queue. One solution: let a kernel thread (the page scanner? or the buffer daemon?) scan the buffers in the clean queue periodically. If there is a buffer whose first page has been reclaimed(*), give up the KVA space of the buffer and move it to the EMPTYKVA queue. The unreclaimed pages of that buffer will hopefully stay cached in the backing VM object.

(*) The addresses of the pages in a buffer must start at b_kvabase.

2. Locking of buffers and pages

A buffer may need to be locked across vfs_{un,re}wirepages(). The page queue mutex locks the object and the pindex of a page in vfs_rewirepages(), so it should hopefully be safe. (but I am not sure)

TIA.

On Tue, 22 Oct 2002 18:49:53 +0900, Seigo Tanimura <tanimura@axe-inc.co.jp> said:

tanimura> Introduction:
tanimura>
tanimura> The I/O buffers of the kernel are currently allocated from buffer_map,
tanimura> which is sized statically at boot and never grows. This limits the
tanimura> scale of I/O performance on a host with large physical memory. We
tanimura> used to tune NBUF to cope with that problem. This workaround,
tanimura> however, results in a lot of wired pages not being available for user
tanimura> processes, which is not acceptable for memory-bound applications.
tanimura>
tanimura> In order to run both I/O-bound and memory-bound processes on the same
tanimura> host, it is essential to achieve:
tanimura>
tanimura> A) allocation of buffers from kernel_map to break the limit of a map
tanimura>    size, and
tanimura> B) page reclaim from idle buffers to regulate the number of wired
tanimura>    pages.
tanimura>
tanimura> The patch at:
tanimura>
tanimura> http://people.FreeBSD.org/~tanimura/patches/dynamicbuf.diff.gz
tanimura>
tanimura> implements buffer allocation from kernel_map and reclaim of buffer
tanimura> pages.
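The clean-queue scanner proposed under issue 1 above could look roughly like the following toy sketch. All names here (tbuf, scan_clean_queue, first_page_resident) are hypothetical userland stand-ins, not the patch's code; the real scanner would walk struct buf entries on the clean queue and call bfreekva() before moving a buffer to the EMPTYKVA queue.

```c
/*
 * Toy model of the proposed clean-queue KVA scanner: if the first
 * page of a buffer (the one mapped at b_kvabase) has been reclaimed
 * by the VM system, give up the buffer's KVA space.
 */
#include <assert.h>
#include <stddef.h>

struct tbuf {
	struct tbuf *next;
	void	*kvabase;		/* start of the buffer's KVA range */
	int	 first_page_resident;	/* 0 if page 0 was reclaimed */
};

/*
 * Walk the clean queue; release the KVA of every buffer whose first
 * page is gone.  Returns the number of buffers whose KVA was given up
 * (in the kernel, these would be re-enqueued on the EMPTYKVA queue).
 */
static int
scan_clean_queue(struct tbuf *head)
{
	int reclaimed = 0;

	for (struct tbuf *bp = head; bp != NULL; bp = bp->next) {
		if (bp->kvabase != NULL && !bp->first_page_resident) {
			bp->kvabase = NULL;	/* stands in for bfreekva() */
			reclaimed++;
		}
	}
	return (reclaimed);
}
```

The unreclaimed pages are untouched by the scan, matching the idea that they stay cached in the backing VM object.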
tanimura> With this patch, make kernel-depend && make kernel completes
tanimura> about 30-60 seconds faster on my PC.
tanimura>
tanimura> Implementation in Detail:
tanimura>
tanimura> A) is easy; first you need to do s/buffer_map/kernel_map/. Since an
tanimura> arbitrary number of buffer pages can be allocated dynamically, buffer
tanimura> headers (struct buf) should be allocated dynamically as well. Glue
tanimura> them together into a list so that they can be traversed by boot()
tanimura> et al.
tanimura>
tanimura> In order to accomplish B), we must find buffers that neither the
tanimura> filesystem nor the I/O code will touch. The clean buffer queue holds
tanimura> such buffers. (Exception: if the vnode associated with a clean
tanimura> buffer is held by the namecache, it may access the buffer pages.)
tanimura> Thus, we should unwire the pages of a buffer prior to enqueuing it
tanimura> to the clean queue, and rewire the pages down in bremfree() if the
tanimura> pages have not been reclaimed.
tanimura>
tanimura> Although unwiring gives a page a chance of being reclaimed, we can go
tanimura> further. In Solaris, it is known that file cache pages should be
tanimura> reclaimed prior to the other kinds of pages (anonymous, executable,
tanimura> etc.) for better performance. Mainly due to a lack of time to work
tanimura> on distinguishing the kind of a page to be unwired, I simply pass all
tanimura> unwired pages to vm_page_dontneed(). This approach places most of
tanimura> the unwired buffer pages just one step from the cache queue.
tanimura>
tanimura> Experimental Evaluation and Results:
tanimura>
tanimura> The times taken to complete make kernel-depend && make kernel just
tanimura> after booting into single-user mode have been measured on my ThinkPad
tanimura> 600E (CPU: Pentium II 366MHz, RAM: 160MB) by time(1). The number
tanimura> passed to the -j option of make(1) has been varied from 1 to 30 in
tanimura> order to control the pressure of the memory demand of user processes.
tanimura> The baseline is the kernel without my patch.
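Step B above, unwiring on enqueue to the clean queue and rewiring in bremfree(), can be sketched as a toy userland model. The names (tbuf, tpage, tbuf_unwirepages, tbuf_rewirepages) are hypothetical stand-ins for the patch's vfs_unwirepages()/vfs_rewirepages() operating on struct buf and vm_page:

```c
/*
 * Toy model: a buffer's pages are unwired when it goes onto the clean
 * queue (making them visible to the page scanner) and wired back down
 * in bremfree() if they survived.
 */
#include <assert.h>
#include <stdbool.h>

#define	TPAGES	4

struct tpage {
	int	wire_count;	/* wired pages are skipped by the scanner */
	bool	reclaimed;	/* set by the (simulated) page scanner */
};

struct tbuf {
	struct tpage pages[TPAGES];
};

/* Enqueue on the clean queue: drop the wiring so pages become reclaimable. */
static void
tbuf_unwirepages(struct tbuf *bp)
{
	for (int i = 0; i < TPAGES; i++)
		bp->pages[i].wire_count--;
}

/*
 * bremfree(): wire the surviving pages back down before the buffer is
 * reused.  Returns false if any page was reclaimed meanwhile, in which
 * case the buffer contents must be re-read from the backing object.
 */
static bool
tbuf_rewirepages(struct tbuf *bp)
{
	bool intact = true;

	for (int i = 0; i < TPAGES; i++) {
		if (bp->pages[i].reclaimed)
			intact = false;
		else
			bp->pages[i].wire_count++;
	}
	return (intact);
}
```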
tanimura> The following table shows the results. All of the times are in
tanimura> seconds.
tanimura>
tanimura> -j    baseline                    w/ my patch
tanimura>       real     user     sys       real     user     sys
tanimura>  1    1608.21  1387.94  125.96    1577.88  1391.02  100.90
tanimura> 10    1576.10  1360.17  132.76    1531.79  1347.30  103.60
tanimura> 20    1568.01  1280.89  133.22    1509.36  1276.75  104.69
tanimura> 30    1923.42  1215.00  155.50    1865.13  1219.07  113.43
tanimura>
tanimura> Most of the improvement in the real times is accomplished by the
tanimura> speedup of system calls. The hit ratio of getblk() may have
tanimura> increased, but this has not been examined yet.
tanimura>
tanimura> Another interesting result is the number of swaps, shown below.
tanimura>
tanimura> -j    baseline    w/ my patch
tanimura>  1      0            0
tanimura> 10      0            0
tanimura> 20    141           77
tanimura> 30    530          465
tanimura>
tanimura> Since the baseline kernel does not free buffer pages at all(*), it
tanimura> may be putting too much pressure on the pages.
tanimura>
tanimura> (*) bfreekva() is called only when the whole KVA is too fragmented.
tanimura>
tanimura> Userland Interfaces:
tanimura>
tanimura> The sysctl variable vfs.bufspace now reports the size of the pages
tanimura> allocated for buffers, both wired and unwired. A new sysctl
tanimura> variable, vfs.bufwiredspace, tells the size of the buffer pages
tanimura> wired down. vfs.bufkvaspace returns the size of the KVA space for
tanimura> buffers.
tanimura>
tanimura> Future Work:
tanimura>
tanimura> The handling of unwired pages can be improved by scanning only
tanimura> buffer pages. In that case, we may have to run the vm page scanner
tanimura> more frequently, as Solaris does.
tanimura>
tanimura> vfs.bufspace does not track the buffer pages reclaimed by the page
tanimura> scanner. They are counted only when the buffers associated with
tanimura> those pages are removed from the clean queue, which is too late.
tanimura>
tanimura> Benchmark tools concentrating on disk I/O performance (bonnie,
tanimura> iozone, postmark, etc.) may be more suitable than make kernel for
tanimura> evaluation.
tanimura>
tanimura> Comments and flames are welcome. Thanks a lot.
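As a toy illustration of what the three counters track, and of the invariant between them, consider the following sketch. The structures and the tally() helper are hypothetical, not the patch's accounting code, which maintains the counters incrementally rather than by walking the buffer list:

```c
/*
 * Toy model of the three sysctl counters:
 *   vfs.bufkvaspace   - KVA reserved for buffers
 *   vfs.bufspace      - pages allocated to buffers, wired or unwired
 *   vfs.bufwiredspace - only the wired subset of bufspace
 */
#include <assert.h>
#include <stddef.h>

struct tbuf {
	struct tbuf *next;
	size_t	kvasize;	/* KVA reserved for this buffer */
	size_t	bufsize;	/* pages allocated, wired or unwired */
	size_t	wiredsize;	/* the wired subset of bufsize */
};

struct tcounters {
	size_t	bufkvaspace;
	size_t	bufspace;
	size_t	bufwiredspace;
};

/* Sum the per-buffer sizes over the whole buffer list. */
static struct tcounters
tally(const struct tbuf *head)
{
	struct tcounters c = { 0, 0, 0 };

	for (const struct tbuf *bp = head; bp != NULL; bp = bp->next) {
		c.bufkvaspace += bp->kvasize;
		c.bufspace += bp->bufsize;
		c.bufwiredspace += bp->wiredsize;
	}
	return (c);
}
```

By construction, vfs.bufwiredspace can never exceed vfs.bufspace.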
tanimura> --
tanimura> Seigo Tanimura <tanimura@axe-inc.co.jp> <tanimura@FreeBSD.org>

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message