Date: Sun, 14 May 1995 14:56:16 GMT
From: "John S. Dyson" <toor@jsdinc.root.com>
To: current@FreeBSD.org
Subject: Updated notes on the VFS/VM system
Message-ID: <199505141456.OAA09678@localhost>
Hey gang -- I thought that it would be nice to talk about the improvements to the FreeBSD VFS/VM for the 2.0.5 release. This is a follow-on to the notes that I released a few months ago; this version contains additional background information. By no means is this complete, but I just wanted to let you know about some of the VM/VFS improvements that you are getting.

I reiterate my original comments: I really believe that low-level kernel things such as this are only enabling technology. Most of these things were changed to allow people to use the system for more and bigger applications, and hopefully, some day some of us will get together and write a FreeBSD kernel manual. :-)

Things fixed in the VM/VFS system since 4.4-Lite by various FreeBSD contributors:

1) Collapse problem fully eliminated

Fairly complex code has been added to eliminate the growing swap-space problem intrinsic in the Mach VM system used in 4.4-Lite. You will notice that the system uses much less swap space than it used to. (Earlier versions of FreeBSD had mods to help the situation, but the code in 2.0.5 contains a complete fix.)

The problem is this: when a parent creates a child with the fork(2) system call, the address space is "cloned" through a copy-on-write (COW) mechanism. If both the parent and child modify their address spaces, each creates its own copy of the memory that the parent owned before the fork(2) call. Each process continues to hold a reference to the original memory residing in the parent; this reference is kept because the original memory still contains pages that can be shared. The problem appears when the child exits: the reference count on the original memory does drop to one -- that is good -- but there is no mechanism to properly merge the memory originally in the parent back into the parent if paging has occurred on that memory. This causes "orphan" memory (that can eventually get paged out) on some systems. In the original 4.4-Lite code (and most 4.4-Lite implementations that we know of, except for FreeBSD) there is no way to reclaim this swap space or memory. On a heavily loaded system with significant paging, this will eventually necessitate a reboot.
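To make the fork/COW pattern concrete, here is a minimal userland sketch (my illustration, not from the original notes; the page counts and values are arbitrary). A parent forks, both processes dirty part of the COW-shared address space, and the child exits -- exactly the sequence that left "orphan" memory behind in the unfixed code:

    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPAGES 64
    #define PGSZ   4096

    int
    main(void)
    {
        /* Anonymous memory that will be COW-shared across fork(2). */
        char *mem = malloc(NPAGES * PGSZ);
        pid_t pid;

        if (mem == NULL)
            exit(1);
        memset(mem, 1, NPAGES * PGSZ);  /* parent dirties every page */

        pid = fork();
        if (pid == 0) {
            /*
             * Child writes half the pages; each store forces a private
             * COW copy, while the original pages stay referenced by
             * both processes.
             */
            memset(mem, 2, (NPAGES / 2) * PGSZ);
            _exit(0);
        }
        /* Parent writes the other half, creating its own copies. */
        memset(mem + (NPAGES / 2) * PGSZ, 3, (NPAGES / 2) * PGSZ);
        waitpid(pid, NULL, 0);
        /*
         * The child is gone, so the original object's reference count
         * is back to one; without the 2.0.5 collapse fix, any of its
         * pages that had been paged out could linger as unreclaimable
         * swap until reboot.
         */
        return (0);
    }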
2) The pageout daemon is now very efficient

The original pageout daemon was woken up gratuitously, and when physical memory started being overcommitted, the system would thrash. The new FreeBSD pageout daemon also keeps significant statistics on page usage, so that it doesn't free pages that are likely to be re-used. (The old one was too simple.)

The clock algorithm (or a variation thereof) was originally used in 4.4-Lite. The new algorithm is a most-often-used scheme with an LRU component, followed by a pure LRU. The most-often-used portion of the algorithm selects candidate pages to be deactivated or placed onto the cache queue. The cache queue and inactive queue are "last chance" type queues: whenever a page on a "last chance" queue is used, it is placed back onto the active queue. It is now much less likely that the pageout daemon will get rid of a page that is actively being used, and the daemon does not need to be woken up nearly as often.

3) Pages are not freed as often

A new page queue was added for pages that can be easily re-used by user processes. The identities of the pages on this queue are not lost until they are reused. We still keep a free queue for interrupt code use and for pages that have lost their identity. This technique is used in other operating systems, and it gives the FreeBSD VM system more time to keep a page in memory.

4) The VM system no longer gratuitously wipes the page tables

When COW pages are created, previous usage is tracked at the VM level, making sure that gratuitous page protection is not done. This fix really helps large systems, where there was an O(n^2) type degradation; that degradation has been minimized. On large systems such as wcarchive (aka ftp.freebsd.org), the continued forking and execing of processes caused much unnecessary protection of the pages in read-only and COW sections of memory. The process of scanning the page tables and protecting them wiped the cache and, in some circumstances, caused all processes sharing the memory to fault on their next access. This enhancement should significantly improve the performance of web servers and other sites with many short-lived processes.

Originally, we tried to help the situation by including some tightly crafted code in the pmap (machine-dependent layer). We noticed some speed-ups, but it took further analysis to determine the root cause of the problem. It is a good thing to improve the pmap code, but the real problem lies in the upper layers of the VM system. There is now better tracking of page and object usage, so that the lower-level code is used much less often.

5) The VM system and buffer cache have been merged

mmap is now fully coherent with the read/write system calls. This is an initial implementation; VOP_GETPAGE and VOP_PUTPAGE will be compatibly added soon (probably in 2.2). For example, a write to a file immediately changes the data in the address space of any process that has the file mapped; a quick demonstration follows below. FreeBSD uses a scheme that is minimally invasive into the filesystems themselves. Gradually, there will be improvements in the interface that will require more changes in the lower layers of the filesystems, for example to support VOP_GETPAGE and VOP_PUTPAGE. This will afford an improvement in efficiency in some circumstances, and support a much cleaner way than current methods of swapping onto files.
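Here is a small userland check of that coherency claim (my sketch, not part of the original notes; the file name and sizes are arbitrary). A file is mapped, then modified through the write(2) path, and the mapped view reflects the change with no re-map:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("coherency.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
        char *map;

        if (fd < 0)
            return (1);
        if (write(fd, "old data", 8) != 8)  /* establish file contents */
            return (1);
        map = mmap(NULL, 8, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            return (1);

        /* Overwrite through the syscall path, not through the mapping. */
        lseek(fd, 0, SEEK_SET);
        write(fd, "new data", 8);

        /*
         * With the merged VM/buffer cache the mapping sees the new
         * bytes at once; no msync(2) or re-mapping is required.
         */
        printf("mapped view: %.8s\n", map);

        munmap(map, 8);
        close(fd);
        unlink("coherency.tmp");
        return (0);
    }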
6) Dynamically sized buffer cache

Along with the merged VM/buffer cache, the buffer cache now uses otherwise unused memory; it does not compete with memory that is likely to be needed in the near future. Additionally, the new code does not create dirty pages that are unassociated with buffers, thereby limiting the number of dirty VFS-created pages to the size of the buffer cache. Future enhancements will likely include page-flipping for normal read(2) system calls in certain circumstances; the way the buffer cache is now implemented, this is very feasible.

7) The system now swaps

Swapping has historically been an unpleasant thing in UNIX-like OSes. FreeBSD has not only implemented swapping, but has an intelligent policy as to the swappability of processes. Older UNIX-like OSes did not properly choose which processes to swap out, and this caused problems with process scheduling: once swapping occurred, processes would appear to go to sleep and system utilization would suffer. FreeBSD has an improved algorithm that significantly minimizes this effect. For example, you can run a program that gobbles memory -- older systems would swap out that process!!! FreeBSD resists that temptation.

8) The VM code does many fewer copies

Unfortunately, the standard 4.4-Lite VM code copies all data paged in from files. FreeBSD copies very little of the read-only data paged in from files; the only time the system copies paged-in pages is for COW. The original behavior is a result of the buffer-cache orientation for filesystem I/O; with the advent of mmap and friends, this became a problem. The FreeBSD buffer-cache subsystem does not use additional anonymous memory for its buffers -- it uses the memory that would originally have been mmaped, thereby eliminating the unnecessary copy.

9) Soft RSS limiting has been added

The system allows the system administrator to limit the RSS of processes. Originally, we had an implementation of hard limiting as well. It is within an hour or two of work to make functional again, but its effect is not pleasant at all for the process being limited, and it was not considered very useful at the time.
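For reference, the RSS limit is set through the standard resource-limit interface. Here is a small sketch (mine, not from the notes; the 4 MB figure is arbitrary) that caps a process's resident set with setrlimit(2) before doing memory-heavy work:

    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int
    main(void)
    {
        struct rlimit rl;

        /* Cap the resident set size at 4 MB (soft and hard). */
        rl.rlim_cur = 4 * 1024 * 1024;
        rl.rlim_max = 4 * 1024 * 1024;
        if (setrlimit(RLIMIT_RSS, &rl) != 0) {
            perror("setrlimit");
            return (1);
        }

        /*
         * Under soft RSS limiting the kernel does not kill a process
         * for exceeding this; the pageout code simply prefers this
         * process's pages when reclaiming, keeping its resident set
         * near the limit.
         */
        getrlimit(RLIMIT_RSS, &rl);
        printf("RSS limit: %ld bytes\n", (long)rl.rlim_cur);
        return (0);
    }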
10) The FreeBSD VM intelligently clusters pageins

Pageins are clustered with VM-level intelligence -- not limited to the VFS (I/O-optimized) clustering methods. VFS-style clustering is not as useful for pageins because of the likelihood of pages needing to be faulted in reverse order. It definitely helps to use the VFS-style clustering, but the VM-style clustering helps more.

11) Vastly improved flushing of dirty vnode-backed pages

Since mmap is more likely to be used now, it was necessary to make the pageout of dirty pages more efficient. The current (and still in use) scheme of managing the pages in VM objects is not friendly to many operations needed by the VM system; prefaulting and vnode pageouts could be done much more efficiently than the current code allows. Modifications have already been made in FreeBSD 2.0.5 to help this situation, but further work is being done to fix the access methods for the VM page data structures.

12) VFS_BIO bounce buffering has been added

A fairly architecture-neutral, non-invasive bounce-buffer scheme has been added to vfs_bio (actually vm_machdep for now). Note that in general only 1-3 lines of code need to be added to each block device driver that needs bouncing. Machines such as ISA-based i386 boxes have problems addressing certain regions of memory with DMA devices. Rather than segmenting memory into DMA-able and non-DMA-able regions -- and because of the significant complications that arise when implementing such schemes -- the FreeBSD approach to managing non-DMA-able memory is to "bounce" data through the DMA-able memory regions. The current scheme is mostly usable from the strategy routines of block devices, but there are entry points available for other types of memory needs; examples can be found in the SCSI code. One major goal of the FreeBSD bounce code is to minimize the effect on existing and future device drivers.

13) More efficient ordering of buffers in the vnode dirty list

This makes sync work better when there are lots of delayed-write buffers. It is mostly helpful if one modifies ufs_readwrite to retain delayed-write buffers, as opposed to immediately queueing async writes.

14) Much better VFS name caching

A hashing scheme was added to vastly improve name-lookup performance on large systems.
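To illustrate the idea behind a hashed name cache (this sketch is mine -- the structure, names, and hash function are invented for illustration, not FreeBSD's actual code): instead of walking one long list of cached names, a lookup hashes the parent directory identity together with the component name to pick a short per-bucket chain.

    #include <stddef.h>
    #include <string.h>

    #define NCHHASH 256             /* power of two, illustrative size */

    struct namecache {
        struct namecache *nc_next;  /* hash chain link */
        void             *nc_dvp;   /* parent directory vnode */
        void             *nc_vp;    /* vnode the name resolves to */
        char              nc_name[32];
    };

    static struct namecache *nchashtbl[NCHHASH];

    /* Hash the parent vnode pointer together with the name bytes. */
    static unsigned int
    nchash(void *dvp, const char *name)
    {
        unsigned int h = (unsigned int)(size_t)dvp;

        while (*name)
            h = h * 33 + (unsigned char)*name++;
        return (h & (NCHHASH - 1));
    }

    /* Look up "name" under dvp: O(chain length), not O(cache size). */
    void *
    cache_lookup(void *dvp, const char *name)
    {
        struct namecache *ncp;

        for (ncp = nchashtbl[nchash(dvp, name)]; ncp != NULL;
            ncp = ncp->nc_next)
            if (ncp->nc_dvp == dvp && strcmp(ncp->nc_name, name) == 0)
                return (ncp->nc_vp);
        return (NULL);              /* cache miss */
    }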
15) New VFS cluster code

The original cluster code, although working, appeared to violate some layering and depended on a large KVA space for the clustered I/O buffers; for a large number of buffers, too much KVA was required. Special buffers are now used to support clustering, thereby minimizing KVA space requirements. This helps both CISC and some RISC architectures (such as the R3000/R4000), where each 2MB or 4MB costs something significant (like page-table pages or TLB entries).

In the original 4.4-Lite scheme, much more KVA was needed to support a given number of buffers than is actually necessary. In order to support a cluster size of 64K, each buffer in the buffer cache needed to have 64K of KVA allocated to it. Of course, this does not take up real memory directly, but it does take up other fairly scarce resources: kernel virtual memory and page tables. 1000 buffers take up 64MB of kernel address space, for perhaps only 8MB of buffer space!!!! Ouch! The FreeBSD scheme uses a limited number of buffers that have pre-assigned kernel virtual memory for clustering (and certain other) purposes. This allows the FreeBSD buffer size to be 8KB or 16KB instead of 64KB, and still perform clustering effectively.

16) Reusable page-table memory

The original 4.4-Lite implementation did not afford pageable (really, reusable) page tables for x86 architectures. This can be very problematic, causing much unnecessary memory usage. In fact, the original code did not free unused page tables for a running process at all: once page tables were allocated, they were wired permanently into memory until the process exited. FreeBSD can free unused page tables as needed.

John
dyson@root.com