Date:      Sun, 14 May 1995 14:56:16 GMT
From:      "John S. Dyson" <toor@jsdinc.root.com>
To:        current@FreeBSD.org
Subject:   Updated notes on the VFS/VM system
Message-ID:  <199505141456.OAA09678@localhost>

Hey gang --

I thought that it would be nice to talk about the improvements to the
FreeBSD VFS/VM system for the 2.0.5 release.  This is a follow-on to the
notes that I released a few months ago, and this version contains more
background information.  By no means is this complete, but I just wanted
to let you know about some of the VM/VFS improvements that you are
getting.

I reiterate my original comments: I really believe that low level
kernel things such as this are only enabling technology.  Most of these
things were changed to allow people to use the system for more and bigger
applications, and hopefully, some day some of us will get together and
write a FreeBSD kernel manual. :-).

Things fixed in the VM/VFS system since 4.4-Lite by various FreeBSD contributors

1)	Collapse problem fully eliminated
	Fairly complex code has been added to eliminate the growing
	swap space problem intrinsic to the Mach VM system used in
	4.4-Lite.  You will notice that the system uses much less
	swap space than it used to.  (Earlier versions of FreeBSD
	had mods to help the situation, but the code in 2.0.5 contains a
	complete fix.)

	The problem is that when a parent creates a child
	by the fork(2) system call -- the address space is "cloned"
	through a copy-on-write (COW) mechanism.  If both the parent
	and child modify their address spaces -- each will create
	their own copy of the modified memory owned by the parent
	before the fork(2) system call.  Each process continues to
	hold a reference to the original memory residing in the parent.
	This reference is kept, because the original memory still contains
	pages that can be shared.
	
	The problem appears when the child exits: the reference count on
	the original memory does drop to one -- that is good.  But there
	is no mechanism to properly merge the memory originally in the parent
	back into the parent if paging has occurred on the original memory
	from the parent.  This causes "orphan" memory (that can eventually
	get paged out) on some systems.  In the original 4.4-Lite code (and
	most 4.4-Lite implementations that we know of, except for FreeBSD)
	there is no way to reclaim this swap space or memory.  On a heavily
	loaded system with significant paging, this will eventually
	necessitate a reboot.
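
	To make the scenario concrete, here is a tiny user-space program
	that sets up exactly this pattern (it demonstrates the workload,
	not the kernel fix itself):

	#include <stdlib.h>
	#include <string.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define REGION	(1024 * 1024)	/* 1MB of anonymous memory */

	int
	main(void)
	{
		char *mem = malloc(REGION);
		pid_t pid;

		if (mem == NULL)
			return (1);
		memset(mem, 0xa5, REGION);	/* dirty the region pre-fork */
		pid = fork();
		if (pid == 0) {
			memset(mem, 0x5a, REGION); /* child: makes its own copy */
			_exit(0);
		}
		memset(mem, 0xc3, REGION);	/* parent: makes its copy too */
		waitpid(pid, NULL, 0);
		/*
		 * The child has exited, so the pre-fork memory is referenced
		 * only by the parent.  Without the collapse fix, any of those
		 * original pages that had been paged out could linger as
		 * unreclaimable "orphan" swap.
		 */
		return (0);
	}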

2)	The pageout daemon is now very efficient
	The original pageout daemon was woken up gratuitously.  When
	physical memory started being overcommitted, the system would
	thrash.  Also, the new FreeBSD pageout daemon keeps significant
	statistics on page usage, so that it doesn't free pages that
	are likely to be re-used.  (The old one was too simple.)

	The clock algorithm (or a variation thereof) was originally used
	in 4.4-Lite.  The new algorithm is a most-often-used policy with an
	LRU component, followed by a pure LRU stage.  The most-often-used portion of
	the algorithm is used to select candidates for pages to be deactivated
	or placed onto the cache queue.  The cache queue and inactive queue
	are "last chance" type queues.  Whenever a page on a "last chance"
	queue is used, it is placed back onto the active queue.

	It is now much less likely that the pageout daemon will get rid of
	a page that is actively being used.  Also, the pageout daemon does
	not need to be woken up nearly as often.
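
	A rough sketch of the queue discipline (all of the names below are
	invented for illustration -- this is not the 2.0.5 source):

	enum pqueue { PQ_ACTIVE, PQ_INACTIVE, PQ_CACHE, PQ_FREE };

	struct page {
		int	act_count;	/* usage statistic (MOU component) */
		int	referenced;	/* copy of hardware reference bit */
		enum pqueue queue;
	};

	/* Hypothetical helper: unlink a page and append it to queue q. */
	static void
	move_to(struct page *p, enum pqueue q)
	{
		p->queue = q;
	}

	/* Run by the pageout daemon over the active queue. */
	static void
	scan_active_page(struct page *p)
	{
		if (p->referenced) {
			p->referenced = 0;
			p->act_count++;			/* reward recent use */
		} else if (--p->act_count <= 0)
			move_to(p, PQ_INACTIVE);	/* demote: last chance */
	}

	/* Any use of a "last chance" page promotes it back to active. */
	static void
	page_referenced(struct page *p)
	{
		if (p->queue == PQ_INACTIVE || p->queue == PQ_CACHE)
			move_to(p, PQ_ACTIVE);
	}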

3)	Pages are not freed as often
	A new page queue was added for pages that can easily be re-used
	by user processes.  The identities of the pages on this queue
	are not lost until they are actually reused.  We still keep a free
	queue for interrupt code use and for pages that have lost their
	identity.

	This technique is used in other operating systems as well, and it
	gives the FreeBSD VM system more time to keep a page's contents in
	memory.
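
	A sketch of the idea (again with names invented for illustration):
	before taking a page off the free queue, the system can check
	whether the page it wants is still sitting on the cache queue with
	its identity intact:

	struct vm_object;

	struct vm_page {
		struct vm_object *object;	/* identity: owning object... */
		unsigned long	  offset;	/* ...and offset within it */
	};

	/* Hypothetical helpers standing in for the real hash/queue code. */
	struct vm_page	*page_hash_lookup(struct vm_object *, unsigned long);
	void		 reactivate(struct vm_page *);	/* pull off cache queue */
	struct vm_page	*alloc_free_page(struct vm_object *, unsigned long);

	struct vm_page *
	page_get(struct vm_object *obj, unsigned long off)
	{
		struct vm_page *p = page_hash_lookup(obj, off);

		if (p != NULL) {
			/* Identity preserved on the cache queue: reclaim
			 * the page with no disk I/O at all. */
			reactivate(p);
			return (p);
		}
		/* Identity lost: take a truly free page and fill it. */
		return (alloc_free_page(obj, off));
	}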

4)	The VM system no longer gratuitously wipes the page tables.
	When COW pages are created, previous usage is tracked at the
	VM level, making sure that gratuitous page protection is not
	done.  This fix really helps large systems, where there was
	an O(n^2) type degradation.  That degradation has been minimized.

	On large systems such as "wcarchive" (AKA ftp.freebsd.org), the
	continued forking and execing of processes caused much unnecessary
	protection of the pages in read-only and COW sections of memory.  The
	process of scanning the page tables and protecting them wiped the
	cache and, in some circumstances, caused all processes sharing
	the memory to fault on their next access.  This enhancement should
	significantly improve the performance of Web servers and other
	sites where there are many short-lived processes.

	Originally, we tried to help the situation by including some
	tightly crafted code in the pmap (machine dependent layer).  We noticed
	some speed-ups but it took further analysis to determine the root-cause
	of the problem.  It is a good thing to improve the pmap code, but the
	real problem lies in the upper layers of the VM system.  There is now
	better tracking of page and object usage so that the lower level code
	is used much less often.
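
	One way to picture the fix (the flag and function names here are
	invented for illustration, not taken from the source):

	struct vm_object {
		int	flags;
	#define	OBJ_COW_PROTECTED	0x01	/* pages already protected */
	};

	/* Hypothetical stand-in for the pmap-level protection scan. */
	void	pmap_protect_object(struct vm_object *);

	/*
	 * Called at fork time for each COW region.  Only the first fork
	 * pays for the page-table scan; later forks of the same object
	 * skip it, avoiding the O(n^2) behavior seen with many
	 * short-lived processes.
	 */
	void
	object_set_cow(struct vm_object *obj)
	{
		if (obj->flags & OBJ_COW_PROTECTED)
			return;			/* nothing to wipe */
		pmap_protect_object(obj);	/* expensive, done once */
		obj->flags |= OBJ_COW_PROTECTED;
	}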

5)	The VM system and buffer cache have been merged.
	mmap is now fully coherent with the read/write system calls.  This
	is an initial implementation, and VOP_GETPAGE and VOP_PUTPAGE
	will be compatibly added soon (probably in V2.2).  For example, a
	write to a file immediately causes the data to change
	in the address space of any process that has the file mapped.

	FreeBSD uses a scheme that is minimally invasive to the filesystems
	themselves.  Gradually, there will be improvements in the interface
	that will require more changes in the lower-layers of the filesystems,
	for example to support VOP_GETPAGE and VOP_PUTPAGE.  This will afford
	an improvement in efficiency in some circumstances, and support
	a much cleaner way than current methods of swapping onto files.
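
	You can observe the coherence from user space with a little demo
	program like this one (ordinary portable user code):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int
	main(void)
	{
		int fd = open("coherence.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
		char *map;

		if (fd == -1 || write(fd, "old data", 8) != 8)
			return (1);
		map = mmap(NULL, 8, PROT_READ, MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return (1);

		/* Rewrite the file through write(2)... */
		lseek(fd, 0, SEEK_SET);
		write(fd, "new data", 8);

		/* ...and the mapped view sees the change immediately. */
		printf("%.8s\n", map);		/* prints "new data" */

		munmap(map, 8);
		close(fd);
		return (0);
	}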

6)	Dynamically sized buffer cache
	Along with the merged VM/buffer cache, the buffer cache now uses
	otherwise unused memory.  It does not compete with memory that
	is likely to be needed in the near future.  Additionally, the new
	code does not create dirty pages that are not associated with
	buffers, thereby limiting the number of dirty VFS-created pages
	to the size of the buffer cache.

	Future enhancements will likely include page-flipping for normal
	read(2) system calls in certain circumstances.  The way the buffer
	cache is now implemented makes this very feasible.

7)	The system now swaps.
	Swapping has historically been an unpleasant thing in UNIX-like
	OSes.  Not only has FreeBSD implemented swapping, but it also has
	an intelligent policy as to the swappability of processes.

	Older UNIX-like OSes did not properly choose the correct processes
	to swap out, and this caused problems with process scheduling.  One
	effect of this is that once swapping occurred, processes would appear
	to go to sleep and system utilization would suffer.  FreeBSD has
	an improved algorithm that significantly minimizes this effect of
	processes appearing to go to sleep.  For example, you can run a
	program that gobbles memory.  Older systems would swap out that
	process!!!  FreeBSD resists that temptation.

8)	The VM code does many fewer copies.
	Unfortunately, the standard 4.4-Lite VM code copies all data
	paged in from files.  FreeBSD copies very little of the read-only
	data paged in from files; the only time that the system copies
	paged-in pages is for COW.

	The original behavior is a result of the buffer-cache orientation
	of filesystem I/O.  With the advent of mmap and friends, this
	becomes a problem.  The FreeBSD buffer-cache subsystem does not
	use additional anonymous memory for its buffers; it uses the
	memory that would have been mmaped anyway, thereby eliminating
	the unnecessary copy.

9)	Soft RSS limiting has been added.
	FreeBSD now allows the system administrator to limit the RSS
	(resident set size) of processes.

	Originally, we had an implementation of hard limiting also.  It is
	within an hour or two of work to make functional again...  But
	its effect is not pleasant at all for the process being limited, and
	it was not considered very useful at the time.
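
	From the programmer's side, the natural interface for this is the
	standard RLIMIT_RSS resource limit; for example:

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/resource.h>

	int
	main(void)
	{
		struct rlimit rl;

		if (getrlimit(RLIMIT_RSS, &rl) == -1)
			return (1);
		printf("rss limit: cur=%ld max=%ld\n",
		    (long)rl.rlim_cur, (long)rl.rlim_max);

		/*
		 * Lower the soft limit to 4MB.  With soft limiting, a
		 * process over its limit is a preferred victim for the
		 * pageout daemon rather than being stopped outright.
		 */
		rl.rlim_cur = 4 * 1024 * 1024;
		if (setrlimit(RLIMIT_RSS, &rl) == -1)
			return (1);
		return (0);
	}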

10)	The FreeBSD VM intelligently clusters pageins.
	Pageins are clustered using VM-level knowledge -- they are not
	limited to the VFS (I/O optimized) clustering methods.

	The VFS style clustering is not as useful for pageins because of
	the likelihood of pages needing to be faulted in reverse order.  It
	definitely does help to use the VFS style clustering, but the VM
	style clustering helps more.
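
	Conceptually (names invented for this sketch), the VM clustering
	grows a window of pages around the faulting page in both
	directions, rather than only reading ahead:

	#define	CLUSTER_BEHIND	4	/* pages to bring in before fault */
	#define	CLUSTER_AHEAD	4	/* pages to bring in after fault */

	/* Hypothetical helpers for this sketch. */
	int	page_resident(unsigned long pindex);
	void	start_pagein(unsigned long first, unsigned long count);

	/*
	 * Fault on page 'pindex': grow the window around it, stopping at
	 * pages that are already resident, then issue one clustered
	 * pagein.  Because the window also extends backwards, a process
	 * touching memory in descending order still gets clustered I/O.
	 */
	void
	cluster_fault(unsigned long pindex, unsigned long object_size)
	{
		unsigned long first = pindex, last = pindex;

		while (first > 0 && pindex - first < CLUSTER_BEHIND &&
		    !page_resident(first - 1))
			first--;
		while (last + 1 < object_size && last - pindex < CLUSTER_AHEAD &&
		    !page_resident(last + 1))
			last++;

		start_pagein(first, last - first + 1);
	}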

11)	Vastly improved flushing of dirty vnode-backed pages
	Since mmap is more likely to be used now, it was necessary
	to make the pageout of dirty pages more efficient.

	The current (and still in use) scheme of managing the pages
	in VM objects is not friendly to many operations needed by the
	VM system.  Prefaulting and vnode pageouts could be done much
	more efficiently than the current code allows.  Modifications have
	already been made to FreeBSD V2.0.5 to help this situation, but
	further work is being done to fix the access methods to the VM
	page data structures.

12)	VFS_BIO bounce buffering has been added.
	A fairly architecture-neutral, non-invasive bounce-buffer scheme
	has been added to vfs_bio (actually vm_machdep for now).  Note
	that in general only 1-3 lines of code need to be added to each
	block device driver that needs bouncing.

	Machines such as ISA-based i386 systems have problems addressing
	certain regions of memory with DMA devices.  Rather than segmenting
	memory into DMA-able and non-DMA-able regions, and because of the
	significant complications that arise when implementing such schemes,
	the FreeBSD approach to managing non-DMA-able memory is to "bounce"
	data through the DMA-able memory regions.  The current scheme is
	mostly used by the strategy routines of block devices, but there are
	entry points available for other types of memory needs.  Examples
	can be found in the SCSI code.

	One major goal of the FreeBSD bounce code is to minimize the effect on
	existing and future device drivers.
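
	The core idea can be sketched like this (illustrative only -- and
	it checks just the start of the buffer, while real code must check
	every page):

	#define	ISA_DMA_LIMIT	(16UL * 1024 * 1024)	/* 24 address bits */

	/* Hypothetical helpers for this sketch. */
	unsigned long	phys_addr(void *va);
	void	       *low_buffer_alloc(unsigned long size);
	void		copy_bytes(const void *src, void *dst, unsigned long n);
	void		dma_start(void *va, unsigned long size, int write);

	/*
	 * Bounce a write: if the data is not DMA-able, copy it into a
	 * pre-allocated buffer below 16MB and DMA from there.  Reads
	 * work in reverse: DMA into the low buffer, then copy up.
	 */
	void
	dma_write(void *data, unsigned long size)
	{
		void *va = data;

		if (phys_addr(data) + size > ISA_DMA_LIMIT) {
			va = low_buffer_alloc(size);	/* "bounce" buffer */
			copy_bytes(data, va, size);
		}
		dma_start(va, size, 1);
	}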

13)	More efficient ordering of buffers in the vnode dirty list
	This makes sync work better when there are lots of delayed-write
	buffers.  It is mostly helpful if one modifies ufs_readwrite to
	retain delayed-write buffers as opposed to immediately queueing
	async writes.

14)	Much better VFS name caching.
	A hashing scheme was added that vastly improves name-lookup
	performance on large systems.
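
	A rough picture of such a scheme (invented names; the real cache
	key carries more state than shown here):

	#include <stddef.h>
	#include <string.h>

	#define	NCHASH	256		/* number of buckets, power of two */

	struct vnode;

	struct ncentry {
		struct ncentry	*hash_next;
		struct vnode	*dvp;	/* directory the name lives in */
		struct vnode	*vp;	/* vnode the name resolves to */
		char		 name[32];
	};

	static struct ncentry *nchash[NCHASH];

	/* Mix the directory pointer and the name into a bucket index. */
	static unsigned
	nc_hash(struct vnode *dvp, const char *name)
	{
		unsigned h = (unsigned)(size_t)dvp;

		while (*name)
			h = h * 33 + (unsigned char)*name++;
		return (h & (NCHASH - 1));
	}

	struct vnode *
	cache_lookup_sketch(struct vnode *dvp, const char *name)
	{
		struct ncentry *e;

		for (e = nchash[nc_hash(dvp, name)]; e != NULL; e = e->hash_next)
			if (e->dvp == dvp && strcmp(e->name, name) == 0)
				return (e->vp);
		return (NULL);	/* miss: caller does the real lookup */
	}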

15)	New VFS cluster code.
	The original cluster code, although working, appeared to violate
	some layering and depended on a large kva space for the clustered
	I/O buffers.  So for a large number of buffers, too much kva was
	required.  Special buffers are now used to support clustering,
	thereby minimizing kva space requirements.  This helps both
	CISC and some RISC architectures (such as R3000/R4000), where each
	2MB or 4MB costs something significant (like page table pages or
	TLB entries.)

	In the original 4.4-Lite scheme, much more kva was needed to support
	a given number of buffers than what appears to be necessary.  In
	order to support a cluster size of 64K, each buffer in the buffer
	cache needed to have 64K of kva allocated to it.  Of course, this
	does not take up real memory directly, but it does take up other
	fairly scarce resources, namely kernel virtual memory and
	page tables.  1000 buffers take up 64MB of kernel space, for perhaps
	only 8MB of buffer space!!!!  Ouch!

	The FreeBSD scheme uses a limited number of buffers that have
	pre-assigned kernel virtual memory for clustering (and certain other)
	purposes.  This allows the FreeBSD buffer size to be 8KB or 16KB,
	instead of 64KB, and still perform clustering effectively.
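
	To put numbers on it (the count of dedicated cluster buffers below
	is invented for the example; only the 64K and 8K sizes come from
	the text above):

	#define	NBUF		1000		/* ordinary buffers */
	#define	BUF_KVA		(8 * 1024)	/* 8KB of kva each */
	#define	NCLBUF		32		/* dedicated cluster buffers */
	#define	CLBUF_KVA	(64 * 1024)	/* a full 64K window each */

	/*
	 * old scheme:   NBUF * CLBUF_KVA          = 64MB of kva
	 * new scheme:   NBUF * BUF_KVA            =  8MB of kva
	 *             + NCLBUF * CLBUF_KVA        =  2MB (fixed pool)
	 */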

16)	Reusable page-table memory.
	The original 4.4-Lite implementation did not afford pageable
	(really, reusable) page-table memory for x86 architectures.  This
	can be very problematic, causing much unnecessary memory usage.  In
	fact, the original code did not free unused page tables for a
	running process at all.  So once page tables were allocated, they
	were wired permanently into memory until the process exited.
	FreeBSD can free unused page tables as needed.
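
	On the i386 the check is cheap: a page-table page that maps nothing
	can be unhooked from the page directory and its memory released.
	Roughly (illustrative names):

	#define	NPTEPG	1024	/* a 4KB page-table page holds 1024 PTEs */

	typedef unsigned int pt_entry_t;

	/* Hypothetical helpers for this sketch. */
	void	pde_clear(int pdindex);
	void	free_page(pt_entry_t *ptp);

	/*
	 * If every entry in a page-table page is invalid, the page maps
	 * nothing: clear its page-directory entry and free the page,
	 * instead of leaving it wired until the process exits.
	 */
	void
	try_release_ptp(pt_entry_t *ptp, int pdindex)
	{
		int i;

		for (i = 0; i < NPTEPG; i++)
			if (ptp[i] != 0)
				return;		/* still maps something */
		pde_clear(pdindex);
		free_page(ptp);
	}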


John
dyson@root.com


