FreeBSD Mail Archives

Date:      Wed, 8 Jul 1998 15:43:09 -0500 (CDT)
From:      Joel Ray Holveck <joelh@gnu.org>
To:        Stefan Eggers <seggers@semyam.dinoco.de>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Someone working on swapoff? 
Message-ID:  <199807082043.PAA00578@detlev.UUCP>
In-Reply-To: <199807081928.VAA25030@semyam.dinoco.de>
References:  <199807081607.LAA02079@detlev.UUCP> <199807081928.VAA25030@semyam.dinoco.de> <199804070553.AAA10297@dyson.iquest.net>

Remember that this doc was old when I got it in April.

Happy hacking,
joelh

------- Start of forwarded message -------
From: "John S. Dyson" <toor@dyson.iquest.net>
Subject: Re: swap-leak in 2.2.5 ?
To: toor@dyson.iquest.net (John S. Dyson)
Date: Tue, 7 Apr 1998 00:53:33 -0500 (EST)
Cc: joelh@gnu.org, toor@dyson.iquest.net
Message-Id: <199804070553.AAA10297@dyson.iquest.net>

> > A lot of what I'm asking is stuff that is fairly basic to the VM
> > system, and not really that related to monitoring.  Have you a white
> > paper (or something that serves the same function) on FreeBSD VM?
> > (You said that you don't have something on monitoring, but if you can
> > point me to something else it may help.)
> >
> I'll append a doc that talks about entry points, etc.  However, the
> MACH VM is probably the best intro to FreeBSD VM.  They are similar,
> (but FreeBSD is more specific to U**X), and the terminology is similar.
>

Sorry about not appending the out-of-date roadmap doc!!!  Here it is:

Here is a very rough description of the FreeBSD MACH-based VM system
internals...  This document is not definitive, but meant as a quick reference
or overview.  The source code is currently the ONLY definitive documentation.
If there is enough positive feedback from this document, I might be motivated
enough to fill this in with more detail.  Routines or symbols that will
probably be supported forever in one way or another have a "SUPPORTED" notation.
Those routines that could be at risk sometime in the future have no such
notation.

Definitions:

	Data Structure		Equivalent

	vm_map ...............  Address space
	vm_map_entry .........  Portion of address space pointing to only
				one vm_object or another vm_map
	vm_object ............  Repository for data
	vm_page	..............  Indivisible amount of data
	pmap .................  Physical representation of Address space

	(Note that the names above, with "_t" appended refer to pointers
	 as opposed to the structure itself...  e.g. vm_map_t is a pointer to a
	 struct vm_map.)


	Terminology		Equivalent

	pager ................  A sort-of class that is described by
	                        information in the vm_object and the type
	                        of access to external data
	vnode_pager ..........  Code in the vm system that knows how to do
	                        paging I/O with filesystem files.
	swap_pager ...........  Code in the vm system that knows how to do
	                        paging I/O with swap partitions and files.
	                        Anonymous data in the system (e.g. bss) is
	                        paged using this.
	default_pager ........  A logical "placeholder" pager that takes
	                        few system resources until paging is needed.
	                        The default_pager is currently used only for
	                        vm_object's that might need to be paged using
	                        the swap_pager.  When pageouts are needed,
	                        objects that are marked "default" are converted
	                        to "swap" with the associated allocation of
	                        swap data structures.
	device_pager .........  Code in the vm system that can provide memory
	                        mapped I/O with memory mapped devices.  Most
	                        common use of this is X-Windows.
	kva ..................  Kernel virtual address, usually of type
	                        vm_offset_t or caddr_t.
	sva,eva,va ...........  Virtual address(es).  sva - start virtual
	                        address, eva - end virtual address.
	pa ...................  Physical address.
	m,p ..................  Usually used for vm_page_t.
	offset ...............  Offsets into objects are usually vm_ooffset_t,
	                        which translates into a long long (equiv to
				the filesystem off_t.)
	wired ................  Not pageable.
	clean ................  (As in "the page is clean"), means it is in
	                        sync with the backing store.
	page_coloring ........	Unless a VM system provides special support
				for direct mapped caches, the system will
				often allocate pages suboptimally for machines
				with such caching schemes.  The term
				"page_coloring" as I use it, consists of various
				ways that the system provides support for
				improving utilization of system caches.  FreeBSD
				provides support for processor caches that helps
				even the more sophisticated 4-way caching
				schemes (as in a PPro.)
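
To make the page_coloring entry a little more concrete, here is a minimal
userland sketch of the usual textbook calculation for a physically indexed
cache.  It is NOT the FreeBSD allocator; the sk_ names, the 4K page size and
the cache geometry passed in are all assumptions made for illustration.

```c
#include <assert.h>

#define SK_COLOR_PAGE_SIZE	4096UL	/* assumed page size */

/*
 * Number of distinct page colors: the size of one cache way
 * divided by the page size (at least one color).
 */
static unsigned long
sk_ncolors(unsigned long cache_size, unsigned long ways)
{
	unsigned long way_size;

	assert(ways != 0);
	way_size = cache_size / ways;
	if (way_size < SK_COLOR_PAGE_SIZE)
		return (1);
	return (way_size / SK_COLOR_PAGE_SIZE);
}

/* Color of the physical page that contains address pa. */
static unsigned long
sk_page_color(unsigned long pa, unsigned long ncolors)
{
	assert(ncolors != 0);
	return ((pa / SK_COLOR_PAGE_SIZE) % ncolors);
}
```

Two pages with the same color compete for the same cache lines, so an
allocator that spreads consecutive object pages across different colors
tends to make better use of the cache.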

Useful "handles" describing address spaces:
	(warning -- in the most general case, you should make sure that
	 there is a "curproc"!!!).  If your code is being called from
	 a system call or I/O initiation routine, you should be safe.

	Current process address space (vm_map_t): (SUPPORTED)
		&curproc->p_vmspace->vm_map

	Current process pmap (pmap_t): (SUPPORTED)
		&curproc->p_vmspace->vm_pmap

	Kernel address space (vm_map_t): (SUPPORTED)
		kernel_map

	Kernel pmap (pmap_t): (SUPPORTED)
		kernel_pmap, (best referred to as vm_map_pmap(kernel_map))

Less commonly used "handles":

        Address spaces (submaps of kernel_map, unless noted otherwise):
	        never checked for modification: clean_map
	        buffer cache:                   buffer_map (submap of clean_map)
	        pager and cluster buffers:      pager_map (submap of clean_map)
	        used for bounce buffers:        io_map (submap of clean_map)
	        malloc and mbuf cluster area:	kmem_map
		mbuf clusters:			mb_map (submap of kmem_map)

	        args during exec:               exec_map
	        temporary mapping of exec hdr:	exech_map
	        UPAGES per process:             upage_map

Important macros:

	pa = VM_PAGE_TO_PHYS(m); (SUPPORTED)
		returns the physical address for a vm_page_t.

	m = PHYS_TO_VM_PAGE(pa);
		returns the vm_page_t associated with a physical address.
		(try to avoid PHYS_TO_VM_PAGE -- it doesn't always work,
		 because not every physical address has a page, and it
		 usually implies a design flaw, or a quick work-around
		 that needs to be corrected in the future.)

	PAGE_WAKEUP(m); (SUPPORTED)
		This is used to free the lock on a page as represented
		by the PG_BUSY bit.  Other processes that are waiting
		on that page are woken up.  In order to wait on a page
		the following could be done:

			s = splhigh();
			while ((m->flags & PG_BUSY) || m->busy) {
				m->flags |= PG_WANTED;
				tsleep(m, PVM, "xxxxxx", 0);
			}
			splx(s);

		You would do that normally after a vm_page_lookup.

	VM_WAIT; (SUPPORTED)
		Use this if you have tried to do a vm_page_alloc
		in non-interrupt state, and vm_page_alloc did not return
		a vm_page_t pointer (vm_page_alloc returns NULL on failure.)
		VM_WAIT blocks your process and wakes up the pageout daemon.
		When this returns, there will likely be some memory available,
		so vm_page_alloc can be retried.
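
The allocate-then-wait pattern described for VM_WAIT can be sketched in
userland with stubs.  Everything prefixed sk_ is hypothetical: sk_page_alloc
stands in for vm_page_alloc (non-blocking, NULL on failure) and sk_vm_wait
stands in for VM_WAIT by pretending the pageout daemon freed one page.

```c
#include <assert.h>
#include <stddef.h>

struct sk_wpage {
	int busy;	/* models the PG_BUSY bit */
};

static struct sk_wpage sk_the_page;
static int sk_free_pages;	/* pages the stub allocator can hand out */
static int sk_waits;		/* number of times we had to wait */

/* Stub for vm_page_alloc(): never blocks, returns NULL when out of pages. */
static struct sk_wpage *
sk_page_alloc(void)
{
	if (sk_free_pages == 0)
		return (NULL);
	sk_free_pages--;
	sk_the_page.busy = 1;	/* pages are handed back busy */
	return (&sk_the_page);
}

/* Stub for VM_WAIT: pretend we slept and one page was freed meanwhile. */
static void
sk_vm_wait(void)
{
	sk_waits++;
	sk_free_pages++;
}

/* Retry loop; only appropriate in a context that is allowed to sleep. */
static struct sk_wpage *
sk_alloc_retry(void)
{
	struct sk_wpage *m;

	while ((m = sk_page_alloc()) == NULL)
		sk_vm_wait();
	assert(m->busy);
	return (m);
}
```

In real kernel code the loop would only be used in non-interrupt state,
since the wait blocks the calling process.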
	
Likely return values from most of the vm routines <vm/vm_param.h>:

	KERN_SUCCESS, KERN_INVALID_ADDRESS, KERN_PROTECTION_FAILURE,
	KERN_NO_SPACE, KERN_INVALID_ARGUMENT, KERN_FAILURE,
	KERN_RESOURCE_SHORTAGE, KERN_NOT_RECEIVER?, KERN_NO_ACCESS?

Important X86 tidbit:

	The kernel_pmap is always effectively mapped into the user's pmap.
	When referring to kernel space, one should use the kernel_pmap, and
	all processes will see the change in the kernel.

Memory queues:

	vm_page_queue_free	--	free pages
	vm_page_queue_zero	--	free pages that are zero
	vm_page_queue_cache	--	free pages that still have info
                                        may NOT be BUSY or mapped.
	vm_page_queue_active	--	active pages
	vm_page_queue_inactive	--	inactive pages

Commonly needed VM system routines:

int vm_map_find(map, object, offset, addr, length, find_space,
		prot, max, cow);
(SUPPORTED)

	This finds AND allocates virtual space from the specified map 
	(Address space).  The user can optionally specify a vm object
	to map into the space (e.g. mapped file.)

	The parameters associated with the address space include:
		map	--	The specific vm_map_t involved with the op
		addr	--	Ptr to the address in the vm_map
		length	--	Length of the mapping in bytes

		Note that the address (addr) above is equivalent to
		the address in a process or in the kernel.  If the address
		is >= VM_MIN_KERNEL_ADDRESS you MUST use kernel_map, and
		not &curproc->p_vmspace->vm_map!!!  Secondary note, unless
		you *really* know what you are doing, do not do a vm_map_find
		in the kernel map.  Please use kmem_alloc instead.

		If you specify an initial value for addr, and find_space
		is zero, then the allocation request will succeed only if
		there is enough virtual address space available at the
		specified address.

	The parameters associated with the vm object:
		object	--	Optional VM object -- if NULL, a default
				pager object will be created as needed when
				a fault happens thereby making the object
				necessary (a container for the page.)
		offset	--	Offset into the object (long long, vm_ooffset_t)


	Additional parameters modifying the operation of the routine:

		find_space	-- If there is no space at 'addr', space is
				   found after that place.
		prot,max	-- R/W permissions to address space:
				   VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXEC
		cow		-- Copy-on-write, original obj is NOT modified.

	Error returns:
		KERN_SUCCESS		-- Operation completed
		KERN_INVALID_ADDRESS	-- Address specified is invalid
		KERN_NO_SPACE		-- No space in the map
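
The interaction between addr and find_space can be modelled with a toy
address map in userland.  This is only a sketch of the behaviour described
above, not the kernel code: the sk_ names, the fixed-size entry array and
the numeric return values are assumptions, and objects, protections and COW
are left out entirely.

```c
#include <assert.h>

#define SK_SUCCESS	0	/* stands in for KERN_SUCCESS */
#define SK_NO_SPACE	3	/* stands in for KERN_NO_SPACE */
#define SK_MAXENT	8

struct sk_entry {
	unsigned long start, end;	/* covers [start, end) */
};

struct sk_map {
	unsigned long min, max;		/* bounds of the address space */
	int nentries;
	struct sk_entry e[SK_MAXENT];	/* kept sorted by start */
};

/* Insert [start, end) at index i, shifting later entries up. */
static void
sk_insert(struct sk_map *map, int i, unsigned long start, unsigned long end)
{
	int j;

	assert(map->nentries < SK_MAXENT);
	for (j = map->nentries; j > i; j--)
		map->e[j] = map->e[j - 1];
	map->e[i].start = start;
	map->e[i].end = end;
	map->nentries++;
}

/* Allocate length bytes at *addr, or after it when find_space is set. */
static int
sk_map_find(struct sk_map *map, unsigned long *addr, unsigned long length,
    int find_space)
{
	unsigned long start = *addr;
	int i;

	if (map->nentries == SK_MAXENT || length == 0)
		return (SK_NO_SPACE);
	if (start < map->min) {
		if (!find_space)
			return (SK_NO_SPACE);
		start = map->min;
	}
	for (i = 0; i < map->nentries; i++) {
		if (map->e[i].end <= start)
			continue;	/* entry lies wholly below start */
		if (start + length <= map->e[i].start)
			break;		/* fits in the gap before entry i */
		if (!find_space)
			return (SK_NO_SPACE);
		start = map->e[i].end;	/* skip past the collision */
	}
	if (start + length > map->max)
		return (SK_NO_SPACE);
	sk_insert(map, i, start, start + length);
	*addr = start;			/* report where the space was found */
	return (SK_SUCCESS);
}
```

With find_space zero the request succeeds only at the exact address given;
with find_space set, the caller reads the chosen address back through addr.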



int vm_map_remove(map, start, end);
(SUPPORTED)

	This routine deallocates the virtual space between start and end.
	All objects that are backing this space are deallocated as appropriate.
	This is sort-of inverse of vm_map_find above.  Always returns
	KERN_SUCCESS.

int vm_map_protect(map, start, end, new_prot, set_max);
(SUPPORTED)
	Changes the access permissions for a virtual address range
	in the specified map.  This routine makes all necessary modifications
	to the pmap associated with the map also.

int vm_map_pageable(map, start, end, new_pageable);
(SUPPORTED)
	Allows sections of a map to be wired or unwired into memory.

int vm_map_check_protection(map, start, end, protection);
(SUPPORTED)	
	Allows an address range to be checked for specified
	protection attributes.

int vm_map_lookup(map, addr, fault_type, out_entry, object, pindex,
	out_prot, wired, single_use);
(SUPPORTED)	
	This routine does more than its name implies.  It does return
	the map entry associated with the (map, addr) pair, but it also
	does much of the work necessary to create an object
	for the map entry (in much of the VM code, objects are created
	in a lazy fashion -- only when needed), and also performs much
	of the work for COW.  If the fault_type is a write fault, a
	new object might be created to support the local copy of
	a COW map entry (e.g. .data segment of an executable.)
	

vm_page_t vm_page_alloc(object, pindex, flags);
(SUPPORTED)
		flags -- VM_ALLOC_NORMAL	normal process allocation
		         VM_ALLOC_SYSTEM	preferential allocation
		         VM_ALLOC_INTERRUPT	allocate interrupt-safely
		         VM_ALLOC_ZERO		normal process with priority
		                     		to zero pages
	NON-BLOCKING.

	This is the lowest level page allocation routine.  A NULL is returned
	if the allocation cannot be currently satisfied.  The pages are
	returned to the user with the PG_BUSY bit set and are not on any
	queue.  After allocating the page, it is a good idea to issue
	a PAGE_WAKEUP(m) on the page, and at least wire the page.  vm_page_alloc
	has support for page coloring built-in so that the system will choose
	pages more selectively than the usual ad-hoc schemes previously used.

void vm_page_free(m);
(SUPPORTED)

	This is the lowest level page free routine.  This routine does NOT
	remove ANY mappings associated with the page.  Chaos will ensue if
	the page is not properly removed from all pmap's.  A normally used
	page can be removed from all pmap's by a

		vm_page_protect(m,VM_PROT_NONE);

	However, kernel mappings must be removed one-by-one, and must
	be manually tracked.


void vm_page_activate(m);
void vm_page_deactivate(m);
void vm_page_cache(m);
void vm_page_wire(m);
(SUPPORTED)

	NON-BLOCKING.

	These are the queue manipulation routines.  These are used to affect
	the policy of the paging and allocation system.  If a page is activated,
	it is not likely to be freed soon.  If it is deactivated, it is more
	likely to be reclaimed.  Cached pages are similar to freed pages, available
	for allocation, but still have their identity for quick reuse.
	If a page is not in one of the other states for a long time, it is
	best to wire it so the system can at least account for it.  A page
	that is wired is "hidden" from the pageout daemon.
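
A toy userland model of the queue transitions may help.  It is a deliberate
simplification, not the kernel implementation: the sk_ names and the
per-queue counters are assumptions, and wired pages are modelled simply as
being on no queue at all.

```c
#include <assert.h>

enum sk_queue {
	SK_Q_NONE,	/* on no queue (e.g. wired or freshly allocated) */
	SK_Q_FREE,
	SK_Q_CACHE,
	SK_Q_ACTIVE,
	SK_Q_INACTIVE,
	SK_Q_COUNT
};

struct sk_page {
	enum sk_queue queue;
	int wire_count;
};

static int sk_qlen[SK_Q_COUNT];	/* number of pages on each queue */

/* Move a page from whatever queue it is on to queue q. */
static void
sk_requeue(struct sk_page *m, enum sk_queue q)
{
	assert(q < SK_Q_COUNT);
	if (m->queue != SK_Q_NONE)
		sk_qlen[m->queue]--;
	m->queue = q;
	if (q != SK_Q_NONE)
		sk_qlen[q]++;
}

/* Wired pages stay off the queues, so the calls below ignore them. */
static void
sk_page_activate(struct sk_page *m)
{
	if (m->wire_count == 0)
		sk_requeue(m, SK_Q_ACTIVE);
}

static void
sk_page_deactivate(struct sk_page *m)
{
	if (m->wire_count == 0)
		sk_requeue(m, SK_Q_INACTIVE);
}

static void
sk_page_cache(struct sk_page *m)
{
	if (m->wire_count == 0)
		sk_requeue(m, SK_Q_CACHE);
}

/* The first wiring takes the page off its queue, hiding it from pageout. */
static void
sk_page_wire(struct sk_page *m)
{
	if (m->wire_count++ == 0)
		sk_requeue(m, SK_Q_NONE);
}
```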


vm_object_t vm_object_allocate(type, size);
		type -- OBJT_DEFAULT, default -- converts to swap
		        OBJT_VNODE, vnode object
		        OBJT_SWAP, swap object
		        OBJT_DEVICE, device object
	This is the routine that creates an object.  Users should normally
	use it only to create objects of type OBJT_DEFAULT.  Note that the
	size is in units of pages.

vm_object_t vm_pager_allocate(type, handle, size, prot, foff);
(SUPPORTED)
		type -- OBJT_DEFAULT, default -- converts to swap
		        OBJT_VNODE, vnode object
		        OBJT_SWAP, swap object
		        OBJT_DEVICE, device object
	This is the routine that creates an object and associates the
	object with a file.  If the object already exists, the reference
	count for the object will be incremented.  In the case of a
	vnode object, the handle is the vnode pointer, and the foff and prot
	are both ignored.  In the case of a swap object the handle is a
	unique 32bit number (probably address), and the foff and prot are
	both ignored.  The handle for a device object is likely the
	device vnode, the prot is the protection that the memory device
	can support, and the foff is the offset into the device.


vm_object_deallocate(object);
(SUPPORTED)

	This routine decrements the reference count for the object, potentially
	freeing it.
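
The reference counting behind object allocation and deallocation can be
sketched in userland as follows.  This is a model only: the sk_ names are
hypothetical, and the real routines also deal with pagers, shadow objects
and the pages the object holds.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_object {
	int ref_count;
	unsigned long size;	/* in pages */
};

static int sk_live_objects;	/* objects not yet freed */

/* Create an object holding one reference for the caller. */
static struct sk_object *
sk_object_allocate(unsigned long size)
{
	struct sk_object *obj;

	obj = malloc(sizeof(*obj));
	if (obj == NULL)
		return (NULL);
	obj->ref_count = 1;
	obj->size = size;
	sk_live_objects++;
	return (obj);
}

/* Take an extra reference, as happens when an existing object is reused. */
static void
sk_object_reference(struct sk_object *obj)
{
	obj->ref_count++;
}

/* Drop one reference; the last one frees the object. */
static void
sk_object_deallocate(struct sk_object *obj)
{
	assert(obj->ref_count > 0);
	if (--obj->ref_count == 0) {
		sk_live_objects--;
		free(obj);
	}
}
```

Every reference taken must be matched by exactly one deallocate call;
the object goes away only when the last reference is dropped.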

vm_page_protect(m, prot);
(SUPPORTED)

	Used to turn off permissions for pages mapped into processes.
	vm_page_protect(m, VM_PROT_READ) helps implement COW, and
	vm_page_protect(m, VM_PROT_NONE) is an important step in freeing pages.

vm_fault(map, vaddr, fault_type, change_wiring);
(SUPPORTED)

	Does the things necessary to bring a page into a process's
	address space.  The most common use of this routine is in the
	trap code to implement demand-paging.  Most normal driver
	or system use would be as follows:

	vm_fault(map, vaddr, VM_PROT_READ or (VM_PROT_READ|VM_PROT_WRITE), 0);



KMEM series of operations (meant to be used on kernel_map or submaps of
kernel_map), they always return page aligned addresses.

kva = kmem_alloc(map, size);
kva = kmem_alloc_pageable(map, size);
(SUPPORTED)

	kmem_alloc and kmem_alloc_pageable each allocate space from the
	kernel_map (or any of its submaps except kmem_map).  kmem_alloc
	allocates both kva space and memory, while kmem_alloc_pageable
	allocates only kva space.  If memory is being allocated (instead
	of just virtual space), you should generally use kmem_alloc.
	kmem_alloc_pageable does not do all of the correct things in all
	cases for the setup of the underlying kernel_object offset.  It is
	best to use kmem_alloc_pageable when you plug the pages directly
	into the kernel address space.
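
Since the kmem routines deal in whole pages, sizes and addresses end up
rounded to page boundaries.  A minimal sketch of that arithmetic, assuming a
4K power-of-two page size; the sk_ helpers are hypothetical stand-ins written
in the spirit of the kernel's trunc_page/round_page macros.

```c
#include <assert.h>

#define SK_PAGE_SIZE	4096UL			/* assumed page size */
#define SK_PAGE_MASK	(SK_PAGE_SIZE - 1)

/* Round an address or size down to a page boundary. */
static unsigned long
sk_trunc_page(unsigned long x)
{
	return (x & ~SK_PAGE_MASK);
}

/* Round an address or size up to a page boundary. */
static unsigned long
sk_round_page(unsigned long x)
{
	/* The mask trick only works for power-of-two page sizes. */
	assert((SK_PAGE_SIZE & SK_PAGE_MASK) == 0);
	return ((x + SK_PAGE_MASK) & ~SK_PAGE_MASK);
}

/* Number of whole pages needed to hold size bytes. */
static unsigned long
sk_pages_for(unsigned long size)
{
	return (sk_round_page(size) / SK_PAGE_SIZE);
}
```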

kmem_free(map,addr,size);
(SUPPORTED)

	Use kmem_free to give back the kernel address space as allocated
	by kmem_alloc or kmem_malloc.  Be careful to remove any mappings
	specifically created by pmap_enter before freeing the address range.
	It is especially important to be careful when using kmem_free after
	allocating kva space with kmem_alloc_pageable.

kva = kmem_malloc(map, size, waitflag);

	Use this special form of kmem_alloc for kmem_map or mb_map.  Except
	for current usage, it is best not to use kmem_malloc in new kernel
	extensions.  It is best to use malloc/free for things that you CAN
	use kmem_malloc for.

MALLOC/FREE (refer to /sys/sys/malloc.h for available types.)  These return
aligned memory, but not necessarily on page boundaries.

kva = malloc(size, type, flags);
(SUPPORTED)
	flags = M_NOWAIT (call like this from interrupt level.)
	      = M_KERNEL (preferential allocation of memory.)
	      = M_WAITOK (normal call for non-interrupt level.)

	malloc is callable using M_NOWAIT from both splbio and splimp
		interrupt levels.

(void) free(kva, type);
(SUPPORTED)

	The kva specified to free must be identical to the one returned
	by malloc.  The type likewise should be the same, otherwise malloc
	usage accounting will not work correctly (and the system will
	likely panic.)  The kernel malloc/free routines do not deal well
	with partial frees of malloced entities.  If that capability
	is needed, then the kmem_alloc/kmem_free routines would be
	better choices.
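
Why the type passed to free has to match the one passed to malloc can be
shown with a toy accounting model in userland.  This is an assumption-laden
sketch: the sk_ names, the size header in front of each block and the
per-type byte counters are all hypothetical, and the kernel keeps its
statistics differently.

```c
#include <assert.h>
#include <stdlib.h>

enum { SK_M_TEMP, SK_M_DEVBUF, SK_M_NTYPES };	/* two example types */

static long sk_inuse[SK_M_NTYPES];	/* bytes in use, per type */

struct sk_hdr {
	size_t size;	/* size of the block that follows */
};

/* Allocate size bytes and charge them to the given type. */
static void *
sk_malloc(size_t size, int type)
{
	struct sk_hdr *h;

	h = malloc(sizeof(*h) + size);
	if (h == NULL)
		return (NULL);
	h->size = size;
	sk_inuse[type] += (long)size;
	return (h + 1);
}

/* Free a block and credit its size back to the given type. */
static void
sk_free(void *kva, int type)
{
	struct sk_hdr *h;

	assert(kva != NULL);
	h = (struct sk_hdr *)kva - 1;
	sk_inuse[type] -= (long)h->size;
	free(h);
}
```

Freeing with the wrong type leaves one counter permanently too high and
drives another one negative, which is the kind of corrupted accounting the
text warns about.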

PMAP routines.  These routines are the lowest level defined interface to
the processor memory management hardware.  Provided the virtual addresses
have been set up correctly, pmap can be kernel_pmap, the current process's
pmap or, in some cases, another process's pmap.

void pmap_enter(pmap, va, pa, prot, wired);
(SUPPORTED)
	map a single page into the physical address space.

void pmap_remove(pmap, sva, eva);
(SUPPORTED)
	remove a range of pages from the physical address space.

pa = pmap_extract(pmap, va);
(SUPPORTED)
	get the physical address associated with the specified
	mapped page.
		
pa = pmap_kextract(va);
(SUPPORTED)
	same as pmap_extract, except that it is much more efficient and
	works only for the kernel_pmap (assuming the kernel space.)
		
va = pmap_map(va, startp, endp, prot);
(SUPPORTED)
	map a contiguous range of pages from physical address startp
	through endp at virtual address va.  The returned address
	points to the next address that can be used for mapping.
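
The "returned address points to the next usable address" convention can be
sketched in userland like this.  It is a model, not the machine-dependent
code: sk_kenter is a hypothetical stub that merely records each mapping, the
4K page size is assumed, and prot is left out.

```c
#include <assert.h>

#define SK_MAP_PAGE_SIZE	4096UL	/* assumed page size */
#define SK_MAXMAP		16

static unsigned long sk_va[SK_MAXMAP];	/* recorded virtual addresses */
static unsigned long sk_pa[SK_MAXMAP];	/* recorded physical addresses */
static int sk_nmapped;

/* Stub: record one va -> pa page mapping instead of touching hardware. */
static void
sk_kenter(unsigned long va, unsigned long pa)
{
	assert(sk_nmapped < SK_MAXMAP);
	sk_va[sk_nmapped] = va;
	sk_pa[sk_nmapped] = pa;
	sk_nmapped++;
}

/*
 * Map physical range [startp, endp) at va, one page at a time, and
 * return the first virtual address after the new mappings.
 */
static unsigned long
sk_pmap_map(unsigned long va, unsigned long startp, unsigned long endp)
{
	while (startp < endp) {
		sk_kenter(va, startp);
		va += SK_MAP_PAGE_SIZE;
		startp += SK_MAP_PAGE_SIZE;
	}
	return (va);
}
```

A caller that maps several ranges back to back can feed each return value
in as the va for the next call.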

pmap_protect(pmap, sva, eva, prot);
(SUPPORTED)
	Removes permissions from the pages in the specified range
	of this pmap.  It does NOT change the protections that other
	pmaps have on the same pages.

pmap_qenter(va, m, count);
pmap_qremove(va, count);
(PROBABLY SUPPORTED)
	pmap_qenter/pmap_qremove are used for fast kernel mappings
	of vm_page's allocated from the VM system.  The implied pmap
	is kernel_pmap, and must refer to va's that
	are >= VM_MIN_KERNEL_ADDRESS.  Usually one would use addresses
	that were returned by kmem_alloc_pageable.

	The second argument to pmap_qenter is a pointer to an array
	of pages.  This is used often in the buffer cache code for
	quick mapping of vm_page_t's.

pmap_kenter(va, pa);
pmap_kremove(va);
(PROBABLY SUPPORTED)
	pmap_kenter/pmap_kremove are used for fast kernel mappings.
	The implied pmap is kernel_pmap and must refer to va's that
	are >= VM_MIN_KERNEL_ADDRESS.  Usually one would use addresses
	that were returned by kmem_alloc_pageable.
		

pmap_growkernel(topaddr);
	This routine supports the creation of additional pagetable
	pages to encompass the address "topaddr".  Kind-of the
	equiv of sbrk for the kernel.  FreeBSD does not need to
	preallocate all of the needed kernel pagetables up-front
	because of this routine.

pmap_destroy(pmap);
	Decrements pmap ref-count, and if zero, destroys it.

pmap_reference(pmap);
	Increments pmap ref-count.

pmap_pinit(pmap)
	Creates a pmap.

pmap_object_init_pt(pmap, addr, object, pindex, size);
	Prefaults pages into a process's pmap.  If the pages are
	in memory, they are placed directly into the process's address
	space.  This is called at mmap time.

pmap_prefault(pmap, addra, entry, object);
	Prefaults pages into a process's pmap.  This only places
	pages that are in a region around the specified address.
	This is called at vm_fault time.

pmap_change_wiring(pmap, va, wired);
	This notates the page as being wired.  This DOES NOT
	actually wire the page.

pmap_copy(dst_pmap, src_pmap, dst_addr, len, src_addr);
	This is a routine that might be used to short-circuit
	faulting pages into an address space from another.  It
	is currently NOT used.

pmap_zero_page(dstpa);
(SUPPORTED)
	This is the routine that is used to zero a page for demand
	zero.

pmap_copy_page(srcpa,dstpa);
(SUPPORTED)
	This is the routine that is used to copy a page for COW.

pmap_pageable(pmap, sva, eva, pageable);
	This notates a range of pages as being pageable and is
	informational only.  It is currently NOT used.

pmap_page_protect(dstpa, prot);
	Decreases (and now increases) the protection for a given page.
	It is used to remove a page from all address spaces (for
	example, prior to being freed), or to write-protect (for example,
	for setting up an address space for COW.)  This routine
	should not normally be used, vm_page_protect is vastly
	superior.

The pte bit routines below are much more complicated than they
appear, because they have to check the pte's for each page in
every pmap in which the page is mapped.

pmap_is_referenced(srcpa);
(SUPPORTED)
	Senses the reference bit on a given page.

pmap_is_modified(srcpa);
(SUPPORTED)
	Senses the modified bit on a given page.

pmap_clear_modify(dstpa);
(SUPPORTED)
	Clears the modified bit for a given page.

pmap_clear_reference(dstpa);
(SUPPORTED)
	Clears the reference bit for a given page.


kva = pmap_mapdev(pa, size);
(SUPPORTED)
	Maps device memory into the kernel.  kva space is allocated, and
	the physical device is mapped directly into the kernel_pmap ptes.
	This allows full memory access to the device from the kernel.


Additional miscellaneous routines that are useful to kernel developers.
Refer to the source for details.  They most likely will be around for a
"long time."

	vmspace_alloc(min, max, pageable);
	vmspace_free(vm);
	vm_map_reference(map);
	vm_map_deallocate(map);
	vm_map_insert(map, object, offset, start, end, prot, max, cow);
	vm_map_findspace(map, start, length, addr);
	vm_map_lookup_entry(map, address, entry);
	vm_map_inherit(map, start, end, new_inheritance);
	vm_map_clean(map, start, end, syncio, invalidate);
------- End of forwarded message -------
-- 
Joel Ray Holveck - joelh@gnu.org - http://www.wp.com/piquan
   Fourth law of programming:
   Anything that can go wrong wi
sendmail: segmentation violation - core dumped

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?199807082043.PAA00578>
