Date: Wed, 8 Jul 1998 15:43:09 -0500 (CDT)
From: Joel Ray Holveck <joelh@gnu.org>
To: Stefan Eggers <seggers@semyam.dinoco.de>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: Someone working on swapoff?
Message-ID: <199807082043.PAA00578@detlev.UUCP>
In-Reply-To: <199807081928.VAA25030@semyam.dinoco.de>
References: <199807081607.LAA02079@detlev.UUCP> <199807081928.VAA25030@semyam.dinoco.de> <199804070553.AAA10297@dyson.iquest.net>
Remember that this doc was old when I got it in April.
Happy hacking,
joelh
------- Start of forwarded message -------
From: "John S. Dyson" <toor@dyson.iquest.net>
Subject: Re: swap-leak in 2.2.5 ?
To: toor@dyson.iquest.net (John S. Dyson)
Date: Tue, 7 Apr 1998 00:53:33 -0500 (EST)
Cc: joelh@gnu.org, toor@dyson.iquest.net
Message-Id: <199804070553.AAA10297@dyson.iquest.net>
> > A lot of what I'm asking is stuff that is fairly basic to the VM
> > system, and not really that related to monitoring. Have you a white
> > paper (or something that serves the same function) on FreeBSD VM?
> > (You said that you don't have something on monitoring, but if you can
> > point me to something else it may help.)
> >
> I'll append a doc that talks about entry points, etc. However, the
> MACH VM is probably the best intro to FreeBSD VM. They are similar,
> (but FreeBSD is more specific to U**X), and the terminology is similar.
>
Sorry about not appending the out-of-date roadmap doc!!! Here it is:
Here is a very rough description of the FreeBSD MACH-based VM system
internals... This document is not definitive, but meant as a quick reference
or overview. The source code is currently the ONLY definitive documentation.
If there is enough positive feedback from this document, I might be motivated
enough to fill this in with more detail. Routines or symbols that will
probably be supported forever in one way or another have a "SUPPORTED" notation.
Those routines that could be at risk sometime in the future have no such
notation.
Definitions:
Data Structure         Equivalent
vm_map ............... Address space
vm_map_entry ......... Portion of address space pointing to only
                       one vm_object or another vm_map
vm_object ............ Repository for data
vm_page .............. Indivisible amount of data
pmap ................. Physical representation of Address space
(Note that the names above, with "_t" appended refer to pointers
as opposed to the structure itself... e.g. vm_map_t is a pointer to a
struct vm_map.)
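
The "_t" convention can be illustrated with a stand-alone sketch. The
structure body here is a placeholder chosen for illustration (the real
struct vm_map has many more fields), and map_entry_count is a hypothetical
helper, not a kernel routine.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder structure for this sketch; the real definition is far
 * larger and lives in the kernel's VM headers. */
struct vm_map {
    int nentries;
};

/* The "_t" name is a pointer to the structure, not the structure itself. */
typedef struct vm_map *vm_map_t;

/* Hypothetical helper: takes the handle (a pointer) and reads through it. */
static int
map_entry_count(vm_map_t map)
{
    assert(map != NULL);
    return map->nentries;
}
```

So a routine documented as taking a vm_map_t is passed an address such as
&curproc->p_vmspace->vm_map, never a structure by value.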
Terminology            Equivalent
pager ................ A sort-of class that is described by
                       information in the vm_object and the type
                       of access to external data
vnode_pager .......... Code in the vm system that knows how to do
                       paging I/O with filesystem files.
swap_pager ........... Code in the vm system that knows how to do
                       paging I/O with swap partitions and files.
                       Anonymous data in the system (e.g. bss) is
                       paged using this.
default_pager ........ A logical "placeholder" pager that takes
                       few system resources until paging is needed.
                       The default_pager is currently used only for
                       vm_object's that might need to be paged using
                       the swap_pager. When pageouts are needed,
                       objects that are marked "default" are converted
                       to "swap" with the associated allocation of
                       swap data structures.
device_pager ......... Code in the vm system that can provide memory
                       mapped I/O with memory mapped devices. Most
                       common use of this is X-Windows.
kva .................. Kernel virtual address, usually of type
                       vm_offset_t or caddr_t.
sva,eva,va ........... Virtual address(es). sva - start virtual
                       address, eva - end virtual address.
pa ................... Physical address.
m,p .................. Usually used for vm_page_t.
offset ............... Offsets into objects are usually vm_ooffset_t,
                       which translates into a long long (equiv to
                       the filesystem off_t.)
wired ................ Not pageable.
clean ................ (As in "the page is clean"), means it is in
                       sync with the backing store.
page_coloring ........ Unless a VM system provides special support
                       for direct mapped caches, the system will
                       often allocate pages suboptimally for machines
                       with such caching schemes. The term
                       "page_coloring" as I use it, consists of various
                       ways that the system provides support for
                       improving utilization of system caches. FreeBSD
                       provides support for processor caches that helps
                       even the more sophisticated 4-way caching
                       schemes (as in a PPro.)
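
The general idea behind page coloring can be sketched in user-space C.
This is not FreeBSD's implementation: the page size, the cache size and
both function names are assumptions made for illustration. A page's
"color" is the cache region its physical address falls into, and an
allocator that hands out consecutive colors for consecutive pages of an
object keeps those pages from evicting each other.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u                 /* assumed 4 KB pages */
#define SKETCH_CACHE_SIZE (256u * 1024u)      /* assumed 256 KB direct-mapped cache */
#define SKETCH_NCOLORS    (SKETCH_CACHE_SIZE >> SKETCH_PAGE_SHIFT)

/* Two pages with the same color compete for the same cache lines. */
static unsigned
page_color(uint64_t pa)
{
    return (unsigned)((pa >> SKETCH_PAGE_SHIFT) % SKETCH_NCOLORS);
}

/* Color an allocator would prefer for page number pindex of an object,
 * so that consecutive pages of the object get consecutive colors. */
static unsigned
preferred_color(uint64_t pindex, unsigned object_base_color)
{
    assert(object_base_color < SKETCH_NCOLORS);
    return (unsigned)((pindex + object_base_color) % SKETCH_NCOLORS);
}
```

With these assumed sizes there are 64 colors, and physical addresses
exactly 256 KB apart map to the same color.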
Useful "handles" describing address spaces:
(warning -- in the most general case, you should make sure that
there is a "curproc"!!!). If your code is being called from
a system call or I/O initiation routine, you should be safe.
Current process address space (vm_map_t): (SUPPORTED)
&curproc->p_vmspace->vm_map
Current process pmap (pmap_t): (SUPPORTED)
&curproc->p_vmspace->vm_pmap
Kernel address space (vm_map_t): (SUPPORTED)
kernel_map
Kernel pmap (pmap_t): (SUPPORTED)
kernel_pmap, (best referred to as vm_map_pmap(kernel_map))
Less commonly used "handles":
Address spaces (submaps of kernel_map, unless noted otherwise):
never checked for modification: clean_map
buffer cache: buffer_map (submap of clean_map)
pager and cluster buffers: pager_map (submap of clean_map)
used for bounce buffers: io_map (submap of clean_map)
malloc and mbuf cluster area: kmem_map
mbuf clusters: mb_map (submap of kmem_map)
args during exec: exec_map
temporary mapping of exec hdr: exech_map
UPAGES per process: upage_map
Important macros:
pa = VM_PAGE_TO_PHYS(m); (SUPPORTED)
returns the physical address for a vm_page_t.
m = PHYS_TO_VM_PAGE(pa);
returns the vm_page_t associated with a physical address.
(try to avoid PHYS_TO_VM_PAGE -- it doesn't always work,
because not every physical address has a page, and it
usually implies a design flaw, or a quick work-around
that needs to be corrected in the future.)
PAGE_WAKEUP(m); (SUPPORTED)
This is used to free the lock on a page as represented
by the PG_BUSY bit. Other processes that are waiting
on that page are woken up. In order to wait on a page
the following could be done:
    s = splhigh();
    while ((m->flags & PG_BUSY) || m->busy) {
        m->flags |= PG_WANTED;
        tsleep(m, PVM, "xxxxxx", 0);
    }
    splx(s);
You would do that normally after a vm_page_lookup.
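
The handshake between a waiter and PAGE_WAKEUP can be modeled in
user-space C. This is a single-threaded sketch under stated assumptions,
not the kernel macro: the sk_ names and the bit values are hypothetical,
and the actual sleep and wakeup calls are replaced by return values.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PG_BUSY   0x01   /* hypothetical bit values for this sketch */
#define SK_PG_WANTED 0x02

struct sk_page {
    int flags;   /* SK_PG_* bits */
    int busy;    /* count of operations still in progress on the page */
};

/* Waiter side: one pass of the loop condition. Returns 1 if the caller
 * would have to sleep (and records that someone is waiting), else 0. */
static int
sk_page_must_wait(struct sk_page *m)
{
    assert(m != NULL);
    if ((m->flags & SK_PG_BUSY) || m->busy) {
        m->flags |= SK_PG_WANTED;
        return 1;
    }
    return 0;
}

/* Owner side: drop the busy bit. Returns 1 if a wakeup would be issued
 * because a waiter had marked the page as wanted. */
static int
sk_page_wakeup(struct sk_page *m)
{
    assert(m != NULL);
    m->flags &= ~SK_PG_BUSY;
    if (m->flags & SK_PG_WANTED) {
        m->flags &= ~SK_PG_WANTED;
        return 1;
    }
    return 0;
}
```

The waiter re-tests the condition after every wakeup, which is why the
original is written as a while loop rather than an if.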
VM_WAIT; (SUPPORTED)
Use this if you have tried to do a vm_page_alloc
in non-interrupt state, and vm_page_alloc did not return
a vm_page_t pointer (vm_page_alloc returns NULL on failure.)
VM_WAIT blocks your process and wakes up the pageout daemon.
When this returns, there will likely be some free memory, so
vm_page_alloc can be retried.
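
The allocate-or-wait pattern described for VM_WAIT can be sketched as a
retry loop. Everything here is a user-space stand-in: sk_page_alloc
models vm_page_alloc returning NULL, sk_vm_wait models VM_WAIT, and the
"daemon frees a page on the second wait" behavior is invented so the
example terminates.

```c
#include <assert.h>
#include <stddef.h>

struct sk_page { int unused; };

/* Hypothetical pool: empty until the "pageout daemon" has been woken
 * a couple of times. */
static struct sk_page sk_the_page;
static int sk_free_pages = 0;
static int sk_wait_calls = 0;

static struct sk_page *
sk_page_alloc(void)         /* models vm_page_alloc: NULL on failure */
{
    if (sk_free_pages == 0)
        return NULL;
    sk_free_pages--;
    return &sk_the_page;
}

static void
sk_vm_wait(void)            /* models VM_WAIT: block until memory may be free */
{
    sk_wait_calls++;
    if (sk_wait_calls >= 2)
        sk_free_pages = 1;  /* the daemon eventually frees something */
}

/* Retry loop; only valid in a context where sleeping is allowed. */
static struct sk_page *
sk_page_alloc_wait(void)
{
    struct sk_page *m;

    while ((m = sk_page_alloc()) == NULL)
        sk_vm_wait();
    assert(m != NULL);
    return m;
}
```

A loop is used rather than a single retry because the wait only makes it
likely, not certain, that the next allocation succeeds.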
Likely return values from most of the vm routines <vm/vm_param.h>:
KERN_SUCCESS, KERN_INVALID_ADDRESS, KERN_PROTECTION_FAILURE,
KERN_NO_SPACE, KERN_INVALID_ARGUMENT, KERN_FAILURE,
KERN_RESOURCE_SHORTAGE, KERN_NOT_RECEIVER?, KERN_NO_ACCESS?
Important X86 tidbit:
The kernel_pmap is always effectively mapped into the user's pmap.
When referring to kernel space, one should use the kernel_pmap, and
all processes will see the change in the kernel.
Memory queues:
vm_page_queue_free -- free pages
vm_page_queue_zero -- free pages that are zero
vm_page_queue_cache -- free pages that still have info
may NOT be BUSY or mapped.
vm_page_queue_active -- active pages
vm_page_queue_inactive -- inactive pages
Commonly needed VM system routines:
int vm_map_find(map, object, offset, addr, length, find_space,
prot, max, cow);
(SUPPORTED)
This finds AND allocates virtual space from the specified map
(Address space). The user can optionally specify a vm object
to map into the space (e.g. mapped file.)
The parameters associated with the address space include:
map -- The specific vm_map_t involved with the op
addr -- Ptr to the address in the vm_map
length -- Length of the mapping in bytes
Note that the address (addr) above is equivalent to
the address in a process or in the kernel. If the address
is >= VM_MIN_KERNEL_ADDRESS you MUST use kernel_map, and
not &curproc->p_vmspace->vm_map!!! Secondary note, unless
you *really* know what you are doing, do not do a vm_map_find
in the kernel map. Please use kmem_alloc instead.
If you specify an initial value for addr, and find_space
is zero, then the allocation request will succeed only if
there is enough virtual address space available at the
specified address.
The parameters associated with the vm object:
object -- Optional VM object -- if NULL, a default
pager object will be created as needed when
a fault happens thereby making the object
necessary (a container for the page.)
offset -- Offset into the object (long long, vm_ooffset_t)
Additional parameters modifying the operation of the routine:
find_space -- If there is no space at 'addr', space is
found after that place.
prot,max -- R/W permissions to address space:
VM_PROT_READ, VM_PROT_WRITE, VM_PROT_EXEC
cow -- Copy-on-write, original obj is NOT modified.
Error returns:
KERN_SUCCESS -- Operation completed
KERN_INVALID_ADDRESS -- Address specified is invalid
KERN_NO_SPACE -- No space in the map
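
The effect of find_space can be illustrated with a toy address map in
user-space C. This is a sketch of the semantics described above, not the
kernel's algorithm: the fixed-size sorted array, the sk_ names and the
return code values are all assumptions, and objects, protections and
copy-on-write are left out.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_SUCCESS = 0, SK_NO_SPACE = 3 };   /* values are arbitrary here */
#define SK_MAX_ENTRIES 8

struct sk_entry { unsigned long start, end; };   /* covers [start, end) */

struct sk_map {
    unsigned long min, max;                 /* bounds of the address space */
    struct sk_entry e[SK_MAX_ENTRIES];      /* sorted by start, no overlap */
    int n;
};

/* Returns 1 if [start, start+len) lies inside the map and overlaps nothing. */
static int
sk_range_free(const struct sk_map *map, unsigned long start, unsigned long len)
{
    int i;

    if (start < map->min || start > map->max || len > map->max - start)
        return 0;
    for (i = 0; i < map->n; i++)
        if (start < map->e[i].end && map->e[i].start < start + len)
            return 0;
    return 1;
}

static int
sk_map_find(struct sk_map *map, unsigned long *addr, unsigned long len,
    int find_space)
{
    unsigned long start = *addr;
    int i, pos;

    assert(len > 0);
    if (!sk_range_free(map, start, len)) {
        if (!find_space)
            return SK_NO_SPACE;         /* caller insisted on this address */
        if (start < map->min)
            start = map->min;
        /* First fit: try the end of each entry that lies past the hint. */
        for (i = 0; i < map->n && !sk_range_free(map, start, len); i++)
            if (map->e[i].end > start)
                start = map->e[i].end;
        if (!sk_range_free(map, start, len))
            return SK_NO_SPACE;
    }
    if (map->n == SK_MAX_ENTRIES)
        return SK_NO_SPACE;             /* limitation of the toy table */
    /* Insert the new entry, keeping the array sorted by start address. */
    for (pos = map->n; pos > 0 && map->e[pos - 1].start > start; pos--)
        map->e[pos] = map->e[pos - 1];
    map->e[pos].start = start;
    map->e[pos].end = start + len;
    map->n++;
    *addr = start;                      /* report where the range landed */
    return SK_SUCCESS;
}
```

As with the real routine, addr is passed by pointer so the caller learns
which address was actually chosen when find_space is set.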
int vm_map_remove(map, start, end);
(SUPPORTED)
This routine deallocates the virtual space between start and end.
All objects that are backing this space are deallocated as appropriate.
This is sort of the inverse of vm_map_find above. It always returns
KERN_SUCCESS.
int vm_map_protect(map, start, end, new_prot, set_max);
(SUPPORTED)
Changes the access permissions for a virtual address range
in the specified map. This routine makes all necessary modifications
to the pmap associated with the map also.
int vm_map_pageable(map, start, end, new_pageable);
(SUPPORTED)
Allows sections of a map to be wired or unwired into memory.
int vm_map_check_protection(map, start, end, protection);
(SUPPORTED)
Allows an address range to be checked for specified
protection attributes.
int vm_map_lookup(map, addr, fault_type, out_entry, object, pindex,
out_prot, wired, single_use);
(SUPPORTED)
This routine provides more functionality than
the name implies. The routine does return the map entry
associated with (map, addr) pair. But, vm_map_lookup also
does much of the work necessary to create an object
for the map entry (in much of the VM code, objects are created
in a lazy fashion -- only when needed), and also performs much
of the work for COW. If the fault_type is a write fault, a
new object might be created to support the local copy of
a COW map entry (e.g. .data segment of an executable.)
vm_page_t vm_page_alloc(object, pindex, flags);
(SUPPORTED)
flags -- VM_ALLOC_NORMAL normal process allocation
VM_ALLOC_SYSTEM preferential allocation
VM_ALLOC_INTERRUPT allocate interrupt-safely
VM_ALLOC_ZERO normal process with priority
to zero pages
NON-BLOCKING.
This is the lowest level page allocation routine. A NULL is returned
if the allocation cannot be currently satisfied. The pages are
returned to the user with the PG_BUSY bit set and are not on any
queue. After allocating the page, it is a good idea to issue
a PAGE_WAKEUP(m) on the page, and at least wire the page. vm_page_alloc
has support for page coloring built-in so that the system will choose
pages more selectively than the usual ad-hoc schemes previously used.
void vm_page_free(m);
(SUPPORTED)
This is the lowest level page free routine. This routine does NOT
remove ANY mappings associated with the page. Chaos will ensue if
the page is not properly removed from all pmap's. A normally used
page can be removed from all pmap's by a
vm_page_protect(m,VM_PROT_NONE);
However, kernel mappings must be removed one-by-one, and must
be manually tracked.
void vm_page_activate(m);
void vm_page_deactivate(m);
void vm_page_cache(m);
void vm_page_wire(m);
(SUPPORTED)
NON-BLOCKING.
These are the queue manipulation routines. These are used to affect
the policy of the paging and allocation system. If a page is activated,
it is not likely to be freed soon. If it is deactivated, it is more
likely to be reclaimed. Cached pages are similar to freed pages, available
for allocation, but still have their identity for quick reuse.
If a page is not in one of the other states for a long time, it is
best to wire it so the system can at least account for it. A page
that is wired is "hidden" from the pageout daemon.
vm_object_t vm_object_allocate(type, size);
type -- OBJT_DEFAULT, default -- converts to swap
OBJT_VNODE, vnode object
OBJT_SWAP, swap object
OBJT_DEVICE, device object
This is the routine that creates an object. It should normally
be used only to create objects of type OBJT_DEFAULT. Note that the
size is in units of pages.
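
Because the size is given in pages, a byte count has to be rounded up to
whole pages first. A minimal sketch, assuming 4 KB pages; the helper name
and the SK_ constants are hypothetical, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PAGE_SHIFT 12                          /* assumed 4 KB pages */
#define SK_PAGE_SIZE  ((size_t)1 << SK_PAGE_SHIFT)

/* Hypothetical helper: number of pages needed to hold nbytes, rounded up. */
static size_t
sk_bytes_to_pages(size_t nbytes)
{
    assert(nbytes <= (size_t)-1 - (SK_PAGE_SIZE - 1));   /* no overflow */
    return (nbytes + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
}
```

Under these assumptions a 4097-byte region needs two pages, so passing
the raw byte count as the size would create an object far larger than
intended.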
vm_object_t vm_pager_allocate(type, handle, size, prot, foff);
(SUPPORTED)
type -- OBJT_DEFAULT, default -- converts to swap
OBJT_VNODE, vnode object
OBJT_SWAP, swap object
OBJT_DEVICE, device object
This is the routine that creates an object and associates the
object with a file. If the object already exists, the reference
count for the object will be incremented. In the case of a
vnode object, the handle is the vnode pointer, and the foff and prot
are both ignored. In the case of a swap object the handle is a
unique 32bit number (probably address), and the foff and prot are
both ignored. The handle for a device object is likely the
device vnode, the prot is the protection that the memory device
can support, and the foff is the offset into the device.
vm_object_deallocate(object);
(SUPPORTED)
This routine decrements the reference count for the object, potentially
freeing it.
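
The reference-counting behavior can be sketched in user-space C. The
structure and the sk_ functions are simplified stand-ins invented for
illustration; the real object code also has to deal with pagers, pages
still resident in the object, and locking.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_object {
    int ref_count;
    size_t size;                    /* in pages */
};

static int sk_objects_freed = 0;    /* instrumentation for the sketch only */

static struct sk_object *
sk_object_allocate(size_t size)
{
    struct sk_object *obj = malloc(sizeof(*obj));

    if (obj != NULL) {
        obj->ref_count = 1;         /* the creator holds the first reference */
        obj->size = size;
    }
    return obj;
}

static void
sk_object_reference(struct sk_object *obj)
{
    assert(obj != NULL && obj->ref_count > 0);
    obj->ref_count++;
}

/* Drop one reference; the object is freed only when the last one goes. */
static void
sk_object_deallocate(struct sk_object *obj)
{
    assert(obj != NULL && obj->ref_count > 0);
    if (--obj->ref_count == 0) {
        free(obj);
        sk_objects_freed++;
    }
}
```

Every holder calls deallocate exactly once; the pointer must not be used
after the call, since that call may have been the one that freed it.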
vm_page_protect(m, prot);
(SUPPORTED)
Used to turn off permissions for pages mapped into processes.
vm_page_protect(m, VM_PROT_READ) helps implement COW, and
vm_page_protect(m, VM_PROT_NONE) is an important step in freeing pages.
vm_fault(map, vaddr, fault_type, change_wiring);
(SUPPORTED)
Does the things necessary to bring a page into a process's
address space. The most common use of this routine is in the
trap code to implement demand-paging. Most normal driver
or system use would be as follows:
vm_fault(map, vaddr, VM_PROT_READ or (VM_PROT_READ|VM_PROT_WRITE), 0);
KMEM series of operations (meant to be used on kernel_map or submaps of
kernel_map). They always return page-aligned addresses.
kva = kmem_alloc(map, size);
kva = kmem_alloc_pageable(map, size);
(SUPPORTED)
kmem_alloc and kmem_alloc_pageable each allocate space from the
kernel_map (or any of its submaps except kmem_map). kmem_alloc
allocates both kva space and memory, while kmem_alloc_pageable
allocates only kva space. If memory is being allocated (instead
of just virtual space), you should generally use kmem_alloc.
kmem_alloc_pageable does not do all of the correct things in all
cases for the setup of the underlying kernel_object offset. It is
best to use kmem_alloc_pageable when you plug the pages directly
into the kernel address space.
kmem_free(map,addr,size);
(SUPPORTED)
Use kmem_free to give back the kernel address space as allocated
by kmem_alloc or kmem_malloc. Be careful to remove any mappings
specifically created by pmap_enter before freeing the address range.
It is especially important to be careful when using kmem_free after
allocating kva space with kmem_alloc_pageable.
kva = kmem_malloc(map, size, waitflag);
Use this special form of kmem_alloc for kmem_map or mb_map. Except
for current usage, it is best not to use kmem_malloc in new kernel
extensions. It is best to use malloc/free for things that you CAN
use kmem_malloc for.
MALLOC/FREE (refer to /sys/sys/malloc.h for available types.) These return
aligned memory, but not necessarily on page boundaries.
kva = malloc(size, type, flags);
(SUPPORTED)
flags = M_NOWAIT (call like this from interrupt level.)
= M_KERNEL (preferential allocation of memory.)
= M_WAITOK (normal call for non-interrupt level.)
malloc is callable using M_NOWAIT from both splbio and splimp
interrupt levels.
(void) free(kva, type);
(SUPPORTED)
The kva specified to free must be identical to the one returned
by malloc. The type likewise should be the same, otherwise malloc
usage accounting will not work correctly (and the system will
likely panic.) The kernel malloc/free routines do not deal well
with partial frees of malloced entities. If that capability
is needed, then the kmem_alloc/kmem_free routines would be
better choices.
PMAP routines. These routines are the lowest level defined interface to
the processor memory management hardware. Given the virtual addresses
have been set up correctly, pmap can be kernel_pmap, the current process's
pmap or, in some cases, another process's pmap.
void pmap_enter(pmap, va, pa, prot, wired);
(SUPPORTED)
map a single page into the physical address space.
void pmap_remove(pmap, sva, eva);
(SUPPORTED)
remove a range of pages from the physical address space.
pa = pmap_extract(pmap, va);
(SUPPORTED)
get the physical address associated with the specified
mapped page.
pa = pmap_kextract(va);
(SUPPORTED)
same as pmap_extract, except that it is much more efficient and
works only for the kernel_pmap (assuming the kernel space.)
va = pmap_map(va, startp, endp, prot);
(SUPPORTED)
map a contiguous range of pages from physical address startp
through endp at virtual address va. The returned address
points to the next address that can be used for mapping.
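
The "returns the next usable address" contract can be modeled with a toy
page table. This is a user-space sketch, not the real pmap_map: the
array standing in for page table entries, the 4 KB page size, and the
treatment of endp as an exclusive bound are all assumptions.

```c
#include <assert.h>

#define SK_PAGE_SIZE 4096ul     /* assumed page size */
#define SK_NPTE      16

/* Toy page table: index is va / page size, value is the mapped pa. */
static unsigned long sk_pte[SK_NPTE];

/* Map physical [startp, endp) at va and return the first va past it. */
static unsigned long
sk_pmap_map(unsigned long va, unsigned long startp, unsigned long endp)
{
    assert(va % SK_PAGE_SIZE == 0 && startp % SK_PAGE_SIZE == 0);
    while (startp < endp) {
        assert(va / SK_PAGE_SIZE < SK_NPTE);
        sk_pte[va / SK_PAGE_SIZE] = startp;
        va += SK_PAGE_SIZE;
        startp += SK_PAGE_SIZE;
    }
    return va;
}
```

Feeding the return value back in as the va of the next call lets a
caller lay out several physical ranges back to back in virtual space.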
pmap_protect(pmap, sva, eva, prot);
(SUPPORTED)
Removes permissions from page protections on pages in the
specified range. It does NOT remove protections for other
pmaps on the pages.
pmap_qenter(va, m, count);
pmap_qremove(va, count);
(PROBABLY SUPPORTED)
pmap_qenter/pmap_qremove are used for fast kernel mappings
of vm_page's allocated from the VM system. The implied pmap
is kernel_pmap, and must refer to va's that
are >= VM_MIN_KERNEL_ADDRESS. Usually one would use addresses
that were returned by kmem_alloc_pageable.
The second argument to pmap_qenter is a pointer to an array
of pages. This is used often in the buffer cache code for
quick mapping of vm_page_t's.
pmap_kenter(va, pa);
pmap_kremove(va);
(PROBABLY SUPPORTED)
pmap_kenter/pmap_kremove are used for fast kernel mappings.
The implied pmap is kernel_pmap and must refer to va's that
are >= VM_MIN_KERNEL_ADDRESS. Usually one would use addresses
that were returned by kmem_alloc_pageable.
pmap_growkernel(topaddr);
This routine supports the creation of additional pagetable
pages to encompass the address "topaddr". Kind-of the
equiv of sbrk for the kernel. FreeBSD does not need to
preallocate all of the needed kernel pagetables up-front
because of this routine.
pmap_destroy(pmap);
Decrements pmap ref-count, and if zero, destroys it.
pmap_reference(pmap);
Increments pmap ref-count.
pmap_pinit(pmap)
Creates a pmap.
pmap_object_init_pt(pmap, addr, object, pindex, size);
Prefaults pages into a process's pmap. If the pages are
in memory, they are placed directly into the process's address
space. This is called at mmap time.
pmap_prefault(pmap, addra, entry, object);
Prefaults pages into a process's pmap. This only places
pages that are in a region around the specified address.
This is called at vm_fault time.
pmap_change_wiring(pmap, va, wired);
This notates the page as being wired. This DOES NOT
actually wire the page.
pmap_copy(dst_pmap, src_pmap, dst_addr, len, src_addr);
This is a routine that might be used to short-circuit
faulting pages into an address space from another. It
is currently NOT used.
pmap_zero_page(dstpa);
(SUPPORTED)
This is the routine that is used to zero a page for demand
zero.
pmap_copy_page(srcpa,dstpa);
(SUPPORTED)
This is the routine that is used to copy a page for COW.
pmap_pageable(pmap, sva, eva, pageable);
This notates a range of pages as being pageable and is
informational only. It is currently NOT used.
pmap_page_protect(dstpa, prot);
Decreases (and now increases) the protection for a given page.
It is used to remove a page from all address spaces (for
example, prior to being freed), or to write-protect (for example,
for setting up an address space for COW.) This routine
should not normally be used, vm_page_protect is vastly
superior.
The pte bit routines below are much more complicated than they
appear, because they have to check the pte's for each page in
every pmap that the page is mapped.
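
Why these routines cost more than a single bit test can be seen in a
small user-space model. The linked list of mappings is a stand-in for
whatever reverse-mapping structure the pmap layer keeps, and the sk_
names and bit values are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SK_PTE_MODIFIED   0x1u   /* hypothetical bit values */
#define SK_PTE_REFERENCED 0x2u

/* One record per (pmap, va) mapping of a single physical page. */
struct sk_mapping {
    unsigned pte;
    struct sk_mapping *next;
};

/* The page counts as modified if ANY mapping has the bit set. */
static int
sk_page_is_modified(const struct sk_mapping *list)
{
    for (; list != NULL; list = list->next)
        if (list->pte & SK_PTE_MODIFIED)
            return 1;
    return 0;
}

/* Clearing has to visit EVERY mapping, or the bit would reappear. */
static void
sk_page_clear_modify(struct sk_mapping *list)
{
    struct sk_mapping *m;

    for (m = list; m != NULL; m = m->next)
        m->pte &= ~SK_PTE_MODIFIED;
    assert(!sk_page_is_modified(list));
}
```

The cost therefore grows with the number of address spaces sharing the
page, which is worth remembering before calling these in a hot path.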
pmap_is_referenced(srcpa);
(SUPPORTED)
Senses the reference bit on a given page.
pmap_is_modified(srcpa);
(SUPPORTED)
Senses the modified bit on a given page.
pmap_clear_modify(dstpa);
(SUPPORTED)
Clears the modified bit for a given page.
pmap_clear_reference(dstpa);
(SUPPORTED)
Clears the reference bit for a given page.
kva = pmap_mapdev(pa, size);
(SUPPORTED)
Maps device memory into the kernel. kva space is allocated, and
the physical device is mapped directly into the kernel_pmap ptes.
This allows full memory access to the device from the kernel.
Additional miscellaneous routines that are useful to kernel developers;
refer to the source for details. They most likely will be around for a
"long time."
vmspace_alloc(min, max, pageable);
vmspace_free(vm);
vm_map_reference(map);
vm_map_deallocate(map);
vm_map_insert(map, object, offset, start, end, prot, max, cow);
vm_map_findspace(map, start, length, addr);
vm_map_lookup_entry(map, address, entry);
vm_map_inherit(map, start, end, new_inheritance);
vm_map_clean(map, start, end, syncio, invalidate);
------- End of forwarded message -------
--
Joel Ray Holveck - joelh@gnu.org - http://www.wp.com/piquan
Fourth law of programming:
Anything that can go wrong wi
sendmail: segmentation violation - core dumped
