From: John Baldwin
Date: Thu, 26 Feb 2009 15:59:22 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-stable@freebsd.org, svn-src-stable-7@freebsd.org
Subject: svn commit: r189075 - in stable/7: lib/libc lib/libc/string lib/libc/sys share/man/man9 sys sys/amd64/amd64 sys/amd64/include sys/arm/arm sys/arm/include sys/conf sys/contrib/pf sys/dev/ath/ath_hal...

Author: jhb
Date: Thu Feb 26 15:59:22 2009
New Revision: 189075
URL: http://svn.freebsd.org/changeset/base/189075

Log:
  MFC: Add support for "superpages" on amd64 and i386.
  This includes adding the superpage reservation system to the
  machine-independent VM system as well as changes to the pmap code for
  amd64 and i386 to support superpages.

  Reviewed by:	alc
  Tested by:	ps

Added:
  stable/7/sys/vm/vm_reserv.c
     - copied, changed from r174982, head/sys/vm/vm_reserv.c
  stable/7/sys/vm/vm_reserv.h
     - copied, changed from r174982, head/sys/vm/vm_reserv.h
Deleted:
  stable/7/sys/vm/vm_pageq.c
Modified:
  stable/7/lib/libc/ (props changed)
  stable/7/lib/libc/string/ffsll.c (props changed)
  stable/7/lib/libc/string/flsll.c (props changed)
  stable/7/lib/libc/sys/mincore.2
  stable/7/share/man/man9/ (props changed)
  stable/7/share/man/man9/vm_map_find.9
  stable/7/sys/ (props changed)
  stable/7/sys/amd64/amd64/pmap.c
  stable/7/sys/amd64/include/pmap.h
  stable/7/sys/amd64/include/vmparam.h
  stable/7/sys/arm/arm/pmap.c
  stable/7/sys/arm/include/vmparam.h
  stable/7/sys/conf/files
  stable/7/sys/conf/options
  stable/7/sys/contrib/pf/ (props changed)
  stable/7/sys/dev/ath/ath_hal/ (props changed)
  stable/7/sys/dev/cxgb/ (props changed)
  stable/7/sys/i386/i386/pmap.c
  stable/7/sys/i386/include/pmap.h
  stable/7/sys/i386/include/vmparam.h
  stable/7/sys/ia64/ia64/pmap.c
  stable/7/sys/ia64/include/vmparam.h
  stable/7/sys/kern/kern_exec.c
  stable/7/sys/kern/kern_malloc.c
  stable/7/sys/powerpc/include/vmparam.h
  stable/7/sys/powerpc/powerpc/pmap_dispatch.c
  stable/7/sys/sparc64/include/vmparam.h
  stable/7/sys/sparc64/sparc64/pmap.c
  stable/7/sys/sun4v/include/vmparam.h
  stable/7/sys/sun4v/sun4v/pmap.c
  stable/7/sys/sys/mman.h
  stable/7/sys/vm/device_pager.c
  stable/7/sys/vm/memguard.c
  stable/7/sys/vm/pmap.h
  stable/7/sys/vm/vm.h
  stable/7/sys/vm/vm_extern.h
  stable/7/sys/vm/vm_fault.c
  stable/7/sys/vm/vm_init.c
  stable/7/sys/vm/vm_kern.c
  stable/7/sys/vm/vm_map.c
  stable/7/sys/vm/vm_map.h
  stable/7/sys/vm/vm_mmap.c
  stable/7/sys/vm/vm_object.c
  stable/7/sys/vm/vm_object.h
  stable/7/sys/vm/vm_page.c
  stable/7/sys/vm/vm_page.h
  stable/7/sys/vm/vm_pageout.c
  stable/7/sys/vm/vm_phys.c
  stable/7/sys/vm/vm_phys.h
  stable/7/sys/vm/vnode_pager.c

Modified: stable/7/lib/libc/sys/mincore.2
==============================================================================
--- stable/7/lib/libc/sys/mincore.2	Thu Feb 26 15:51:54 2009	(r189074)
+++ stable/7/lib/libc/sys/mincore.2	Thu Feb 26 15:59:22 2009	(r189075)
@@ -72,6 +72,8 @@ Page has been modified by us.
 Page has been referenced.
 .It Dv MINCORE_MODIFIED_OTHER
 Page has been modified.
+.It Dv MINCORE_SUPER
+Page is part of a "super" page. (only i386 & amd64)
 .El
 .Pp
 The information returned by

Modified: stable/7/share/man/man9/vm_map_find.9
==============================================================================
--- stable/7/share/man/man9/vm_map_find.9	Thu Feb 26 15:51:54 2009	(r189074)
+++ stable/7/share/man/man9/vm_map_find.9	Thu Feb 26 15:59:22 2009	(r189075)
@@ -25,7 +25,7 @@
 .\"
 .\" $FreeBSD$
 .\"
-.Dd July 19, 2003
+.Dd May 10, 2008
 .Dt VM_MAP_FIND 9
 .Os
 .Sh NAME
@@ -38,7 +38,7 @@
 .Ft int
 .Fo vm_map_find
 .Fa "vm_map_t map" "vm_object_t object" "vm_ooffset_t offset"
-.Fa "vm_offset_t *addr" "vm_size_t length" "boolean_t find_space"
+.Fa "vm_offset_t *addr" "vm_size_t length" "int find_space"
 .Fa "vm_prot_t prot" "vm_prot_t max" "int cow"
 .Fc
 .Sh DESCRIPTION
@@ -70,11 +70,25 @@ by the caller before calling this functi
 .Pp
 If
 .Fa find_space
-is
-.Dv TRUE ,
+is either
+.Dv VMFS_ALIGNED_SPACE
+or
+.Dv VMFS_ANY_SPACE ,
 the function will call
 .Xr vm_map_findspace 9
 to discover a free region.
+Moreover, if
+.Fa find_space
+is
+.Dv VMFS_ALIGNED_SPACE ,
+the address of the free region will be optimized for the use of superpages.
+Otherwise, if
+.Fa find_space
+is
+.Dv VMFS_NO_SPACE ,
+.Xr vm_map_insert 9
+is called with the given address,
+.Fa addr .
 .Sh IMPLEMENTATION NOTES
 This function acquires a lock on
 .Fa map
@@ -90,9 +104,14 @@ The
 .Fn vm_map_find
 function returns
 .Dv KERN_SUCCESS
-if space for the mapping could be found and
-the mapping was successfully created.
-If space could not be found in the map, +if the mapping was successfully created. +If space could not be found or +.Fa find_space +was +.Dv VMFS_NO_SPACE +and the given address, +.Fa addr , +was already mapped, .Dv KERN_NO_SPACE will be returned. If the discovered range turned out to be bogus, Modified: stable/7/sys/amd64/amd64/pmap.c ============================================================================== --- stable/7/sys/amd64/amd64/pmap.c Thu Feb 26 15:51:54 2009 (r189074) +++ stable/7/sys/amd64/amd64/pmap.c Thu Feb 26 15:59:22 2009 (r189075) @@ -7,7 +7,7 @@ * All rights reserved. * Copyright (c) 2003 Peter Wemm * All rights reserved. - * Copyright (c) 2005 Alan L. Cox + * Copyright (c) 2005-2008 Alan L. Cox * All rights reserved. * * This code is derived from software contributed to Berkeley by @@ -107,10 +107,12 @@ __FBSDID("$FreeBSD$"); #include "opt_msgbuf.h" #include "opt_pmap.h" +#include "opt_vm.h" #include #include #include +#include #include #include #include @@ -134,6 +136,7 @@ __FBSDID("$FreeBSD$"); #include #include #include +#include #include #include @@ -149,11 +152,7 @@ __FBSDID("$FreeBSD$"); #define PMAP_SHPGPERPROC 200 #endif -#if defined(DIAGNOSTIC) -#define PMAP_DIAGNOSTIC -#endif - -#if !defined(PMAP_DIAGNOSTIC) +#if !defined(DIAGNOSTIC) #define PMAP_INLINE __gnu89_inline #else #define PMAP_INLINE @@ -166,6 +165,9 @@ __FBSDID("$FreeBSD$"); #define PV_STAT(x) do { } while (0) #endif +#define pa_index(pa) ((pa) >> PDRSHIFT) +#define pa_to_pvh(pa) (&pv_table[pa_index(pa)]) + struct pmap kernel_pmap_store; vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ @@ -176,6 +178,12 @@ static vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; pt_entry_t pg_nx; +SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters"); + +static int pg_ps_enabled; +SYSCTL_INT(_vm_pmap, OID_AUTO, pg_ps_enabled, CTLFLAG_RD, &pg_ps_enabled, 0, + "Are large page mappings enabled?"); + static u_int64_t KPTphys; 
/* phys addr of kernel level 1 */ static u_int64_t KPDphys; /* phys addr of kernel level 2 */ u_int64_t KPDPphys; /* phys addr of kernel level 3 */ @@ -188,6 +196,7 @@ static u_int64_t DMPDPphys; /* phys addr * Data for the pv entry allocation mechanism */ static int pv_entry_count = 0, pv_entry_max = 0, pv_entry_high_water = 0; +static struct md_page *pv_table; static int shpgperproc = PMAP_SHPGPERPROC; /* @@ -204,11 +213,29 @@ static caddr_t crashdumpmap; static void free_pv_entry(pmap_t pmap, pv_entry_t pv); static pv_entry_t get_pv_entry(pmap_t locked_pmap, int try); - +static void pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa); +static boolean_t pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa); +static void pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa); +static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va); +static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, + vm_offset_t va); + +static boolean_t pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va); +static boolean_t pmap_enter_pde(pmap_t pmap, vm_offset_t va, vm_page_t m, + vm_prot_t prot); static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, vm_page_t mpte); +static void pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte); +static boolean_t pmap_is_modified_pvh(struct md_page *pvh); +static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va); +static void pmap_promote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va); +static boolean_t pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, + vm_prot_t prot); +static int pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva, + vm_page_t *free); static int pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t sva, pd_entry_t ptepde, vm_page_t *free); +static void pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte); static void pmap_remove_page(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, vm_page_t 
*free); static void pmap_remove_entry(struct pmap *pmap, vm_page_t m, @@ -362,21 +389,6 @@ pmap_pte(pmap_t pmap, vm_offset_t va) } -static __inline pt_entry_t * -pmap_pte_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *ptepde) -{ - pd_entry_t *pde; - - pde = pmap_pde(pmap, va); - if (pde == NULL || (*pde & PG_V) == 0) - return NULL; - *ptepde = *pde; - if ((*pde & PG_PS) != 0) /* compat with i386 pmap_pte() */ - return ((pt_entry_t *)pde); - return (pmap_pde_to_pte(pde, va)); -} - - PMAP_INLINE pt_entry_t * vtopte(vm_offset_t va) { @@ -511,6 +523,7 @@ pmap_bootstrap(vm_paddr_t *firstaddr) */ PMAP_LOCK_INIT(kernel_pmap); kernel_pmap->pm_pml4 = (pdp_entry_t *) (KERNBASE + KPML4phys); + kernel_pmap->pm_root = NULL; kernel_pmap->pm_active = -1; /* don't allow deactivation */ TAILQ_INIT(&kernel_pmap->pm_pvchunk); @@ -609,6 +622,26 @@ pmap_page_init(vm_page_t m) void pmap_init(void) { + pd_entry_t *pd; + vm_page_t mpte; + vm_size_t s; + int i, pv_npg; + + /* + * Initialize the vm page array entries for the kernel pmap's + * page table pages. + */ + pd = pmap_pde(kernel_pmap, VM_MIN_KERNEL_ADDRESS); + for (i = 0; i < NKPT; i++) { + if ((pd[i] & (PG_PS | PG_V)) == (PG_PS | PG_V)) + continue; + mpte = PHYS_TO_VM_PAGE(pd[i] & PG_FRAME); + KASSERT(mpte >= vm_page_array && + mpte < &vm_page_array[vm_page_array_size], + ("pmap_init: page table page is out of range")); + mpte->pindex = pmap_pde_pindex(VM_MIN_KERNEL_ADDRESS) + i; + mpte->phys_addr = pd[i] & PG_FRAME; + } /* * Initialize the address space (zone) for the pv entries. Set a @@ -619,9 +652,28 @@ pmap_init(void) pv_entry_max = shpgperproc * maxproc + cnt.v_page_count; TUNABLE_INT_FETCH("vm.pmap.pv_entries", &pv_entry_max); pv_entry_high_water = 9 * (pv_entry_max / 10); + + /* + * Are large page mappings enabled? + */ + TUNABLE_INT_FETCH("vm.pmap.pg_ps_enabled", &pg_ps_enabled); + + /* + * Calculate the size of the pv head table for superpages. 
+ */ + for (i = 0; phys_avail[i + 1]; i += 2); + pv_npg = round_2mpage(phys_avail[(i - 2) + 1]) / NBPDR; + + /* + * Allocate memory for the pv head table for superpages. + */ + s = (vm_size_t)(pv_npg * sizeof(struct md_page)); + s = round_page(s); + pv_table = (struct md_page *)kmem_alloc(kernel_map, s); + for (i = 0; i < pv_npg; i++) + TAILQ_INIT(&pv_table[i].pv_list); } -SYSCTL_NODE(_vm, OID_AUTO, pmap, CTLFLAG_RD, 0, "VM/pmap parameters"); static int pmap_pventry_proc(SYSCTL_HANDLER_ARGS) { @@ -652,6 +704,25 @@ pmap_shpgperproc_proc(SYSCTL_HANDLER_ARG SYSCTL_PROC(_vm_pmap, OID_AUTO, shpgperproc, CTLTYPE_INT|CTLFLAG_RW, &shpgperproc, 0, pmap_shpgperproc_proc, "IU", "Page share factor per proc"); +SYSCTL_NODE(_vm_pmap, OID_AUTO, pde, CTLFLAG_RD, 0, + "2MB page mapping counters"); + +static u_long pmap_pde_demotions; +SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, demotions, CTLFLAG_RD, + &pmap_pde_demotions, 0, "2MB page demotions"); + +static u_long pmap_pde_mappings; +SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, mappings, CTLFLAG_RD, + &pmap_pde_mappings, 0, "2MB page mappings"); + +static u_long pmap_pde_p_failures; +SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, p_failures, CTLFLAG_RD, + &pmap_pde_p_failures, 0, "2MB page promotion failures"); + +static u_long pmap_pde_promotions; +SYSCTL_ULONG(_vm_pmap_pde, OID_AUTO, promotions, CTLFLAG_RD, + &pmap_pde_promotions, 0, "2MB page promotions"); + /*************************************************** * Low level helper routines..... 
@@ -953,17 +1024,25 @@ pmap_extract_and_hold(pmap_t pmap, vm_of vm_paddr_t pmap_kextract(vm_offset_t va) { - pd_entry_t *pde; + pd_entry_t pde; vm_paddr_t pa; if (va >= DMAP_MIN_ADDRESS && va < DMAP_MAX_ADDRESS) { pa = DMAP_TO_PHYS(va); } else { - pde = vtopde(va); - if (*pde & PG_PS) { - pa = (*pde & PG_PS_FRAME) | (va & PDRMASK); + pde = *vtopde(va); + if (pde & PG_PS) { + pa = (pde & PG_PS_FRAME) | (va & PDRMASK); } else { - pa = *vtopte(va); + /* + * Beware of a concurrent promotion that changes the + * PDE at this point! For example, vtopte() must not + * be used to access the PTE because it would use the + * new PDE. It is, however, safe to use the old PDE + * because the page table page is preserved by the + * promotion. + */ + pa = *pmap_pde_to_pte(&pde, va); pa = (pa & PG_FRAME) | (va & PAGE_MASK); } } @@ -1085,8 +1164,105 @@ pmap_free_zero_pages(vm_page_t free) while (free != NULL) { m = free; free = m->right; - vm_page_free_zero(m); + /* Preserve the page's PG_ZERO setting. */ + vm_page_free_toq(m); + } +} + +/* + * Schedule the specified unused page table page to be freed. Specifically, + * add the page to the specified list of pages that will be released to the + * physical memory manager after the TLB has been updated. + */ +static __inline void +pmap_add_delayed_free_list(vm_page_t m, vm_page_t *free, boolean_t set_PG_ZERO) +{ + + if (set_PG_ZERO) + m->flags |= PG_ZERO; + else + m->flags &= ~PG_ZERO; + m->right = *free; + *free = m; +} + +/* + * Inserts the specified page table page into the specified pmap's collection + * of idle page table pages. Each of a pmap's page table pages is responsible + * for mapping a distinct range of virtual addresses. The pmap's collection is + * ordered by this virtual address range. 
+ */ +static void +pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte) +{ + vm_page_t root; + + PMAP_LOCK_ASSERT(pmap, MA_OWNED); + root = pmap->pm_root; + if (root == NULL) { + mpte->left = NULL; + mpte->right = NULL; + } else { + root = vm_page_splay(mpte->pindex, root); + if (mpte->pindex < root->pindex) { + mpte->left = root->left; + mpte->right = root; + root->left = NULL; + } else if (mpte->pindex == root->pindex) + panic("pmap_insert_pt_page: pindex already inserted"); + else { + mpte->right = root->right; + mpte->left = root; + root->right = NULL; + } + } + pmap->pm_root = mpte; +} + +/* + * Looks for a page table page mapping the specified virtual address in the + * specified pmap's collection of idle page table pages. Returns NULL if there + * is no page table page corresponding to the specified virtual address. + */ +static vm_page_t +pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va) +{ + vm_page_t mpte; + vm_pindex_t pindex = pmap_pde_pindex(va); + + PMAP_LOCK_ASSERT(pmap, MA_OWNED); + if ((mpte = pmap->pm_root) != NULL && mpte->pindex != pindex) { + mpte = vm_page_splay(pindex, mpte); + if ((pmap->pm_root = mpte)->pindex != pindex) + mpte = NULL; } + return (mpte); +} + +/* + * Removes the specified page table page from the specified pmap's collection + * of idle page table pages. The specified page table page must be a member of + * the pmap's collection. 
+ */ +static void +pmap_remove_pt_page(pmap_t pmap, vm_page_t mpte) +{ + vm_page_t root; + + PMAP_LOCK_ASSERT(pmap, MA_OWNED); + if (mpte != pmap->pm_root) { + root = vm_page_splay(mpte->pindex, pmap->pm_root); + KASSERT(mpte == root, + ("pmap_remove_pt_page: mpte %p is missing from pmap %p", + mpte, pmap)); + } + if (mpte->left == NULL) + root = mpte->right; + else { + root = vm_page_splay(mpte->pindex, mpte->left); + root->right = mpte->right; + } + pmap->pm_root = root; } /* @@ -1165,8 +1341,7 @@ _pmap_unwire_pte_hold(pmap_t pmap, vm_of * Put page on a list so that it is released after * *ALL* TLB shootdown is done */ - m->right = *free; - *free = m; + pmap_add_delayed_free_list(m, free, TRUE); return 1; } @@ -1193,6 +1368,7 @@ pmap_pinit0(pmap_t pmap) PMAP_LOCK_INIT(pmap); pmap->pm_pml4 = (pml4_entry_t *)(KERNBASE + KPML4phys); + pmap->pm_root = NULL; pmap->pm_active = 0; TAILQ_INIT(&pmap->pm_pvchunk); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); @@ -1229,6 +1405,7 @@ pmap_pinit(pmap_t pmap) /* install self-referential address mapping entry(s) */ pmap->pm_pml4[PML4PML4I] = VM_PAGE_TO_PHYS(pml4pg) | PG_V | PG_RW | PG_A | PG_M; + pmap->pm_root = NULL; pmap->pm_active = 0; TAILQ_INIT(&pmap->pm_pvchunk); bzero(&pmap->pm_stats, sizeof pmap->pm_stats); @@ -1404,7 +1581,7 @@ pmap_allocpte(pmap_t pmap, vm_offset_t v { vm_pindex_t ptepindex; pd_entry_t *pd; - vm_page_t m, free; + vm_page_t m; KASSERT((flags & (M_NOWAIT | M_WAITOK)) == M_NOWAIT || (flags & (M_NOWAIT | M_WAITOK)) == M_WAITOK, @@ -1424,21 +1601,21 @@ retry: * This supports switching from a 2MB page to a * normal 4K page. 
*/ - if (pd != 0 && (*pd & (PG_PS | PG_V)) == (PG_PS | PG_V)) { - *pd = 0; - pd = 0; - pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE; - free = NULL; - pmap_unuse_pt(pmap, va, *pmap_pdpe(pmap, va), &free); - pmap_invalidate_all(kernel_pmap); - pmap_free_zero_pages(free); + if (pd != NULL && (*pd & (PG_PS | PG_V)) == (PG_PS | PG_V)) { + if (!pmap_demote_pde(pmap, pd, va)) { + /* + * Invalidation of the 2MB page mapping may have caused + * the deallocation of the underlying PD page. + */ + pd = NULL; + } } /* * If the page table page is mapped, we just increment the * hold count, and activate it. */ - if (pd != 0 && (*pd & PG_V) != 0) { + if (pd != NULL && (*pd & PG_V) != 0) { m = PHYS_TO_VM_PAGE(*pd & PG_FRAME); m->wire_count++; } else { @@ -1471,6 +1648,8 @@ pmap_release(pmap_t pmap) KASSERT(pmap->pm_stats.resident_count == 0, ("pmap_release: pmap resident count %ld != 0", pmap->pm_stats.resident_count)); + KASSERT(pmap->pm_root == NULL, + ("pmap_release: pmap has reserved page table page(s)")); m = PHYS_TO_VM_PAGE(pmap->pm_pml4[PML4PML4I] & PG_FRAME); @@ -1645,11 +1824,16 @@ SYSCTL_INT(_vm_pmap, OID_AUTO, pmap_coll * drastic measures to free some pages so we can allocate * another pv entry chunk. This is normally called to * unmap inactive pages, and if necessary, active pages. + * + * We do not, however, unmap 2mpages because subsequent accesses will + * allocate per-page pv entries until repromotion occurs, thereby + * exacerbating the shortage of free pv entries. 
*/ static void pmap_collect(pmap_t locked_pmap, struct vpgqueues *vpq) { - pd_entry_t ptepde; + struct md_page *pvh; + pd_entry_t *pde; pmap_t pmap; pt_entry_t *pte, tpte; pv_entry_t next_pv, pv; @@ -1668,28 +1852,27 @@ pmap_collect(pmap_t locked_pmap, struct else if (pmap != locked_pmap && !PMAP_TRYLOCK(pmap)) continue; pmap->pm_stats.resident_count--; - pte = pmap_pte_pde(pmap, va, &ptepde); - if (pte == NULL) { - panic("null pte in pmap_collect"); - } + pde = pmap_pde(pmap, va); + KASSERT((*pde & PG_PS) == 0, ("pmap_collect: found" + " a 2mpage in page %p's pv list", m)); + pte = pmap_pde_to_pte(pde, va); tpte = pte_load_clear(pte); KASSERT((tpte & PG_W) == 0, ("pmap_collect: wired pte %#lx", tpte)); if (tpte & PG_A) vm_page_flag_set(m, PG_REFERENCED); - if (tpte & PG_M) { - KASSERT((tpte & PG_RW), - ("pmap_collect: modified page not writable: va: %#lx, pte: %#lx", - va, tpte)); + if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); - } free = NULL; - pmap_unuse_pt(pmap, va, ptepde, &free); + pmap_unuse_pt(pmap, va, *pde, &free); pmap_invalidate_page(pmap, va); pmap_free_zero_pages(free); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); - if (TAILQ_EMPTY(&m->md.pv_list)) - vm_page_flag_clear(m, PG_WRITEABLE); + if (TAILQ_EMPTY(&m->md.pv_list)) { + pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); + if (TAILQ_EMPTY(&pvh->pv_list)) + vm_page_flag_clear(m, PG_WRITEABLE); + } free_pv_entry(pmap, pv); if (pmap != locked_pmap) PMAP_UNLOCK(pmap); @@ -1824,24 +2007,133 @@ retry: return (pv); } -static void -pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va) +/* + * First find and then remove the pv entry for the specified pmap and virtual + * address from the specified pv list. Returns the pv entry if found and NULL + * otherwise. This operation can be performed on pv lists for either 4KB or + * 2MB page mappings. 
+ */ +static __inline pv_entry_t +pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va) { pv_entry_t pv; - PMAP_LOCK_ASSERT(pmap, MA_OWNED); mtx_assert(&vm_page_queue_mtx, MA_OWNED); - TAILQ_FOREACH(pv, &m->md.pv_list, pv_list) { - if (pmap == PV_PMAP(pv) && va == pv->pv_va) + TAILQ_FOREACH(pv, &pvh->pv_list, pv_list) { + if (pmap == PV_PMAP(pv) && va == pv->pv_va) { + TAILQ_REMOVE(&pvh->pv_list, pv, pv_list); break; + } } - KASSERT(pv != NULL, ("pmap_remove_entry: pv not found")); - TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); - if (TAILQ_EMPTY(&m->md.pv_list)) - vm_page_flag_clear(m, PG_WRITEABLE); + return (pv); +} + +/* + * After demotion from a 2MB page mapping to 512 4KB page mappings, + * destroy the pv entry for the 2MB page mapping and reinstantiate the pv + * entries for each of the 4KB page mappings. + */ +static void +pmap_pv_demote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa) +{ + struct md_page *pvh; + pv_entry_t pv; + vm_offset_t va_last; + vm_page_t m; + + mtx_assert(&vm_page_queue_mtx, MA_OWNED); + KASSERT((pa & PDRMASK) == 0, + ("pmap_pv_demote_pde: pa is not 2mpage aligned")); + + /* + * Transfer the 2mpage's pv entry for this mapping to the first + * page's pv list. + */ + pvh = pa_to_pvh(pa); + va = trunc_2mpage(va); + pv = pmap_pvh_remove(pvh, pmap, va); + KASSERT(pv != NULL, ("pmap_pv_demote_pde: pv not found")); + m = PHYS_TO_VM_PAGE(pa); + TAILQ_INSERT_TAIL(&m->md.pv_list, pv, pv_list); + /* Instantiate the remaining NPTEPG - 1 pv entries. */ + va_last = va + NBPDR - PAGE_SIZE; + do { + m++; + KASSERT((m->flags & (PG_FICTITIOUS | PG_UNMANAGED)) == 0, + ("pmap_pv_demote_pde: page %p is not managed", m)); + va += PAGE_SIZE; + pmap_insert_entry(pmap, va, m); + } while (va < va_last); +} + +/* + * After promotion from 512 4KB page mappings to a single 2MB page mapping, + * replace the many pv entries for the 4KB page mappings by a single pv entry + * for the 2MB page mapping. 
+ */ +static void +pmap_pv_promote_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa) +{ + struct md_page *pvh; + pv_entry_t pv; + vm_offset_t va_last; + vm_page_t m; + + mtx_assert(&vm_page_queue_mtx, MA_OWNED); + KASSERT((pa & PDRMASK) == 0, + ("pmap_pv_promote_pde: pa is not 2mpage aligned")); + + /* + * Transfer the first page's pv entry for this mapping to the + * 2mpage's pv list. Aside from avoiding the cost of a call + * to get_pv_entry(), a transfer avoids the possibility that + * get_pv_entry() calls pmap_collect() and that pmap_collect() + * removes one of the mappings that is being promoted. + */ + m = PHYS_TO_VM_PAGE(pa); + va = trunc_2mpage(va); + pv = pmap_pvh_remove(&m->md, pmap, va); + KASSERT(pv != NULL, ("pmap_pv_promote_pde: pv not found")); + pvh = pa_to_pvh(pa); + TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list); + /* Free the remaining NPTEPG - 1 pv entries. */ + va_last = va + NBPDR - PAGE_SIZE; + do { + m++; + va += PAGE_SIZE; + pmap_pvh_free(&m->md, pmap, va); + } while (va < va_last); +} + +/* + * First find and then destroy the pv entry for the specified pmap and virtual + * address. This operation can be performed on pv lists for either 4KB or 2MB + * page mappings. + */ +static void +pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va) +{ + pv_entry_t pv; + + pv = pmap_pvh_remove(pvh, pmap, va); + KASSERT(pv != NULL, ("pmap_pvh_free: pv not found")); free_pv_entry(pmap, pv); } +static void +pmap_remove_entry(pmap_t pmap, vm_page_t m, vm_offset_t va) +{ + struct md_page *pvh; + + mtx_assert(&vm_page_queue_mtx, MA_OWNED); + pmap_pvh_free(&m->md, pmap, va); + if (TAILQ_EMPTY(&m->md.pv_list)) { + pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); + if (TAILQ_EMPTY(&pvh->pv_list)) + vm_page_flag_clear(m, PG_WRITEABLE); + } +} + /* * Create a pv entry for page at pa for * (pmap, va). @@ -1878,6 +2170,170 @@ pmap_try_insert_pv_entry(pmap_t pmap, vm } /* + * Create the pv entry for a 2MB page mapping. 
+ */ +static boolean_t +pmap_pv_insert_pde(pmap_t pmap, vm_offset_t va, vm_paddr_t pa) +{ + struct md_page *pvh; + pv_entry_t pv; + + mtx_assert(&vm_page_queue_mtx, MA_OWNED); + if (pv_entry_count < pv_entry_high_water && + (pv = get_pv_entry(pmap, TRUE)) != NULL) { + pv->pv_va = va; + pvh = pa_to_pvh(pa); + TAILQ_INSERT_TAIL(&pvh->pv_list, pv, pv_list); + return (TRUE); + } else + return (FALSE); +} + +/* + * Tries to demote a 2MB page mapping. + */ +static boolean_t +pmap_demote_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t va) +{ + pd_entry_t newpde, oldpde; + pt_entry_t *firstpte, newpte, *pte; + vm_paddr_t mptepa; + vm_page_t free, mpte; + + PMAP_LOCK_ASSERT(pmap, MA_OWNED); + mpte = pmap_lookup_pt_page(pmap, va); + if (mpte != NULL) + pmap_remove_pt_page(pmap, mpte); + else { + KASSERT((*pde & PG_W) == 0, + ("pmap_demote_pde: page table page for a wired mapping" + " is missing")); + free = NULL; + pmap_remove_pde(pmap, pde, trunc_2mpage(va), &free); + pmap_invalidate_page(pmap, trunc_2mpage(va)); + pmap_free_zero_pages(free); + CTR2(KTR_PMAP, "pmap_demote_pde: failure for va %#lx" + " in pmap %p", va, pmap); + return (FALSE); + } + mptepa = VM_PAGE_TO_PHYS(mpte); + firstpte = (pt_entry_t *)PHYS_TO_DMAP(mptepa); + oldpde = *pde; + newpde = mptepa | PG_M | PG_A | (oldpde & PG_U) | PG_RW | PG_V; + KASSERT((oldpde & (PG_A | PG_V)) == (PG_A | PG_V), + ("pmap_demote_pde: oldpde is missing PG_A and/or PG_V")); + KASSERT((oldpde & (PG_M | PG_RW)) != PG_RW, + ("pmap_demote_pde: oldpde is missing PG_M")); + KASSERT((oldpde & PG_PS) != 0, + ("pmap_demote_pde: oldpde is missing PG_PS")); + newpte = oldpde & ~PG_PS; + if ((newpte & PG_PDE_PAT) != 0) + newpte ^= PG_PDE_PAT | PG_PTE_PAT; + + /* + * If the mapping has changed attributes, update the page table + * entries. 
+ */ + KASSERT((*firstpte & PG_FRAME) == (newpte & PG_FRAME), + ("pmap_demote_pde: firstpte and newpte map different physical" + " addresses")); + if ((*firstpte & PG_PTE_PROMOTE) != (newpte & PG_PTE_PROMOTE)) + for (pte = firstpte; pte < firstpte + NPTEPG; pte++) { + *pte = newpte; + newpte += PAGE_SIZE; + } + + /* + * Demote the mapping. This pmap is locked. The old PDE has + * PG_A set. If the old PDE has PG_RW set, it also has PG_M + * set. Thus, there is no danger of a race with another + * processor changing the setting of PG_A and/or PG_M between + * the read above and the store below. + */ + pde_store(pde, newpde); + + /* + * Invalidate a stale mapping of the page table page. + */ + pmap_invalidate_page(pmap, (vm_offset_t)vtopte(va)); + + /* + * Demote the pv entry. This depends on the earlier demotion + * of the mapping. Specifically, the (re)creation of a per- + * page pv entry might trigger the execution of pmap_collect(), + * which might reclaim a newly (re)created per-page pv entry + * and destroy the associated mapping. In order to destroy + * the mapping, the PDE must have already changed from mapping + * the 2mpage to referencing the page table page. + */ + if ((oldpde & PG_MANAGED) != 0) + pmap_pv_demote_pde(pmap, va, oldpde & PG_PS_FRAME); + + pmap_pde_demotions++; + CTR2(KTR_PMAP, "pmap_demote_pde: success for va %#lx" + " in pmap %p", va, pmap); + return (TRUE); +} + +/* + * pmap_remove_pde: do the things to unmap a superpage in a process + */ +static int +pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva, + vm_page_t *free) +{ + struct md_page *pvh; + pd_entry_t oldpde; + vm_offset_t eva, va; + vm_page_t m, mpte; + + PMAP_LOCK_ASSERT(pmap, MA_OWNED); + KASSERT((sva & PDRMASK) == 0, + ("pmap_remove_pde: sva is not 2mpage aligned")); + oldpde = pte_load_clear(pdq); + if (oldpde & PG_W) + pmap->pm_stats.wired_count -= NBPDR / PAGE_SIZE; + + /* + * Machines that don't support invlpg, also don't support + * PG_G. 
+ */ + if (oldpde & PG_G) + pmap_invalidate_page(kernel_pmap, sva); + pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE; + if (oldpde & PG_MANAGED) { + pvh = pa_to_pvh(oldpde & PG_PS_FRAME); + pmap_pvh_free(pvh, pmap, sva); + eva = sva + NBPDR; + for (va = sva, m = PHYS_TO_VM_PAGE(oldpde & PG_PS_FRAME); + va < eva; va += PAGE_SIZE, m++) { + if ((oldpde & (PG_M | PG_RW)) == (PG_M | PG_RW)) + vm_page_dirty(m); + if (oldpde & PG_A) + vm_page_flag_set(m, PG_REFERENCED); + if (TAILQ_EMPTY(&m->md.pv_list) && + TAILQ_EMPTY(&pvh->pv_list)) + vm_page_flag_clear(m, PG_WRITEABLE); + } + } + if (pmap == kernel_pmap) { + if (!pmap_demote_pde(pmap, pdq, sva)) + panic("pmap_remove_pde: failed demotion"); + } else { + mpte = pmap_lookup_pt_page(pmap, sva); + if (mpte != NULL) { + pmap_remove_pt_page(pmap, mpte); + KASSERT(mpte->wire_count == NPTEPG, + ("pmap_remove_pde: pte page wire count error")); + mpte->wire_count = 0; + pmap_add_delayed_free_list(mpte, free, FALSE); + atomic_subtract_int(&cnt.v_wire_count, 1); + } + } + return (pmap_unuse_pt(pmap, sva, *pmap_pdpe(pmap, sva), free)); +} + +/* * pmap_remove_pte: do the things to unmap a page in a process */ static int @@ -1900,12 +2356,8 @@ pmap_remove_pte(pmap_t pmap, pt_entry_t pmap->pm_stats.resident_count -= 1; if (oldpte & PG_MANAGED) { m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME); - if (oldpte & PG_M) { - KASSERT((oldpte & PG_RW), - ("pmap_remove_pte: modified page not writable: va: %#lx, pte: %#lx", - va, oldpte)); + if ((oldpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); - } if (oldpte & PG_A) vm_page_flag_set(m, PG_REFERENCED); pmap_remove_entry(pmap, m, va); @@ -2013,11 +2465,24 @@ pmap_remove(pmap_t pmap, vm_offset_t sva * Check for large page. */ if ((ptpaddr & PG_PS) != 0) { - *pde = 0; - pmap->pm_stats.resident_count -= NBPDR / PAGE_SIZE; - pmap_unuse_pt(pmap, sva, *pdpe, &free); - anyvalid = 1; - continue; + /* + * Are we removing the entire large page? If not, + * demote the mapping and fall through. 
+ */ + if (sva + NBPDR == va_next && eva >= va_next) { + /* + * The TLB entry for a PG_G mapping is + * invalidated by pmap_remove_pde(). + */ + if ((ptpaddr & PG_G) == 0) + anyvalid = 1; + pmap_remove_pde(pmap, pde, sva, &free); + continue; + } else if (!pmap_demote_pde(pmap, pde, sva)) { + /* The large page mapping was destroyed. */ + continue; + } else + ptpaddr = *pde; } /* @@ -2067,30 +2532,34 @@ out: void pmap_remove_all(vm_page_t m) { + struct md_page *pvh; pv_entry_t pv; pmap_t pmap; pt_entry_t *pte, tpte; - pd_entry_t ptepde; + pd_entry_t *pde; + vm_offset_t va; vm_page_t free; -#if defined(PMAP_DIAGNOSTIC) - /* - * XXX This makes pmap_remove_all() illegal for non-managed pages! - */ - if (m->flags & PG_FICTITIOUS) { - panic("pmap_remove_all: illegal for unmanaged page, va: 0x%lx", - VM_PAGE_TO_PHYS(m)); - } -#endif + KASSERT((m->flags & PG_FICTITIOUS) == 0, + ("pmap_remove_all: page %p is fictitious", m)); mtx_assert(&vm_page_queue_mtx, MA_OWNED); + pvh = pa_to_pvh(VM_PAGE_TO_PHYS(m)); + while ((pv = TAILQ_FIRST(&pvh->pv_list)) != NULL) { + va = pv->pv_va; + pmap = PV_PMAP(pv); + PMAP_LOCK(pmap); + pde = pmap_pde(pmap, va); + (void)pmap_demote_pde(pmap, pde, va); + PMAP_UNLOCK(pmap); + } while ((pv = TAILQ_FIRST(&m->md.pv_list)) != NULL) { pmap = PV_PMAP(pv); PMAP_LOCK(pmap); pmap->pm_stats.resident_count--; - pte = pmap_pte_pde(pmap, pv->pv_va, &ptepde); - if (pte == NULL) { - panic("null pte in pmap_remove_all"); - } + pde = pmap_pde(pmap, pv->pv_va); + KASSERT((*pde & PG_PS) == 0, ("pmap_remove_all: found" + " a 2mpage in page %p's pv list", m)); + pte = pmap_pde_to_pte(pde, pv->pv_va); tpte = pte_load_clear(pte); if (tpte & PG_W) pmap->pm_stats.wired_count--; @@ -2100,14 +2569,10 @@ pmap_remove_all(vm_page_t m) /* * Update the vm_page_t clean and reference bits. 
*/ - if (tpte & PG_M) { - KASSERT((tpte & PG_RW), - ("pmap_remove_all: modified page not writable: va: %#lx, pte: %#lx", - pv->pv_va, tpte)); + if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) vm_page_dirty(m); - } free = NULL; - pmap_unuse_pt(pmap, pv->pv_va, ptepde, &free); + pmap_unuse_pt(pmap, pv->pv_va, *pde, &free); pmap_invalidate_page(pmap, pv->pv_va); pmap_free_zero_pages(free); TAILQ_REMOVE(&m->md.pv_list, pv, pv_list); @@ -2118,6 +2583,54 @@ pmap_remove_all(vm_page_t m) } /* + * pmap_protect_pde: do the things to protect a 2mpage in a process + */ +static boolean_t +pmap_protect_pde(pmap_t pmap, pd_entry_t *pde, vm_offset_t sva, vm_prot_t prot) *** DIFF OUTPUT TRUNCATED AT 1000 LINES ***
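
The amd64 pmap.c changes above rely on a small amount of 2MB address arithmetic: pa_index() picks the pv_table slot for a physical address, trunc_2mpage() and round_2mpage() align to superpage boundaries, and pmap_init() sizes pv_table as round_2mpage(end of physical memory) / NBPDR. The following is only a userland sketch of that arithmetic, not code from the commit. It assumes the amd64 values PDRSHIFT = 21 and 4KB base pages, writes the kernel macros as inline functions over uint64_t, defines NPTEPG as NBPDR / PAGE_SIZE for illustration, and introduces pv_head_slots() as a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed amd64 values: 4KB base pages, one PDE maps a 2MB superpage. */
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif
#define PDRSHIFT 21
#define NBPDR    ((uint64_t)1 << PDRSHIFT)   /* bytes mapped by one PDE */
#define PDRMASK  (NBPDR - 1)
#define NPTEPG   (NBPDR / PAGE_SIZE)         /* 4KB mappings per 2MB page */

/* Index of the 2MB frame containing pa, as in the diff's pa_index(). */
static inline uint64_t
pa_index(uint64_t pa)
{
	return (pa >> PDRSHIFT);
}

/* Round an address down to a 2MB boundary. */
static inline uint64_t
trunc_2mpage(uint64_t x)
{
	return (x & ~PDRMASK);
}

/* Round an address up to a 2MB boundary. */
static inline uint64_t
round_2mpage(uint64_t x)
{
	return ((x + PDRMASK) & ~PDRMASK);
}

/*
 * Hypothetical helper (not in the commit): number of pv head entries
 * needed to cover physical memory ending at phys_end, mirroring the
 * pv_npg computation in pmap_init().
 */
static inline uint64_t
pv_head_slots(uint64_t phys_end)
{
	return (round_2mpage(phys_end) / NBPDR);
}
```

Under these assumptions, 1GB of physical memory needs 512 pv head entries, and each demotion or promotion touches NPTEPG = 512 per-page pv entries, which is why pmap_pv_demote_pde() and pmap_pv_promote_pde() transfer one entry and then loop over the remaining NPTEPG - 1.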