Date:      Fri, 28 Jun 2013 00:32:02 +0200
From:      Oliver Pinter <>
To:        Chris Torek <>
Subject:   Re: expanding amd64 past the 1TB limit
Message-ID:  <>
In-Reply-To: <>
References:  <> <>

On 6/27/13, Chris Torek <> wrote:
> OK, I wasted :-) way too much time, but here's a text file that
> can be comment-ified or stored somewhere alongside the code or
> whatever...
> (While drawing this I realized that there's at least one "wasted"
> page if the machine has .5 TB or less: we can just leave zero
> slots in the corresponding L4 direct-map entries.  But that would
> require switching to the bcopy() method also mentioned below.  Or
> indexing into vmspace0.vm_pmap.pm_pml4, which is basically the
> same thing.)
> Chris
>     -----
> There are six -- or sometimes five -- sets of pages allocated here
> at boot time to map physical memory in two ways.  Note that each
> page, regardless of level, stores 512 PTEs (or PDEs or PDPs, but
> let's just use PTE here and prefix it with "level" as needed: 4,
> 3, 2, or 1.)
> There is one page for the top level, L4, page table entries.  Each
> L4 PTE maps 512 GB of space.  No L4 PTE can be marked "stop here":
> each one is either marked "this address is invalid" or points to
> one physically-addressed page full of L3 PTEs.  Eventually, those
> L3 PTEs will map-or-reject half a terabyte.  512 entries, each
> mapping .5 TB, allow us to map 256 TB, which is as much as the
> hardware supports (there are, in effect, only 48 virtual address
> bits: the top 16 bits must match bit 47, the highest of the 48).
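
A minimal sketch of the arithmetic in the paragraph above, for anyone who wants to check it. The names total_va_bytes() and va_is_canonical() are made up for illustration and are not kernel functions; "the 47th bit" is taken to mean bit 47 counting from zero.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPML4EPG  512ULL        /* entries in the single L4 page */
#define L4_SPAN   (1ULL << 39)  /* bytes covered by one L4 PTE: 512 GB */

/* Virtual space reachable through one L4 page: 512 * 512 GB = 256 TB. */
static uint64_t
total_va_bytes(void)
{
    return (NPML4EPG * L4_SPAN);
}

/*
 * With 48 implemented bits, an address is canonical only when
 * bits 63..47 are all zero or all one (the top 16 bits copy bit 47).
 */
static bool
va_is_canonical(uint64_t va)
{
    uint64_t top = va >> 47;    /* the 17 bits 63..47 */

    return (top == 0 || top == 0x1ffff);
}
```

The first address above the lower half, 0x0000.8000.0000.0000, fails this test, while 0xffff.8000.0000.0000 passes.
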
> The L4 entry halfway down, at PML4PML4I, is set to point back to
> this page itself; that's the "recursive page table" for user
> space, which we do nothing else with at boot time.
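
To make the slot numbering concrete, here is a hedged sketch that splits a virtual address into its per-level indices and computes the lowest address an L4 slot covers. pte_index() and l4_slot_base() are hypothetical helpers, not kernel code, and I am reading "halfway down" as index 256 (512/2), which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define NPTEPG      512u    /* entries per page-table page */
#define PAGE_SHIFT  12      /* one L1 PTE covers 4 KB */
#define PML4SHIFT   39      /* one L4 PTE covers 512 GB */

/* Index into the page of PTEs at the given level (4, 3, 2 or 1). */
static unsigned
pte_index(uint64_t va, int level)
{
    assert(level >= 1 && level <= 4);
    /* Each level consumes 9 address bits above the 12-bit page offset. */
    return ((unsigned)(va >> (PAGE_SHIFT + 9 * (level - 1))) & (NPTEPG - 1));
}

/* Lowest canonical virtual address covered by L4 slot idx. */
static uint64_t
l4_slot_base(unsigned idx)
{
    uint64_t va;

    assert(idx < NPTEPG);
    va = (uint64_t)idx << PML4SHIFT;
    if (idx >= NPTEPG / 2)              /* upper half: sign-extend bit 47 */
        va |= 0xffff000000000000ULL;
    return (va);
}
```

Under that assumption, the slot halfway down starts at 0xffff.8000.0000.0000, the first address of the upper half.
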
> We need (up to) NDMPML4E pages, each holding 512 L3 PTEs, for the
> direct map space.  If the processor supports 1 GB pages, an L3 PTE
> can be marked with "stop here" and these L3 PTEs each grant (or
> forbid) access to 1 GB of physical space at a time.  A system
> with, say, 3 GB of RAM starting at 0 can map it all with three L3
> PTEs: "address 0 is valid for 1GB", "address 1GB is valid for
> 1GB", "address 2GB is valid for 1GB".  The remaining L3 PTEs are
> zero, making the remaining address space invalid.
> If the processor does not support 1 GB pages, or if there is less
> than 1 GB of RAM "at the end" (e.g., if the system has 4.5 GB),
> the L3 PTEs may need to point to more pages holding L2 PTEs.
> These L2 PTEs always support 2 MB pages.  Each page of L2 PTEs
> maps 1 GB. So a machine with 4.5 GB and 1 GB mappings needs one L3
> page with four valid 1 GB L3 PTEs and then one L3 PTE pointing to
> one page of L2 PTEs.  That one page of L2 PTEs is half-filled,
> containing 256 2MB-sized PTEs, mapping the last 512 MB.  The remaining
> half of that page is zero, making the remaining addresses invalid.
> Pictorially, and adding the names of the physical page(s), thus
> far we have this.  (Note, the L4 PTE page is drawn more than twice
> as tall as the L3 and L2 pages, just to get space for arrows.)
>               LEVEL 4:                LEVEL 3:             LEVEL 2:
>                       _._
>           KPML4phys  v   \
>              +---------+  |
>              |  0:     |  |
>              |---------|  |
>              |  1:     |  |         DMPDPphys              DMPDphys
>              (   ...   )  |    .-> +---------+         +----------------+
>              | 255:    |  |   /    |  0: 0GB |     .-> |  0: 4GB        |
>              |---------|  |  |     |  1: 1GB |    /    |  1: 4GB+2MB    |
>   PML4PML4I: | 256: *--|--/  |     |  2: 2GB |   /     |  2: 4GB+4MB    |
>              |---------|     |     |  3: 3GB |  /      (      ...       )
>              | 257:    |     |     |  4:  *--|-/       | 255: 4.5GB-2MB |
>              |   ...   |     |     |  5:     |         | 256:           |
>   ________   |---------|     |     (   ...   )         | 257:           |
>  /  DMPML4I: |      *--|-----/     | 511:    |         (      ...       )
>  NDMPML4E    |---------|           +---------+         +----------------+
>  \________   |      *--|---------> |   0:    |
>              |---------|           |   1:    |
>              |         |           |   2:    |  (These are used only
>              |---------|           |   3:    |   if the system has more
>              |   ...   |           (   ...   )   than 512 GB)
>            ( |---------|      )    | 509:    |
>            ( | 510: see below )    | 510:    |
>            ( |---------|      )    | 511:    |
>            ( | 511: see below )    +---------+
>              +---------+
> If the hardware supports 1GB pages, "ndm1g" is the number of
> gigabyte entries (4 in the example above).  Otherwise it's just
> zero.  Meanwhile "ndmpdp" is the number of gigabytes of RAM that
> need to be mapped, in this case 5.  Thus, if ndmpdp > ndm1g, we
> need ndmpdp-ndm1g pages to hold some L2 PTEs.
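
The counting above can be sketched as follows. This is only an illustration under stated assumptions: RAM is contiguous from address 0, no minimum direct-map size is enforced, and struct dmap_plan and plan_direct_map() are hypothetical names (only ndmpdp and ndm1g come from the text).

```c
#include <stdbool.h>
#include <stdint.h>

#define GB   (1ULL << 30)   /* covered by one L3 PTE */
#define MB2  (1ULL << 21)   /* covered by one L2 PTE */

struct dmap_plan {
    uint64_t ndmpdp;    /* gigabytes to map, rounded up */
    uint64_t ndm1g;     /* L3 PTEs that map a whole 1 GB page */
    uint64_t nl2pages;  /* pages of L2 PTEs: ndmpdp - ndm1g */
    uint64_t nl2ptes;   /* valid 2 MB PTEs spread over those pages */
};

static struct dmap_plan
plan_direct_map(uint64_t ram_bytes, bool have_1g_pages)
{
    struct dmap_plan p;
    uint64_t rest;

    p.ndmpdp = (ram_bytes + GB - 1) / GB;
    p.ndm1g = have_1g_pages ? ram_bytes / GB : 0;
    p.nl2pages = p.ndmpdp - p.ndm1g;
    rest = ram_bytes - p.ndm1g * GB;        /* what 1 GB pages do not cover */
    p.nl2ptes = (rest + MB2 - 1) / MB2;
    return (p);
}
```

For the 4.5 GB example with 1 GB pages this gives ndmpdp = 5, ndm1g = 4, one page of L2 PTEs and 256 valid entries in it. Without 1 GB pages the same machine needs five pages of L2 PTEs.
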
> Now we get to the weirder case of the kernel itself (both its
> non-direct-mapped dynamically allocated virtual memory, and its
> text/data/bss).  The branch offset limitations encourage the
> placement of the kernel's text, etc., in the last 2 GB of virtual
> space, i.e., starting at 0xffff.ffff.8000.0000.  But, we want
> a reasonable amount of room for dynamic VM.  So we give the kernel
> at least 512 GB of VM -- that's one L4 PTE -- while making sure that
> the text snuggles up close to the end of the space, in that last 2 GB
> of the at-least-512-GB area.
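
A small sketch of where "the last 2 GB" begins, built from the page-table indices. kvaddr() is a hypothetical helper written for this mail, and KPDPI = 510 is my assumption for the L3 slot in which the last 2 GB start; only KPML4I = 511 is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define KPML4I  511u    /* from the text: the kernel uses the final L4 slot */
#define KPDPI   510u    /* assumption: L3 slot where the last 2 GB begin */

/* Build a canonical virtual address from its four page-table indices. */
static uint64_t
kvaddr(unsigned l4, unsigned l3, unsigned l2, unsigned l1)
{
    uint64_t va;

    assert(l4 < 512 && l3 < 512 && l2 < 512 && l1 < 512);
    va = ((uint64_t)l4 << 39) | ((uint64_t)l3 << 30) |
        ((uint64_t)l2 << 21) | ((uint64_t)l1 << 12);
    if (va & (1ULL << 47))              /* sign-extend bit 47 */
        va |= 0xffff000000000000ULL;
    return (va);
}
```

kvaddr(KPML4I, KPDPI, 0, 0) comes out as 0xffff.ffff.8000.0000, exactly 2 GB below the top of the address space, and kvaddr(KPML4I, 0, 0, 0) is the start of the final 512 GB slot.
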
> Meanwhile, the boot loader has loaded the kernel into relatively
> low physical memory addresses.
> If KPML4I is 511 (and it actually is), this uses the final L4 slot
> to map the kernel.  If we want to allow kernel VM to have more
> than 512 GB available, though, we need extra space below KPML4I,
> i.e., starting at KPML4BASE.  So we allocate NKPML4E pages that
> we set up as L3 PTEs, and point the last NKPML4E slots in the L4
> page table here.  If NKPML4E is 4, for instance, we will have
> this:
>   last part of KPML4phys:
>              (   ...   )    .----> [page #0 of all-zero L3 PTEs]
>              | DMPML4I |   /
>              (   ...   )   |  .--> [page #1 of all-zero L3 PTEs]
>              | 507:    |   | /
>              | 508: *--|--/  | .-> [page #2 of all-zero L3 PTEs]
>              | 509: *--|----/  |
>              | 510: *--|------/
>              | 511: *--|---------> [page #3 of L3 PTEs, see below]
>              +---------+
> The reason for having those "empty" (all-zero) PTE pages is that
> whenever new processes are created, in pmap_pinit(), they have
> their (new) L4 PTE page set up to point to the *same* physical
> pages that the kernel is using.  Thus, if the kernel creates or
> destroys any level-3-or-below mapping by writing into any of the
> above four pages, that mapping is also created/destroyed in all
> processes.  Similarly, the NDMPML4E pages starting at DMPDPphys are
> mapped identically in all processes.  The kernel can therefore
> "borrow" a user pmap at any time, i.e., there's no need to adjust
> the CPU's CR3 on entry to the kernel.
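
The sharing described above can be modelled in user space. This is a toy sketch, not the real pmap_pinit(): every toy_* name is hypothetical, pointers stand in for the physical addresses that real L4 PTEs hold, and the direct-map slots are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPML4EPG   512
#define NKPML4E    4                        /* example value from the text */
#define KPML4BASE  (NPML4EPG - NKPML4E)     /* first kernel L4 slot: 508 */

struct toy_l3_page {
    uint64_t pte[512];
};

/* A pointer stands in for the physical address an L4 PTE would hold. */
struct toy_pmap {
    struct toy_l3_page *pml4[NPML4EPG];
};

/* Zero-filled at start, like the all-zero L3 pages set up at boot. */
static struct toy_l3_page kernel_l3[NKPML4E];

static void
toy_kernel_pmap_init(struct toy_pmap *kpmap)
{
    for (int i = 0; i < NPML4EPG; i++)
        kpmap->pml4[i] = NULL;
    for (int i = 0; i < NKPML4E; i++)
        kpmap->pml4[KPML4BASE + i] = &kernel_l3[i];
}

/* New process: user slots start empty, kernel slots alias the same L3 pages. */
static void
toy_pmap_pinit(struct toy_pmap *pmap, const struct toy_pmap *kpmap)
{
    assert(pmap != kpmap);
    for (int i = 0; i < NPML4EPG; i++)
        pmap->pml4[i] = (i >= KPML4BASE) ? kpmap->pml4[i] : NULL;
}
```

After toy_pmap_pinit(), a write through kpmap.pml4[511]->pte[n] is visible through the process pmap as well, because both L4 slots refer to the same L3 page. That is the property the pre-allocated all-zero pages buy.
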
> (If we used bcopy() to copy the kernel pmap's NKPML4E and NDMPML4E
> entries into the new pmap, the L3 pages would not have to be
> physically contiguous, but the KVA ones would still all have to
> exist.  Allocating physically contiguous pages costs nothing here
> anyway, though.)
> So, the last NKPML4E slots in KPML4phys point to the following
> page tables, which use all of L3, L2, and L1 style PTEs.  (Note
> that we did not need any L1 PTEs for the direct map, which always
> uses 2MB or 1GB super-pages.)
>           LEVEL 3:         LEVEL 2:                 LEVEL 1:
>     (assuming NKPML4E=4)                            (nkpt pages)
>           KPDPphys                                      KPTphys
>         +---------+                               +---------------+
>  page 0 |  0:     |                           .-> |  0:      0 KB |
>         |  1:     |                          /    |  1:      4 KB |
>         |  2:     |                         /     |  2:      8 KB |
>         |  3:     |                        /      |  3:     12 KB |
>         (   ...   )                       |       (      ...      )
>         | 509:    |                       |       | 509: 2MB-12KB |
>         | 510:    |                       |       | 510: 2MB-8KB  |
>         | 511:    |                       |       | 511: 2MB-4KB  |
>         +---------+                       |       +---------------+
>  page 1 |  0:     |                       |   .-> |  0:      2 MB |
>         |  1:     |                       |  /    |  1:   2MB+4KB |
>         |  2:     |                       | |     (      ...      )
>         |  3:     |                       | |     (      ...      )
>         (   ...   )                       | |     +---------------+
>         | 509:    |                       | | .-> (      ...      )
>         | 510:    |                       | | |   (      ...      )
>         | 511:    |            KPDphys    | | |   +---------------+
>         +---------+          +---------+  | | | ..(  ... ... ...  )
>  page 2 |  0:     |    .---> |  0:  *--|--/ | | .       [etc]
>         |  1:     |   /      |  1:  *--|---/  | .
>         |  2:     |  |       |  2:  *--|-----/ .
>         |  3:     |  |       |  3:  *--|---....
>         (   ...   )  |       (   ...   )
>         | 509:    |  |       | 509: ...|
>         | 510:    |  |       | 510: ...|
>         | 511:    |  |       | 511: ...|
>         +---------+  |       +---------+
>  page 3 |  0:     |  |   .-> |  0:  ...|
>         |  1:     |  |  /    (   ...   )
>         |  2:     |  |  |    (   ...   )
>         |  3:     |  |  |    (   ...   )
>         (   ...   )  |  |    (   ...   )
>         | 509:    |  |  |    (   ...   )
>         | 510: *--|--/  |    (   ...   )
>         | 511: *--|----/     | 511:    |
>         +---------+          +---------+
> There are nkpdpe pages at KPDphys, where nkpdpe is either 1 or 2.
> One page maps 1 GB, and the other page maps the remaining 1 GB.
> Remember that kernel text+data+bss lives in the final 2 GB of the
> virtual address space, so there cannot be more than 2 GB.  These
> one or two pages map nkpt pages at KPTphys.
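
For the sizes involved: one page of L1 PTEs covers 2 MB and one page of L2 PTEs covers 1 GB, so the page counts follow by rounding up. The sketch below assumes the mapped range starts on a 1 GB boundary and that nkpt is simply "enough L1 pages for a given number of bytes"; how the kernel really picks nkpt is not stated above. plan_kernel_tables() and struct kpt_plan are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NPTEPG  512ULL
#define MB2     (1ULL << 21)    /* covered by one page of L1 PTEs */
#define GB      (1ULL << 30)    /* covered by one page of L2 PTEs */

struct kpt_plan {
    uint64_t nkpt;      /* pages of L1 PTEs at KPTphys */
    uint64_t nkpdpe;    /* pages of L2 PTEs at KPDphys: 1 or 2 */
};

static struct kpt_plan
plan_kernel_tables(uint64_t kva_bytes)
{
    struct kpt_plan p;

    /* Text+data+bss live in the final 2 GB, so never more than that. */
    assert(kva_bytes > 0 && kva_bytes <= 2 * GB);
    p.nkpt = (kva_bytes + MB2 - 1) / MB2;
    p.nkpdpe = (p.nkpt + NPTEPG - 1) / NPTEPG;
    return (p);
}
```

Mapping 64 MB this way takes 32 L1 pages and one L2 page. Anything above 1 GB spills into the second L2 page.
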

Added two VM gurus to CC.
