Date: Mon, 18 Aug 2014 17:35:52 -0500
From: Alan Cox <alan.l.cox@gmail.com>
To: Warner Losh <imp@bsdimp.com>
Cc: "freebsd-arch@freebsd.org" <arch@freebsd.org>, Peter Grehan <grehan@freebsd.org>
Subject: Re: superpages for UMA
Message-ID: <CAJUyCcM_4-jiJ5PqnmT6H-2qg63nEXmpZ69vGGb6SR0Trp8e0Q@mail.gmail.com>
In-Reply-To: <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com>
References: <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua> <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com> <53F25E60.5050109@freebsd.org> <257A0976-7C5E-4029-AF32-BFB3A6C60832@bsdimp.com>
On Mon, Aug 18, 2014 at 3:26 PM, Warner Losh <imp@bsdimp.com> wrote:
>
> On Aug 18, 2014, at 2:13 PM, Peter Grehan <grehan@freebsd.org> wrote:
>
> >> Newer Intel CPUs have more entries, and AMD CPUs have long (since
> >> Barcelona) had more. In particular, they allow 2 MB page mappings to be
> >> cached in a larger L2 TLB. Nowadays, the trouble is with the 1 GB pages.
> >> A lot of CPUs still only support an 8 entry, 1 level TLB for 1 GB pages.
> >
> > There are new(ish) ones effectively without 1GB pages. From the
> > "Software Optimization Guide for AMD Family 16h Processors"
> >
> > "Smashing"
> > ...
> > "when the Family 16h processor encounters a 1-Gbyte page size, it will
> > smash translations of that 1-Gbyte region into 2-Mbyte TLB entries, each
> > of which translates a 2-Mbyte region of the 1-Gbyte page."
>
> "we'll emulate this feature designed to make things go faster in hardware
> in software by doing the very thing that makes it go slow in hardware."
>
> Fun times. Performance Smashing!
>

I'm guessing that these are low-power processors, where they don't want to
have another CAM consuming power. Under those circumstances, it's still
better to support 1 GB page mappings in the page table even if the TLB
doesn't support them than not to support 1 GB page mappings at all. With
the hierarchical page tables on x86, you get a 512x reduction in page table
size with each increase in page size. So, on a TLB miss, the page table
walk is more likely to be all L2 data cache hits, rather than misses that
go all the way to DRAM.

One feature that I always liked about the AMD performance counters was that
they allowed you to count L2 cache misses caused by page table walks on a
TLB miss. This was often a better predictor of whether large pages were
going to be beneficial than counting TLB misses.
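[The 512x figure above follows directly from the x86-64 page table geometry: entries are 8 bytes, so one 4 KB table page holds 512 of them, and each step up in page size (4 KB -> 2 MB -> 1 GB) drops one level of the walk. A quick back-of-the-envelope sketch, not from the thread itself; `pte_bytes` is a hypothetical helper:]

```python
# Bytes of leaf page-table entries needed to map `region_bytes`
# of address space, assuming x86-64 geometry: 8-byte entries,
# 512 entries per 4 KB table page.
def pte_bytes(region_bytes, page_size, entry_size=8):
    return (region_bytes // page_size) * entry_size

GIB = 1 << 30
region = 64 * GIB

small = pte_bytes(region, 4 << 10)   # 4 KB pages: 128 MiB of PTEs
large = pte_bytes(region, 2 << 20)   # 2 MB pages: 256 KiB of PDEs
huge = pte_bytes(region, 1 << 30)    # 1 GB pages: 512 bytes of PDPTEs

# Each increase in page size shrinks the mapping structures 512x,
# which is why the walk for large pages tends to stay in the caches.
print(small // large, large // huge)
```

[256 KiB of 2 MB-page entries fits comfortably in an L2 data cache, while 128 MiB of 4 KB-page entries plainly does not, which is the point about walks hitting in L2 rather than going to DRAM.]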