Date:      Mon, 18 Aug 2014 14:48:46 -0500
From:      Alan Cox <alan.l.cox@gmail.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        arch@freebsd.org, Gleb Smirnoff <glebius@freebsd.org>, "Alexander V. Chernikov" <melifaro@freebsd.org>, "Andrey V. Elsukov" <ae@freebsd.org>
Subject:   Re: superpages for UMA
Message-ID:  <CAJUyCcM7ZipmYu8OLxT2TCPjS+CSTGPRnotdKgchoNQH8s8ndA@mail.gmail.com>
In-Reply-To: <20140818183925.GP2737@kib.kiev.ua>
References:  <53F215A9.8010708@FreeBSD.org> <20140818183925.GP2737@kib.kiev.ua>

On Mon, Aug 18, 2014 at 1:39 PM, Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Mon, Aug 18, 2014 at 07:03:05PM +0400, Alexander V. Chernikov wrote:
> > Hello list.
> >
> > Currently UMA(9) uses PAGE_SIZE kegs to store items in.
> > It seems fine for most usage scenarios; however, there are some where
> > a very large number of items is required.
> >
> > I've run into this problem while using ipfw tables (radix-based) with
> > ~50k records. This is what
> > `pmcstat -TS DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK -w1` looks like:
> > PMC: [DTLB_LOAD_MISSES.MISS_CAUSES_A_WALK] Samples: 2359 (100.0%) , 0
> > unresolved
> >
> > %SAMP IMAGE      FUNCTION             CALLERS
> >   28.7 kernel     rn_match             ipfw_lookup_table:21.7
> > rtalloc_fib_nolock:7.0
> >   25.5 ipfw.ko    ipfw_chk             ipfw_check_hook
> >    6.0 kernel     rn_lookup            ipfw_lookup_table
> >
> > Some numbers: a table entry occupies 128 bytes, so we may store no more
> > than 30 records in a single page-sized keg.
> > 50k records require more than 1500 kegs.
> > As far as I understand, the second-level TLB on a modern Intel CPU may
> > have 256 or 512 entries (for 4K pages), so touching that many pages
> > results in TLB misses happening constantly.
> >
> > Other examples:
> > Route tables (in the current implementation): struct rte occupies more
> > than 128 bytes, and storing a full view (> 500k routes) would result in
> > TLB misses happening all of the time.
> > Various kinds of stateful packet processing: a modern SLB/firewall can
> > have millions of states. Regardless of state size, PAGE_SIZE'd kegs are
> > not the best choice.
> >
> > All of these can be addressed:
> > Ipfw tables/ipfw dynamic state allocation code can (and will) be
> > rewritten to use uma + uma_zone_set_allocf (suggested by glebius), and
> > the radix lookup should simply be changed to a different algorithm (as
> > is happening in ipfw tables).
> >
> > However, we may consider adding another UMA flag to allocate
> > 2M/1G-sized kegs per request.
> > (Additionally, the Intel Haswell arch has 512 STLB entries, shared (?)
> > between 4K/2M, so it should help the former.)
> >
> > What do you think?
> >
> Zones with small object sizes use uma_small_alloc() to request a physical
> page and its KVA mapping. On amd64, uma_small_alloc() allocates a
> physical page and returns the direct-map address for the page. The
> direct map is built with large pages (2MB, or 1GB if available). In this
> sense, your allocations already use large pages for virtual memory
> translations.
>
> Zones are not local in the KVA, i.e. objects from the same zone are
> usually far apart in the KVA.  Zones do not get dedicated submaps to
> contain the zone-owned pages.
>
> Note that the large-page TLB is usually relatively small.  E.g., on my
> Nehalem machine, it has only 32 entries which can hold 2MB pages,
> which results in 64MB of cached address-space translations in
> the best case.  You might try reducing the available memory to
> see increased locality and a better DTLB hit ratio, if your load
> can survive with a smaller memory size.
>


Newer Intel CPUs have more entries, and AMD CPUs have long (since
Barcelona) had more.  In particular, they allow 2 MB page mappings to be
cached in a larger L2 TLB.  Nowadays, the trouble is with the 1 GB pages.
A lot of CPUs still support only an 8-entry, single-level TLB for 1 GB pages.
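
For concreteness, here is a quick back-of-the-envelope calculation of the
coverage numbers being discussed.  This is just a user-space sketch; the
entry size, record count, items-per-page and TLB sizes are the figures
quoted in this thread, and real TLB geometry varies by microarchitecture:

/*
 * Back-of-the-envelope TLB-coverage numbers for the workload in the
 * quoted message.
 */
#include <stdio.h>

int
main(void)
{
	const long entry_size = 128;	/* bytes per table entry */
	const long nentries = 50000;	/* records in the ipfw table */
	const long per_page = 30;	/* entries per 4K slab, after overhead */
	const long wset = entry_size * nentries;

	printf("working set:        %ld KB\n", wset / 1024);

	/* 4 KB pages: ~1667 slabs, but a 512-entry DTLB covers only 2 MB. */
	printf("4K slabs needed:    %ld\n", (nentries + per_page - 1) / per_page);
	printf("512 x 4K coverage:  %ld KB\n", 512L * 4);

	/*
	 * 2 MB pages: the whole table fits in a handful of superpages; even
	 * a 32-entry large-page TLB (the Nehalem case above) covers 64 MB.
	 */
	printf("2M pages needed:    %ld\n", (wset + (2L << 20) - 1) / (2L << 20));
	printf("32 x 2M coverage:   %ld MB\n", 32L * 2);

	return (0);
}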

It might make sense to increase the largest size used by the buddy
allocator in vm_phys.c to 1 GB.  Then, the VM_FREEPOOL_DIRECT mechanism
might help.  Back in the days when Opteron TLBs had only eight 2MB entries,
I wrote the following in the commit message for r170477:

"The twist is that this allocator tries to reduce the number of TLB
misses incurred by accesses through a direct map to small, UMA-managed
objects and page table pages.  Roughly speaking, the physical pages
that are allocated for such purposes are clustered together in the
physical address space.  The performance benefits vary.  In the most
extreme case, a uniprocessor kernel running on an Opteron, I measured
an 18% reduction in system time during a buildworld."
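
For reference, here is a minimal sketch of what the uma_zone_set_allocf()
route mentioned in the quoted message could look like, written against the
FreeBSD 10-era uma(9) and contigmalloc(9) interfaces.  The malloc type,
zone name, and function names below are made-up examples, not existing
code, and the sketch only illustrates the hook mechanics: without a keg
that actually requests multi-megabyte slabs (which is what the proposed
flag would provide), asking for 2 MB alignment by itself does not give you
superpage mappings.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <vm/uma.h>

static MALLOC_DEFINE(M_TBLCHUNK, "tblchunk", "example contiguous slab backing");

/*
 * Custom slab allocator: request physically contiguous, 2 MB-aligned
 * backing.  Note that UMA calls this with the keg's slab size (one page
 * for a 128-byte item), not with a size of our choosing.
 */
static void *
tbl_slab_alloc(uma_zone_t zone, int bytes, uint8_t *pflag, int wait)
{

	*pflag = UMA_SLAB_PRIV;		/* pages come from a private allocator */
	return (contigmalloc(bytes, M_TBLCHUNK, wait, (vm_paddr_t)0,
	    ~(vm_paddr_t)0, 2 * 1024 * 1024, 0));
}

static void
tbl_slab_free(void *mem, int size, uint8_t pflag)
{

	contigfree(mem, size, M_TBLCHUNK);
}

static uma_zone_t tbl_zone;

/* Hook the allocator up at zone creation time, e.g. from MOD_LOAD. */
static void
tbl_zone_init(void)
{

	tbl_zone = uma_zcreate("example_tbl", 128, NULL, NULL, NULL, NULL,
	    UMA_ALIGN_PTR, 0);
	uma_zone_set_allocf(tbl_zone, tbl_slab_alloc);
	uma_zone_set_freef(tbl_zone, tbl_slab_free);
}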


