Date: Thu, 30 Sep 2010 10:15:07 -0700
From: Matthew Fleming <mdf356@gmail.com>
To: Andre Oppermann <andre@freebsd.org>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness
Message-ID: <AANLkTi=2NXtVN=UyucJbbE4tzYk4DHw+31Mbkj9ZtULQ@mail.gmail.com>
In-Reply-To: <4CA4BCD2.4070303@freebsd.org>
References: <4CA4BCD2.4070303@freebsd.org>
On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann <andre@freebsd.org> wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach, if I read the history correctly) in
> the FreeBSD VM system, suspecting an interesting journey.
>
> The VM system has two major structures:
>  1) the VM map, which is per process and manages its address space
>     by tracking all regions that are allocated and their permissions.
>  2) the VM page table, which is per VM map and provides the backing
>     memory pages to the regions and interfaces with the pmap layer
>     (virtual->physical).
>
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability are crucial for many operations, beginning with
> fork/exec, going through malloc/mmap, and ending with free/exit of a
> process.
>
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long-to-traverse linked
> lists or more memory-expensive hash tables.  Some structures, though,
> do have an additional linked list to simplify ordered traversals.
>
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time, the
> *amortized* bound being the crucial difference from other binary
> trees.  On every access, *including* lookup, it rotates the tree to
> make the just-found element the new root node.  For all gory details
> see:
>  http://en.wikipedia.org/wiki/Splay_tree
>
> This behavior has a few important implications:
>  1) the root node and the constant rotation work like a cache, with
>     the most recently looked-up node at the top and less recently
>     used ones close to the top;
>  2) every lookup that doesn't match the root node, i.e. a cache miss,
>     causes a rotation of the tree to make the newly found node the
>     new root;
>  3) during the rotation it has to re-arrange and touch a possibly
>     large number of other nodes;
>  4) in the worst case the tree may resemble a linked list.  A splay
>     tree makes no guarantee that it is balanced.
>
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balances out if the majority of
>     lookups are root node (cache) hits; otherwise the cost of
>     rebalancing destroys any advantage and quickly turns into a large
>     overhead.
>  b) the overhead becomes even worse due to touching many nodes and
>     causing a lot of CPU cache line dirtying.  For processes with
>     shared memory or threads across CPUs this overhead becomes almost
>     excessive because the CPUs stomp on each other's cached nodes and
>     get their own caches invalidated.  The penalty is a high-latency
>     memory refetch, not only for the root node lookup but for every
>     hop in the following rebalancing operation => thrashing.
>
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of the
> splay tree in the vmmap and page table.
>
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
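Interjecting briefly before the numbers below: the rotate-on-lookup
behavior described above is easy to demonstrate with the generic SPLAY
macros from sys/tree.h (the VM has its own open-coded splay routines,
but they behave the same way).  A minimal userland sketch, with a
made-up node type rather than the real vm_page code:

#include <sys/tree.h>

struct node {
        SPLAY_ENTRY(node)       link;
        int                     key;
};

static int
node_cmp(struct node *a, struct node *b)
{
        return (a->key < b->key ? -1 : a->key > b->key);
}

SPLAY_HEAD(node_tree, node) tree = SPLAY_INITIALIZER(&tree);
SPLAY_PROTOTYPE(node_tree, node, link, node_cmp);
SPLAY_GENERATE(node_tree, node, link, node_cmp);

/*
 * Even a pure lookup splays the found node to the root, dirtying
 * every node on the search path, so it cannot run under a shared
 * (read) lock.
 */
struct node *
lookup(int key)
{
        struct node k = { .key = key };
        struct node *n;

        n = SPLAY_FIND(node_tree, &tree, &k);
        /* On a hit, n is now the root: SPLAY_ROOT(&tree) == n. */
        return (n);
}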
>
> The results are very interesting:
>
>  The page table shows a *very* poor root node hit rate and an
>  excessive amount of rotational node modifications, representing
>  almost the worst case for splay trees.
>
>  http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
>
>  http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
> From these observations I come to the conclusion that a splay tree is
> suboptimal for the normal case and quickly becomes horrible even on
> small deviations from the normal case in the vmmap.  For the page
> table it is always horrible.  Not to mention the locking
> complications due to modifying the tree on lookups.
>
> Based on an examination of other binary trees I put forward the
> hypothesis that a red-black tree is a much better, if not the
> optimal, fit for the vmmap and page table.  RB trees are balanced
> binary trees and O(log n) in all cases.  The big advantage in this
> context is that lookups are pure reads and do not cause CPU cache
> invalidations on other CPUs, and they only ever require a read lock,
> without the worst-case properties of the unbalanced splay tree.  The
> high cache locality of the vmmap lookups can be had with the RB tree
> as well by simply adding a pointer to the most recently found node.
> To prevent write locking this can be done lazily.  More profiling is
> required to make a non-speculative statement on this, though.  In
> addition, a few of the additional linked lists that currently
> accompany the vmmap and page structures are no longer necessary, as
> they can easily be done with standard RB tree accessors.  Our
> standard userspace jemalloc also uses RB trees for its internal
> housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
>
> I say hypothesis because I haven't measured the difference to an RB
> tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to use
> RB trees; the vmmap part is not stable yet.  The page part seems to
> work fine, though.
>
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>
> The diffs are still in their early stages and do not make use of any
> code simplifications becoming possible by using RB trees instead of
> the current splay trees.
>
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.

Do you have actual performance numbers from a real workload?

I implemented an AA tree locally (basically a 2-3 tree that uses
coloring to represent the 3-nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA code.
I suspect that the overhead of an RB tree would similarly wash out the
benefits of O(log_2 n) lookup time, but this is theory.  The facts
would come from measuring several different workloads and looking at
the system/real times for them under the two solutions.
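That said, if an RB tree does go in, the cached-pointer trick Andre
mentions (keeping a pointer to the most recently found entry, updated
lazily) is cheap to layer on top of the read-only lookups.  A rough
sketch with the RB macros from sys/tree.h; the structure and field
names here are invented, not taken from Andre's patch:

#include <sys/tree.h>

struct entry {
        RB_ENTRY(entry)         link;
        unsigned long           start;  /* lookup key */
};

static int
entry_cmp(struct entry *a, struct entry *b)
{
        if (a->start < b->start)
                return (-1);
        return (a->start > b->start);
}

RB_HEAD(entry_tree, entry);
RB_PROTOTYPE(entry_tree, entry, link, entry_cmp);
RB_GENERATE(entry_tree, entry, link, entry_cmp);

struct map {
        struct entry_tree        tree;
        struct entry            *hint;  /* last entry found, advisory only */
};

static struct entry *
map_lookup(struct map *m, unsigned long start)
{
        struct entry key = { .start = start };
        struct entry *e;

        e = m->hint;
        if (e != NULL && e->start == start)
                return (e);     /* hit: no tree walk, no node dirtied */
        e = RB_FIND(entry_tree, &m->tree, &key);
        if (e != NULL)
                m->hint = e;    /* lazy update; a stale hint only costs a walk */
        return (e);
}

The hint store is the only mutation on the lookup path, so correctness
never depends on it; in the kernel the update and the lifetime of the
pointed-to entry would of course still need care under a shared lock.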
We've talked internally at $work about using a B+-tree with maybe a
branching factor of 5-7, whatever makes sense for the size of a cache
line.  This seems likely to be better for performance than an RB tree,
but it requires a lot more changes, since separate memory is needed for
the tree's nodes outside the vm_page structure.  There just hasn't been
time to implement it and try it out.

Unfortunately I won't have the time to experiment on this problem at
$work for a few months.

Thanks,
matthew
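P.S. To put some arithmetic behind "whatever makes sense for the size
of a cache line": with 64-bit keys and 64-bit child pointers, a single
64-byte line holds only about 3 keys plus 4 children, so a branching
factor of 5-7 would mean either a two-line node or narrower keys.
Purely illustrative sketch of what a one-line interior node could look
like (names and numbers made up, nothing here is from a real patch):

#include <stdint.h>

#define BT_FANOUT       4       /* children per interior node */

/*
 * Interior node sized to exactly one 64-byte cache line on amd64:
 * 8 (count + padding) + 3*8 (keys) + 4*8 (child pointers) = 64.
 * Leaves would carry vm_page pointers instead of child pointers.
 */
struct btnode {
        uint16_t         bt_nkeys;                      /* valid keys, 0..3 */
        uint64_t         bt_keys[BT_FANOUT - 1];        /* separator keys */
        struct btnode   *bt_child[BT_FANOUT];           /* children */
};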