Date: Thu, 30 Sep 2010 10:15:07 -0700
From: Matthew Fleming <mdf356@gmail.com>
To: Andre Oppermann <andre@freebsd.org>
Cc: freebsd-hackers <freebsd-hackers@freebsd.org>, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness
Message-ID: <AANLkTi=2NXtVN=UyucJbbE4tzYk4DHw+31Mbkj9ZtULQ@mail.gmail.com>
In-Reply-To: <4CA4BCD2.4070303@freebsd.org>
References: <4CA4BCD2.4070303@freebsd.org>
On Thu, Sep 30, 2010 at 9:37 AM, Andre Oppermann <andre@freebsd.org> wrote:
> Just for the kick of it I decided to take a closer look at the use of
> splay trees (inherited from Mach, if I read the history correctly) in
> the FreeBSD VM system, suspecting an interesting journey.
>
> The VM system has two major structures:
>  1) the VM map, which is per process and manages its address space
>     by tracking all regions that are allocated and their permissions.
>  2) the VM page table, which is per VM map and provides the backing
>     memory pages to the regions and interfaces with the pmap layer
>     (virtual->physical).
>
> Both of these are very frequently accessed through memory allocations
> from userspace and page faults from the pmap layer.  Their efficiency
> and SMP scalability are crucial for many operations, beginning with
> fork/exec, going through malloc/mmap, and ending with free/exit of a
> process.
>
> Both the vmmap and page table make use of splay trees to manage the
> entries and to speed up lookups compared to long-to-traverse linked
> lists or more memory-expensive hash tables.  Some structures, though,
> do have an additional linked list to simplify ordered traversals.
>
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time, the
> *amortized* bound being the crucial difference from other binary
> trees.  On every access, *including* lookup, it rotates the tree to
> make the just-found element the new root node.  For all gory details
> see:
>  http://en.wikipedia.org/wiki/Splay_tree
>
> This behavior has a few important implications:
>  1) the root node and the constant rotation work like a cache, with
>     the most recently looked-up node at the top and less recently
>     used ones close to the top;
>  2) every lookup that doesn't match the root node, i.e. a cache miss,
>     causes a rotation of the tree to make the newly found node the
>     new root;
>  3) during the rotation it has to re-arrange and touch a possibly
>     large number of other nodes;
>  4) in the worst case the tree may resemble a linked list.  A splay
>     tree makes no guarantee that it is balanced.
>
> For the kernel and the splay tree usage in the VM map and page table
> some more implications come up:
>  a) the amortization effect/cost only balances out if the majority of
>     lookups are root node (cache) hits; otherwise the cost of
>     rebalancing destroys any advantage and quickly turns into a large
>     overhead.
>  b) the overhead becomes even worse due to touching many nodes and
>     causing a lot of CPU cache line dirtying.  For processes with
>     shared memory or threads across CPUs this overhead becomes almost
>     excessive because the CPUs stomp on each other's cached nodes and
>     get their own caches invalidated.  The penalty is a high-latency
>     memory refetch, not only for the root node lookup but for every
>     hop in the following rebalancing operation => thrashing.
>
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of the
> splay tree in the vmmap and page table.
>
> The test case is an AMD64 kernel build to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
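Interjecting briefly before the numbers below: the rotate-on-lookup
behavior described above is easy to demonstrate with the generic SPLAY
macros from sys/tree.h (the VM has its own open-coded splay routines,
but they behave the same way).  A minimal userland sketch, with a
made-up node type rather than the real vm_page code:

#include <sys/tree.h>

struct node {
        SPLAY_ENTRY(node)       link;
        int                     key;
};

static int
node_cmp(struct node *a, struct node *b)
{
        return (a->key < b->key ? -1 : a->key > b->key);
}

SPLAY_HEAD(node_tree, node) tree = SPLAY_INITIALIZER(&tree);
SPLAY_PROTOTYPE(node_tree, node, link, node_cmp);
SPLAY_GENERATE(node_tree, node, link, node_cmp);

/*
 * Even a pure lookup splays the found node to the root, dirtying
 * every node on the search path, so it cannot run under a shared
 * (read) lock.
 */
struct node *
lookup(int key)
{
        struct node k = { .key = key };
        struct node *n;

        n = SPLAY_FIND(node_tree, &tree, &k);
        /* On a hit, n is now the root: SPLAY_ROOT(&tree) == n. */
        return (n);
}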
>
> The results are very interesting:
>
>  The page table shows a *very* poor root node hit rate and an
>  excessive amount of rotational node modifications, representing
>  almost the worst case for splay trees.
>
>  http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
>  The vmmap shows a high root node hit rate but still a significant
>  amount of rotational node modifications.
>
>  http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
>
> From these observations I come to the conclusion that a splay tree is
> suboptimal for the normal case and quickly becomes horrible even on
> small deviations from the normal case in the vmmap.  For the page
> table it is always horrible.  Not to mention the locking
> complications due to modifying the tree on lookups.
>
> Based on an examination of other binary trees I put forward the
> hypothesis that a red-black tree is a much better, if not the
> optimal, fit for the vmmap and page table.  RB trees are balanced
> binary trees and O(log n) in all cases.  The big advantage in this
> context is that lookups are pure reads and do not cause CPU cache
> invalidations on other CPUs, and they only ever require a read lock,
> without the worst-case properties of the unbalanced splay tree.  The
> high cache locality of the vmmap lookups can be had with the RB tree
> as well by simply adding a pointer to the most recently found node.
> To prevent write locking this can be done lazily.  More profiling is
> required to make a non-speculative statement on this, though.  In
> addition, a few of the additional linked lists that currently
> accompany the vmmap and page structures are no longer necessary, as
> they can easily be done with standard RB tree accessors.  Our
> standard userspace jemalloc also uses RB trees for its internal
> housekeeping.  RB tree details:
>  http://en.wikipedia.org/wiki/Red-black_tree
>
> I say hypothesis because I haven't measured the difference to an RB
> tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to use
> RB trees; the vmmap part is not stable yet.  The page part seems to
> work fine, though.
>
> This is what I've hacked together so far:
>  http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
>  http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
>
> The diffs are still in their early stages and do not make use of any
> code simplifications becoming possible by using RB trees instead of
> the current splay trees.
>
> Comments on the VM issue and splay vs. RB tree hypothesis welcome.

Do you have actual performance numbers from a real workload?

I implemented an AA tree locally (basically a 2-3 tree that uses
coloring to represent the 3-nodes) and the performance was much worse
than the splay tree.  I assume this is due to the overhead of keeping
the tree balanced; I didn't instrument either the splay or the AA code.
I suspect that the overhead of an RB tree would similarly wash out the
benefits of O(log_2 n) lookup time, but this is theory.  The facts
would come from measuring several different workloads and looking at
the system/real times for them under the two solutions.
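That said, if an RB tree does go in, the cached-pointer trick Andre
mentions (keeping a pointer to the most recently found entry, updated
lazily) is cheap to layer on top of the read-only lookups.  A rough
sketch with the RB macros from sys/tree.h; the structure and field
names here are invented, not taken from Andre's patch:

#include <sys/tree.h>

struct entry {
        RB_ENTRY(entry)         link;
        unsigned long           start;  /* lookup key */
};

static int
entry_cmp(struct entry *a, struct entry *b)
{
        if (a->start < b->start)
                return (-1);
        return (a->start > b->start);
}

RB_HEAD(entry_tree, entry);
RB_PROTOTYPE(entry_tree, entry, link, entry_cmp);
RB_GENERATE(entry_tree, entry, link, entry_cmp);

struct map {
        struct entry_tree        tree;
        struct entry            *hint;  /* last entry found, advisory only */
};

static struct entry *
map_lookup(struct map *m, unsigned long start)
{
        struct entry key = { .start = start };
        struct entry *e;

        e = m->hint;
        if (e != NULL && e->start == start)
                return (e);     /* hit: no tree walk, no node dirtied */
        e = RB_FIND(entry_tree, &m->tree, &key);
        if (e != NULL)
                m->hint = e;    /* lazy update; a stale hint only costs a walk */
        return (e);
}

The hint store is the only mutation on the lookup path, so correctness
never depends on it; in the kernel the update and the lifetime of the
pointed-to entry would of course still need care under a shared lock.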
We've talked internally at $work about using a B+-tree with maybe a
branching factor of 5-7, whatever makes sense for the size of a cache
line.  This seems likely to be better for performance than an RB tree,
but it requires a lot more changes, since separate memory is needed for
the tree's nodes outside the vm_page structure.  There just hasn't been
time to implement it and try it out.

Unfortunately I won't have the time to experiment on this problem at
$work for a few months.

Thanks,
matthew
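P.S. To put some arithmetic behind "whatever makes sense for the size
of a cache line": with 64-bit keys and 64-bit child pointers, a single
64-byte line holds only about 3 keys plus 4 children, so a branching
factor of 5-7 would mean either a two-line node or narrower keys.
Purely illustrative sketch of what a one-line interior node could look
like (names and numbers made up, nothing here is from a real patch):

#include <stdint.h>

#define BT_FANOUT       4       /* children per interior node */

/*
 * Interior node sized to exactly one 64-byte cache line on amd64:
 * 8 (count + padding) + 3*8 (keys) + 4*8 (child pointers) = 64.
 * Leaves would carry vm_page pointers instead of child pointers.
 */
struct btnode {
        uint16_t         bt_nkeys;                      /* valid keys, 0..3 */
        uint64_t         bt_keys[BT_FANOUT - 1];        /* separator keys */
        struct btnode   *bt_child[BT_FANOUT];           /* children */
};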