From owner-freebsd-current@FreeBSD.ORG  Thu Sep 30 18:57:39 2010
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CE7B0106564A;
	Thu, 30 Sep 2010 18:57:39 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 8EAC18FC14;
	Thu, 30 Sep 2010 18:57:39 +0000 (UTC)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 51C901A3D23; Thu, 30 Sep 2010 11:38:53 -0700 (PDT)
Date: Thu, 30 Sep 2010 11:38:53 -0700
From: Alfred Perlstein
To: Andre Oppermann
Message-ID: <20100930183853.GX49807@elvis.mu.org>
References: <4CA4BCD2.4070303@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4CA4BCD2.4070303@freebsd.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-hackers, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Thu, 30 Sep 2010 18:57:39 -0000

Andre,

Your observations on the effectiveness of the splay tree mirror the
concerns I had with it when I first read about it.

I have always wondered, though, what would happen if the splay-tree
algorithm were modified to perform rotations only when a lookup
requires more than "N" traversals to reach a node.  This would still
self-balance the tree and preserve the root-node caching behavior,
but without the expense of writes on nearly all lookups.

I'm wondering if you have the time/interest to rerun your tests with
the algorithm modified to rebalance the splay only when a lookup
requires more than, let's say, 3, 5, 7 or 10 traversals.  A rough
sketch of what I have in mind is below.
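This is a toy, standalone sketch only, not the kernel's sys/tree.h
SPLAY_* macros: "node", "lookup_threshold" and SPLAY_THRESHOLD are
made-up names, the keys are plain ints, and the proper zig-zig/zig-zag
splay steps are simplified to single rotations to keep it short:

#include <stddef.h>

#define SPLAY_THRESHOLD	5	/* splay only if we walked deeper */

struct node {
	int		 key;
	struct node	*left, *right, *parent;
};

/* One single rotation moving n above its parent. */
static void
rotate_up(struct node **root, struct node *n)
{
	struct node *p = n->parent, *g = p->parent;

	if (p->left == n) {		/* right rotation */
		p->left = n->right;
		if (n->right != NULL)
			n->right->parent = p;
		n->right = p;
	} else {			/* left rotation */
		p->right = n->left;
		if (n->left != NULL)
			n->left->parent = p;
		n->left = p;
	}
	p->parent = n;
	n->parent = g;
	if (g == NULL)
		*root = n;
	else if (g->left == p)
		g->left = n;
	else
		g->right = n;
}

/*
 * Move n to the root.  Real splaying uses zig-zig/zig-zag double
 * rotations to get the amortized O(log n) bound; plain move-to-root
 * is used here only for brevity.
 */
static void
splay_to_root(struct node **root, struct node *n)
{
	while (n->parent != NULL)
		rotate_up(root, n);
}

/*
 * Lookup that stays a pure read unless the search walked more than
 * SPLAY_THRESHOLD links, so only deep (already expensive) lookups
 * pay the write cost of rebalancing.  Insert/remove are omitted.
 */
struct node *
lookup_threshold(struct node **root, int key)
{
	struct node	*n = *root;
	int		 depth = 0;

	while (n != NULL && n->key != key) {
		n = (key < n->key) ? n->left : n->right;
		depth++;
	}
	if (n != NULL && depth > SPLAY_THRESHOLD)
		splay_to_root(root, n);
	return (n);
}

The threshold keeps the common root/near-root hit a pure read;
whether 3, 5, 7 or 10 is the right cutoff is exactly the kind of
question your counters could answer.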
What do you think?

-Alfred

* Andre Oppermann [100930 09:38] wrote:
> Just for the kick of it I decided to take a closer look at the use
> of splay trees (inherited from Mach, if I read the history
> correctly) in the FreeBSD VM system, suspecting an interesting
> journey.
> 
> The VM system has two major structures:
> 1) the VM map, which is per process and manages its address space
>    by tracking all regions that are allocated and their permissions.
> 2) the VM page table, which is per VM map and provides the backing
>    memory pages to the regions and interfaces with the pmap layer
>    (virtual->physical).
> 
> Both of these are very frequently accessed through memory
> allocations from userspace and page faults from the pmap layer.
> Their efficiency and SMP scalability are crucial for many
> operations, beginning with fork/exec, going through malloc/mmap and
> ending with free/exit of a process.
> 
> Both the vmmap and the page table use splay trees to manage the
> entries and to speed up lookups compared to long-to-traverse linked
> lists or more memory-expensive hash tables.  Some structures do
> carry an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time, the
> *amortized* bound being the crucial difference to other binary
> trees.  On every access, *including* lookups, it rotates the tree
> to make the just-found element the new root node.  For all the gory
> details see:
> http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
> 1) the root node and the constant rotation work like a cache, with
>    the most recently looked-up node at the root and other recently
>    used nodes close to the top;
> 2) every lookup that doesn't match the root node, i.e. a cache
>    miss, causes a rotation of the tree to make the newly found node
>    the new root;
> 3) during the rotation it has to rearrange and touch a possibly
>    large number of other nodes;
> 4) in the worst case the tree may degenerate into a linked list; a
>    splay tree makes no guarantee that it is balanced.
> 
> For the kernel, and the splay tree usage in the VM map and page
> table, some further implications come up:
> a) the amortization effect/cost only balances out if the majority
>    of lookups are root-node (cache) hits; otherwise the cost of
>    rebalancing destroys any advantage and quickly turns into a
>    large overhead.
> b) the overhead becomes even worse because touching many nodes
>    dirties a lot of CPU cache lines.  For processes with shared
>    memory, or threads spread across CPUs, this overhead becomes
>    almost excessive because the CPUs stomp on each other's cached
>    nodes and get their own caches invalidated.  The penalty is a
>    high-latency memory refetch, not only for the root-node lookup
>    but for every hop in the subsequent rebalancing operation =>
>    thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay trees in the vmmap and page table.
> 
> The test case is an AMD64 kernel build, to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
> The page table shows a *very* poor root-node hit rate and an
> excessive amount of rotational node modifications, representing
> almost the worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> The vmmap shows a high root-node hit rate, but still a significant
> amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly becomes horrible even
> on small deviations from the normal case in the vmmap.  For the
> page table it is always horrible.  Not to mention the locking
> complications due to modifying the tree on lookups.
> 
> Based on an examination of other binary trees I put forward the
> hypothesis that a red-black tree is a much better, if not the
> optimal, fit for the vmmap and page table.  RB trees are balanced
> binary trees and O(log n) in all cases.  The big advantage in this
> context is that lookups are pure reads: they do not cause CPU cache
> invalidations on other CPUs and always require only a read lock,
> without the worst-case properties of the unbalanced splay tree.
> The high cache locality of the vmmap lookups can be exploited with
> the RB tree as well, by simply adding a pointer to the most
> recently found node.  To avoid write locking, this can be done
> lazily.  More profiling is required to make a non-speculative
> statement on this though.
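Replying inline to the cached-pointer idea above: the lazy hint could
be as small as the sketch below.  It uses the stock <sys/tree.h> RB
macros, but "map", "mapent" and the "hint" field are names I made up
for illustration, not the real vm_map structures, and the locking is
hand-waved:

#include <stddef.h>
#include <sys/tree.h>		/* the FreeBSD/OpenBSD RB macros */

struct mapent {
	RB_ENTRY(mapent) link;
	int		 start;		/* stand-in for a map address */
};

static int
mapent_cmp(struct mapent *a, struct mapent *b)
{
	return (a->start < b->start ? -1 : a->start > b->start);
}

RB_HEAD(maptree, mapent);
RB_PROTOTYPE(maptree, mapent, link, mapent_cmp);
RB_GENERATE(maptree, mapent, link, mapent_cmp)

struct map {
	struct maptree	 tree;
	struct mapent	*hint;		/* most recently found entry */
};

/*
 * Check the hint first; on a miss fall back to a plain read-only
 * RB_FIND and update the hint lazily.  Under a read lock the update
 * is a single racy pointer store: losing the race only costs a
 * future hint miss, not correctness, as long as removals clear the
 * hint while holding the write lock.
 */
struct mapent *
map_lookup(struct map *m, int start)
{
	struct mapent	 key, *e;

	e = m->hint;
	if (e != NULL && e->start == start)
		return (e);		/* hint hit: no tree walk at all */

	key.start = start;
	e = RB_FIND(maptree, &m->tree, &key);
	if (e != NULL)
		m->hint = e;		/* lazy, possibly racy, update */
	return (e);
}

The point being that the write stays entirely out of the lookup path
except on hint misses, which is exactly the property the splay tree
lacks.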
> In addition, a few of the linked lists that currently accompany the
> vmmap and page structures are no longer necessary, as they can
> easily be replaced with the standard RB tree accessors.  Our
> standard userspace jemalloc also uses RB trees for its internal
> housekeeping.  RB tree details:
> http://en.wikipedia.org/wiki/Red-black_tree
> 
> I say hypothesis because I haven't measured the difference against
> an RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to use
> RB trees.  The vmmap part is not stable yet; the page part seems to
> work fine though.
> 
> This is what I've hacked together so far:
> http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
> http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
> 
> The diffs are still in their early stages and do not make use of
> any code simplifications that become possible by using RB trees
> instead of the current splay trees.
> 
> Comments on the VM issue and the splay vs. RB tree hypothesis are
> welcome.
> 
> -- 
> Andre
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"

-- 
- Alfred Perlstein
.- VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer