From owner-freebsd-current@FreeBSD.ORG  Thu Sep 30 18:57:39 2010
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CE7B0106564A;
	Thu, 30 Sep 2010 18:57:39 +0000 (UTC)
	(envelope-from bright@elvis.mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
	by mx1.freebsd.org (Postfix) with ESMTP id 8EAC18FC14;
	Thu, 30 Sep 2010 18:57:39 +0000 (UTC)
Received: by elvis.mu.org (Postfix, from userid 1192)
	id 51C901A3D23; Thu, 30 Sep 2010 11:38:53 -0700 (PDT)
Date: Thu, 30 Sep 2010 11:38:53 -0700
From: Alfred Perlstein
To: Andre Oppermann
Message-ID: <20100930183853.GX49807@elvis.mu.org>
References: <4CA4BCD2.4070303@freebsd.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4CA4BCD2.4070303@freebsd.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-hackers, freebsd-current@freebsd.org
Subject: Re: Examining the VM splay tree effectiveness
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Thu, 30 Sep 2010 18:57:39 -0000

Andre,

Your observations on the effectiveness of the splay tree mirror the
concerns I had with it when I first read about it.

I have always wondered, though, what would happen if the splay-tree
algorithm were modified to perform rotations only when a lookup
requires more than "N" traversals to reach a node.  This would still
self-balance the tree and preserve the root-node caching behavior,
but without the expense of writes on nearly all lookups.

I'm wondering if you have the time/interest to rerun your tests with
the algorithm modified to rebalance the splay only when a lookup
requires more than, let's say, 3, 5, 7 or 10 traversals.  A rough
sketch of what I have in mind is below.
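This is a toy, standalone sketch only, not the kernel's sys/tree.h
SPLAY_* macros: "node", "lookup_threshold" and SPLAY_THRESHOLD are
made-up names, the keys are plain ints, and the proper zig-zig/zig-zag
splay steps are simplified to single rotations to keep it short:

#include <stddef.h>

#define SPLAY_THRESHOLD	5	/* splay only if we walked deeper */

struct node {
	int		 key;
	struct node	*left, *right, *parent;
};

/* One single rotation moving n above its parent. */
static void
rotate_up(struct node **root, struct node *n)
{
	struct node *p = n->parent, *g = p->parent;

	if (p->left == n) {		/* right rotation */
		p->left = n->right;
		if (n->right != NULL)
			n->right->parent = p;
		n->right = p;
	} else {			/* left rotation */
		p->right = n->left;
		if (n->left != NULL)
			n->left->parent = p;
		n->left = p;
	}
	p->parent = n;
	n->parent = g;
	if (g == NULL)
		*root = n;
	else if (g->left == p)
		g->left = n;
	else
		g->right = n;
}

/*
 * Move n to the root.  Real splaying uses zig-zig/zig-zag double
 * rotations to get the amortized O(log n) bound; plain move-to-root
 * is used here only for brevity.
 */
static void
splay_to_root(struct node **root, struct node *n)
{
	while (n->parent != NULL)
		rotate_up(root, n);
}

/*
 * Lookup that stays a pure read unless the search walked more than
 * SPLAY_THRESHOLD links, so only deep (already expensive) lookups
 * pay the write cost of rebalancing.  Insert/remove are omitted.
 */
struct node *
lookup_threshold(struct node **root, int key)
{
	struct node	*n = *root;
	int		 depth = 0;

	while (n != NULL && n->key != key) {
		n = (key < n->key) ? n->left : n->right;
		depth++;
	}
	if (n != NULL && depth > SPLAY_THRESHOLD)
		splay_to_root(root, n);
	return (n);
}

The threshold keeps the common root/near-root hit a pure read;
whether 3, 5, 7 or 10 is the right cutoff is exactly the kind of
question your counters could answer.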
What do you think?

-Alfred

* Andre Oppermann [100930 09:38] wrote:
> Just for the kick of it I decided to take a closer look at the use
> of splay trees (inherited from Mach, if I read the history
> correctly) in the FreeBSD VM system, suspecting an interesting
> journey.
> 
> The VM system has two major structures:
> 1) the VM map, which is per process and manages its address space
>    by tracking all regions that are allocated and their permissions.
> 2) the VM page table, which is per VM map and provides the backing
>    memory pages to the regions and interfaces with the pmap layer
>    (virtual->physical).
> 
> Both of these are very frequently accessed through memory
> allocations from userspace and page faults from the pmap layer.
> Their efficiency and SMP scalability are crucial for many
> operations, beginning with fork/exec, going through malloc/mmap and
> ending with free/exit of a process.
> 
> Both the vmmap and the page table use splay trees to manage the
> entries and to speed up lookups compared to long-to-traverse linked
> lists or more memory-expensive hash tables.  Some structures do
> carry an additional linked list to simplify ordered traversals.
> 
> A splay tree is an interesting binary search tree with insertion,
> lookup and removal performed in O(log n) *amortized* time, the
> *amortized* bound being the crucial difference to other binary
> trees.  On every access, *including* lookups, it rotates the tree
> to make the just-found element the new root node.  For all the gory
> details see:
> http://en.wikipedia.org/wiki/Splay_tree
> 
> This behavior has a few important implications:
> 1) the root node and the constant rotation work like a cache, with
>    the most recently looked-up node at the root and other recently
>    used nodes close to the top;
> 2) every lookup that doesn't match the root node, i.e. a cache
>    miss, causes a rotation of the tree to make the newly found node
>    the new root;
> 3) during the rotation it has to rearrange and touch a possibly
>    large number of other nodes;
> 4) in the worst case the tree may degenerate into a linked list; a
>    splay tree makes no guarantee that it is balanced.
> 
> For the kernel, and the splay tree usage in the VM map and page
> table, some further implications come up:
> a) the amortization effect/cost only balances out if the majority
>    of lookups are root-node (cache) hits; otherwise the cost of
>    rebalancing destroys any advantage and quickly turns into a
>    large overhead.
> b) the overhead becomes even worse because touching many nodes
>    dirties a lot of CPU cache lines.  For processes with shared
>    memory, or threads spread across CPUs, this overhead becomes
>    almost excessive because the CPUs stomp on each other's cached
>    nodes and get their own caches invalidated.  The penalty is a
>    high-latency memory refetch, not only for the root-node lookup
>    but for every hop in the subsequent rebalancing operation =>
>    thrashing.
> 
> To quantify the positive and negative effects of the splay tree I
> instrumented the code and added counters to track the behavior of
> the splay trees in the vmmap and page table.
> 
> The test case is an AMD64 kernel build, to get a first overview.
> Other system usages may have different results depending on their
> fork and memory usage patterns.
> 
> The results are very interesting:
> 
> The page table shows a *very* poor root-node hit rate and an
> excessive amount of rotational node modifications, representing
> almost the worst case for splay trees.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,127484769&chd=t:16946915,16719791,48872230,131057,74858589,105206121&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> The vmmap shows a high root-node hit rate, but still a significant
> amount of rotational node modifications.
> 
> http://chart.apis.google.com/chart?chs=700x300&chbh=a,23&chds=0,21881583&chd=t:724293,723701,20881583,19044544,3719582,4553350&cht=bvg&chl=insert|remove|lookup|cachehit|rotates|rotatemodifies
> 
> From these observations I come to the conclusion that a splay tree
> is suboptimal for the normal case and quickly becomes horrible even
> on small deviations from the normal case in the vmmap.  For the
> page table it is always horrible.  Not to mention the locking
> complications due to modifying the tree on lookups.
> 
> Based on an examination of other binary trees I put forward the
> hypothesis that a red-black tree is a much better, if not the
> optimal, fit for the vmmap and page table.  RB trees are balanced
> binary trees and O(log n) in all cases.  The big advantage in this
> context is that lookups are pure reads: they do not cause CPU cache
> invalidations on other CPUs and always require only a read lock,
> without the worst-case properties of the unbalanced splay tree.
> The high cache locality of the vmmap lookups can be exploited with
> the RB tree as well, by simply adding a pointer to the most
> recently found node.  To avoid write locking, this can be done
> lazily.  More profiling is required to make a non-speculative
> statement on this though.
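Replying inline to the cached-pointer idea above: the lazy hint could
be as small as the sketch below.  It uses the stock <sys/tree.h> RB
macros, but "map", "mapent" and the "hint" field are names I made up
for illustration, not the real vm_map structures, and the locking is
hand-waved:

#include <stddef.h>
#include <sys/tree.h>		/* the FreeBSD/OpenBSD RB macros */

struct mapent {
	RB_ENTRY(mapent) link;
	int		 start;		/* stand-in for a map address */
};

static int
mapent_cmp(struct mapent *a, struct mapent *b)
{
	return (a->start < b->start ? -1 : a->start > b->start);
}

RB_HEAD(maptree, mapent);
RB_PROTOTYPE(maptree, mapent, link, mapent_cmp);
RB_GENERATE(maptree, mapent, link, mapent_cmp)

struct map {
	struct maptree	 tree;
	struct mapent	*hint;		/* most recently found entry */
};

/*
 * Check the hint first; on a miss fall back to a plain read-only
 * RB_FIND and update the hint lazily.  Under a read lock the update
 * is a single racy pointer store: losing the race only costs a
 * future hint miss, not correctness, as long as removals clear the
 * hint while holding the write lock.
 */
struct mapent *
map_lookup(struct map *m, int start)
{
	struct mapent	 key, *e;

	e = m->hint;
	if (e != NULL && e->start == start)
		return (e);		/* hint hit: no tree walk at all */

	key.start = start;
	e = RB_FIND(maptree, &m->tree, &key);
	if (e != NULL)
		m->hint = e;		/* lazy, possibly racy, update */
	return (e);
}

The point being that the write stays entirely out of the lookup path
except on hint misses, which is exactly the property the splay tree
lacks.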
> In addition, a few of the linked lists that currently accompany the
> vmmap and page structures are no longer necessary, as they can
> easily be replaced with the standard RB tree accessors.  Our
> standard userspace jemalloc also uses RB trees for its internal
> housekeeping.  RB tree details:
> http://en.wikipedia.org/wiki/Red-black_tree
> 
> I say hypothesis because I haven't measured the difference against
> an RB tree implementation yet.  I've hacked up a crude and somewhat
> mechanical patch to convert the vmmap and page VM structures to use
> RB trees.  The vmmap part is not stable yet; the page part seems to
> work fine though.
> 
> This is what I've hacked together so far:
> http://people.freebsd.org/~andre/vmmap_vmpage_stats-20100930.diff
> http://people.freebsd.org/~andre/vmmap_vmpage_rbtree-20100930.diff
> 
> The diffs are still in their early stages and do not make use of
> any code simplifications that become possible by using RB trees
> instead of the current splay trees.
> 
> Comments on the VM issue and the splay vs. RB tree hypothesis are
> welcome.
> 
> -- 
> Andre
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"

-- 
- Alfred Perlstein
.- VMOA #5191, 03 vmax, 92 gs500, 85 ch250, 07 zx10
.- FreeBSD committer