From owner-freebsd-arch@FreeBSD.ORG Wed Dec 10 02:24:34 2008
Date: Tue, 9 Dec 2008 16:22:44 -1000 (HST)
From: Jeff Roberson <jroberson@jroberson.net>
To: arch@freebsd.org
Message-ID: <20081209155714.K960@desktop>
Subject: UMA & mbuf cache utilization.
List-Id: Discussion related to FreeBSD architecture

Hello,

Nokia has graciously allowed me to release a patch which I developed to improve general mbuf and cluster cache behavior. This is based on others' observations that, due to simple alignment at 2KB and 256 bytes, we achieve a poor cache distribution for the header area of packets and for the most heavily used mbuf header fields. In addition, modern machines stripe memory accesses across several memories and even memory controllers.
Accessing heavily aligned locations such as these can also create load imbalances among memories.

To solve this problem I have added two new features to UMA. The first is the zone flag UMA_ZONE_CACHESPREAD. This flag changes the meaning of the alignment field so that item start addresses are staggered by at least align + 1 bytes. For clusters and mbufs this means adding uma_cache_align + 1 bytes to the amount of storage allocated, which creates a small constant amount of waste: roughly 3% and 12% respectively. It also means we must use contiguous physical and virtual memory spanning several pages to use the memory efficiently and land on as many cache lines as possible.

Because contiguous physical memory is not always available, the allocator needs a fallback mechanism. We don't want every mbuf allocation to simply check two zones, because once the available contiguous memory is depleted, the check on the first zone would always fail via the most expensive code path. To resolve this, I added the ability for secondary zones to stack on top of multiple primary zones. Secondary zones are zones which get their storage from another zone but handle their own caching, ctors, dtors, etc. With this feature, a secondary zone can allocate from either the contiguous memory pool or the non-contiguous single-page pool depending on availability. Failing over between them deep in the allocator is also much faster, because it is only required when we exhaust the already-cached mbuf memory.

For mbufs and clusters there are now three zones each: a contigmalloc-backed zone, a single-page allocator zone, and a secondary zone carrying the original zone_mbuf or zone_clust name. The packet zone also draws from both available mbuf zones. The individual backend zones are not exposed outside of kern_mbuf.c. Currently, each backend zone can have its own limit; the secondary zone only blocks when both are full.
Statistics-wise, the limit should be reported as the sum of the backend limits; however, that isn't presently done. The secondary zone cannot have its own limit independent of the backends at this time. I'm not sure whether that would be valuable.

I have test results from Nokia which show a dramatic improvement in several workloads, but which I am probably not at liberty to discuss. I'm in the process of convincing Kip to help me get some benchmark data on our stack.

Also as part of the patch I renamed a few functions, since many had non-obvious names, and grew new keg abstractions to tidy things up a bit. I suspect those of you with UMA experience (robert, bosko) will find the renaming a welcome improvement.

The patch is available at: http://people.freebsd.org/~jeff/mbuf_contig.diff

I would love to hear any feedback you may have. I have been developing this and testing various versions off and on for months; however, this is a fresh port to current and it is a little green, so it should be considered experimental. In particular, I'm most nervous about how the VM will respond to new pressure on contiguous physical pages. I'm also interested in hearing from embedded/limited-memory people about how we might want to limit or tune this.

Thanks,
Jeff