From owner-freebsd-net@FreeBSD.ORG Thu Jan 3 00:01:41 2008
Message-ID: <477C25E2.4080303@freebsd.org>
Date: Thu, 03 Jan 2008 01:01:38 +0100
From: Andre Oppermann <andre@freebsd.org>
To: "Bruce M. Simpson"
Cc: freebsd-net@freebsd.org, Tiffany Snyder
Subject: Re: Routing SMP benefit
References: <43B45EEF.6060800@x-trader.de> <43B47CB5.3C0F1632@freebsd.org>
 <477C1434.80106@freebsd.org> <477C1776.2080002@FreeBSD.org>
In-Reply-To: <477C1776.2080002@FreeBSD.org>

Bruce M. Simpson wrote:
> Andre Oppermann wrote:
>> So far the limit on the PPS rate has primarily been the cache miss
>> penalties on packet access.  Multiple CPUs can help here of course for
>> bi-directional traffic.  Hardware-based packet header cache prefetching,
>> as done by some embedded MIPS-based network processors, at least doubles
>> the performance.  Intel has something like this for a couple of chipset
>> and network chip combinations.  We don't support that feature yet though.
>
> What sort of work is needed in order to support header prefetch?

Extracting the documentation out of Intel is the first step.  It's called
Direct Cache Access (DCA).  At least in the Linux implementation it has
been intermingled with I/OAT, which is an asynchronous memory controller
based DMA copy mechanism.  I don't know whether the two really have to go
together.

The idea of DCA is to have the memory controller, upon DMA'ing a packet
into main memory, also load it into the CPU cache(s) right away.  For
packet forwarding the first 128 bytes are sufficient.  For server
applications and TCP it may be beneficial to prefetch the whole packet.
That may cause considerable cache pollution though, depending on usage.

Some pointers:

 http://www.stanford.edu/group/comparch/papers/huggahalli05.pdf
 http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dca;hb=HEAD
 http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=tree;f=drivers/dma;hb=HEAD
 http://download.intel.com/technology/comms/perfnet/download/ServerNetworkIOAccel.pdf
 http://download.intel.com/design/network/prodbrf/317796.pdf
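To make the idea a bit more concrete in software terms: this is not DCA
itself (that needs chipset and driver support we don't have), just an
ordinary compiler prefetch hint with invented names, but it shows the kind
of header touch a forwarding path wants before it starts parsing:

#include <stddef.h>

#define PKT_PREFETCH_BYTES 128  /* enough for L2/L3/L4 headers when forwarding */
#define CACHE_LINE_SIZE     64

/*
 * Touch the first 128 bytes of a freshly DMA'd packet before parsing it,
 * so the header cache misses overlap with other work instead of stalling
 * the forwarding path.  DCA would do the equivalent in hardware at DMA time.
 */
static inline void
pkt_header_prefetch(const void *pkt)
{
	const char *p = pkt;
	size_t off;

	for (off = 0; off < PKT_PREFETCH_BYTES; off += CACHE_LINE_SIZE)
		__builtin_prefetch(p + off, 0 /* read */, 3 /* keep in all caches */);
}

With DCA the headers would already be warm in the cache by the time the
forwarding code gets to them, so even this hint becomes unnecessary.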
>> Many of the things you mention here are planned for FreeBSD 8.0 in the
>> same or a different form.  Work in progress is the separation of the ARP
>> table from the kernel routing table.  If we can prevent references to
>> radix nodes generally, almost all locking can be done away with.  Instead
>> only a global rmlock (read-mostly lock) could govern the entire routing
>> table.  Obtaining the rmlock for reading is essentially free.
>
> This is exactly what I'm thinking; this feels like the right way forward.
>
> A single rwlock should be fine; route table updates should generally
> only be happening from one process, and thus a single thread, at any
> given time.

rmlocks are even faster, and the change-to-use ratio of the routing table
is also quite right for them.

>> Table changes are very infrequent compared to lookups (something like
>> 700,000 to 300-400) in default-free Internet routing.  The radix trie
>> nodes are rather big and could use some more trimming to make them fit
>> a single cache line.  I've already removed some stuff a couple of years
>> ago and more can be done.
>>
>> It's very important to keep this in mind: "profile, don't speculate".
>
> Beware though that functionality isn't sacrificed in the pursuit of this.
>
> For example it would be very, very useful to be able to merge the
> multicast routing implementation with the unicast one -- with the proviso
> of course that mBGP requires that RPF can be performed against a separate
> set of FIB entries from the unicast FIB.
>
> Of course that's easier if next-hops themselves are held in a container
> referenced separately from the radix node, such as a simple linked list
> as per the OpenBSD code.

I haven't looked at the multicast code so I can't comment.  The other
stuff is just talk so far.  No work in progress, at least from my side.

> If we ensure the parent radix trie node object fits in a cache line,
> then that's fine.
>
> [I am looking at some stuff in the dynamic/ad-hoc/mesh space which is
> really going to need support for multipath similar to this.]

I was looking at a parallel forwarding table for fastforward that is
highly optimized for IPv4 and cache efficiency.  It was supposed to be
8-bit stride based (256-ary) with SSE-based multi-segment longest prefix
match updates.  I never managed to get it past the design stage though,
and it's not one of the pressing issues.

The radix trie is pretty efficient for being architecture independent.
Even though the depth of the trie and the variety in destination addresses
matter, it never really turned out to become a bottleneck in my profiles
at the time.  It does have its limitations though, which become more
apparent at very high PPS and with very large routing tables as in the DFZ.

-- 
Andre
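P.S.  To give an idea of what the 256-ary approach looks like, here is a
very stripped-down userland sketch of an 8-bit stride lookup table.  The
names and layout are invented for illustration only -- this is not the
actual design, and the SSE-based multi-segment update path is left out
entirely:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One 8-bit stride (256-ary) trie node. */
struct stride8_node {
	struct stride8_node *child[256];   /* next stride, or NULL           */
	int16_t  nexthop[256];             /* nexthop index, -1 if none      */
	uint8_t  plen[256];                /* prefix length that set nexthop */
};

static struct stride8_node *
node_alloc(void)
{
	struct stride8_node *n = calloc(1, sizeof(*n));
	memset(n->nexthop, 0xff, sizeof(n->nexthop));   /* all entries -1 */
	return n;
}

/* Insert prefix/plen (plen 1..32, prefix in host byte order) -> nexthop. */
static void
stride8_insert(struct stride8_node *root, uint32_t prefix, int plen,
    int16_t nexthop)
{
	struct stride8_node *n = root;
	int level, last = (plen - 1) / 8;       /* level where the prefix ends */

	for (level = 0; level < last; level++) {
		uint8_t byte = prefix >> (24 - 8 * level);
		if (n->child[byte] == NULL)
			n->child[byte] = node_alloc();
		n = n->child[byte];
	}

	/* Expand the remaining bits into all entries they cover. */
	int bits = plen - 8 * last;
	uint8_t base = (prefix >> (24 - 8 * last)) & (uint8_t)(0xff << (8 - bits));
	for (int i = 0; i < (1 << (8 - bits)); i++) {
		uint8_t idx = base + i;
		if (plen >= n->plen[idx]) {     /* more specific prefix wins */
			n->nexthop[idx] = nexthop;
			n->plen[idx] = plen;
		}
	}
}

/* Longest prefix match: at most four dependent loads for an IPv4 address. */
static int16_t
stride8_lookup(const struct stride8_node *root, uint32_t dst)
{
	const struct stride8_node *n = root;
	int16_t best = -1;

	for (int level = 0; n != NULL && level < 4; level++) {
		uint8_t byte = dst >> (24 - 8 * level);
		if (n->nexthop[byte] != -1)
			best = n->nexthop[byte];   /* most specific match so far */
		n = n->child[byte];
	}
	return best;
}

The point of the 8-bit stride is that a full IPv4 lookup is bounded at four
dependent memory accesses, each into a node that can be laid out with cache
efficiency in mind.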