From owner-freebsd-arch@FreeBSD.ORG  Sat Aug 20 13:43:18 2011
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id ECA831065670;
	Sat, 20 Aug 2011 13:43:17 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 7922F8FC0A;
	Sat, 20 Aug 2011 13:43:17 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 38CAB7300A; Sat, 20 Aug 2011 15:45:30 +0200 (CEST)
Date: Sat, 20 Aug 2011 15:45:30 +0200
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Robert Watson <rwatson@FreeBSD.org>
Message-ID: <20110820134530.GA42942@onelab2.iet.unipi.it>
References: <slrnj4oiiq.21rg.vadim_nuclight@kernblitz.nuclight.avtf.net>
	<810527321.20110819123700@serebryakov.spb.ru>
	<201108191401.23083.pieter@degoeje.nl>
	<425884435.20110819175307@serebryakov.spb.ru>
	<20110819172252.GE88904@in-addr.com>
	<368496955.20110820101506@serebryakov.spb.ru>
	<alpine.BSF.2.00.1108201234280.4529@fledge.watson.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.BSF.2.00.1108201234280.4529@fledge.watson.org>
User-Agent: Mutt/1.4.2.3i
Cc: Lev Serebryakov <lev@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: 10gbps scalability (was: Re: FreeBSD problems and preliminary
	ways to solve)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 20 Aug 2011 13:43:18 -0000

On Sat, Aug 20, 2011 at 12:38:26PM +0100, Robert Watson wrote:
> 
> On Sat, 20 Aug 2011, Lev Serebryakov wrote:
> 
> >>Can you honestly say the same about handling line rate packet forwarding
> >>for multiple 10G cards?
> >
> >I agree with you. I've not say, that 10G routing is very important for 
> >many users. My comment about 10G was answer to statement, that "The niche 
> >for routers & traffic analysis is still ours.". I wanted to say, that it 
> >is so may be now, but not for long.
> 
> Part of the key here will be reworking things like ipfw(4) and pf(4) to 
> scale better than they do currently.  For pf(4), it's particularly 
...
> These are closely related to the issue of userspace networking, which Luigi 
> is starting to explore with netmap.  Ideally, you could use the same NIC 
> for both kernel network stack stuff and userspace applications, using 
> hardware filters to decide whether individual packets go to a descriptor 
> ring in the kernel or userspace.  Solarflare's Open Onload is an 
...

Thanks to Robert for changing the subject (because i believe that
10G operation is at the bottom of the list of issues that Vadim
brought up).

Regarding netmap, i wanted to mention that, since the announcement
at the beginning of June, we now have a lot more stuff:
- an initial libpcap library, so a number of apps can run at
  much higher speed;
- OpenvSwitch support, which means that you can do userspace
  bridging much faster than
- the Click modular router now runs (in userspace) at up to 4Mpps
  per core, which is faster than in-kernel Linux.
A userspace version of ipfw should be available shortly,
and i have some work in progress to bring the forwarding tables
into userspace (but of course you can do the same with Click).
I also see people starting to use it, which is a good thing because
i am getting useful feedback on features and bugs, and patches
for more device drivers.

More (including a recently posted GoogleTechTalk) at
        http://info.iet.unipi.it/~luigi/netmap/
        http://www.youtube.com/watch?v=SPtoXNW9yEQ

I still think that it would have been nice (especially to compare
FreeBSD to Linux) to have netmap in 9.0, as it would have given
us the lead for sw packet processing solutions.

I understand that the timing of the netmap release was unfortunate
(due to the impending code freeze and summer and holidays), but
probably we could have given it a chance, since the code does not
make a single change to the kernel code except for device drivers,
and even those are small and #ifdef'ed out if you don't want
a netmap-enabled kernel.

Let's hope we find a way to import it into RELENG_9 and i will
do my best to distribute patches compatible with recent OS versions.

On the general issue of improving performance of the network stack,
I feel that to achieve significant speed improvements we should
really reconsider the way things are done in the network stack. 
And that comes before support for special HW features. 

In netmap at least, a large performance improvement came from getting
rid of mbufs. Per-packet allocation and deallocation are a huge
cost, and really an unnecessary one if the consumer of the packet
can do the processing inline instead of storing the packet and then
working on it a week later. Think for instance of TCP acks, which could
really be processed inline.  Same goes for firewalled traffic.
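
To make the "no per-packet allocation" point concrete, here is a
minimal sketch in C. It is NOT the netmap API: the names (struct ring,
ring_put, ring_drain, count_bytes) and the sizes are made up for
illustration. The buffers are allocated once together with the ring,
and the consumer works on each packet in place, so nothing is
allocated or freed per packet.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8u     /* hypothetical ring size (power of two) */
#define SLOT_BYTES 2048   /* hypothetical fixed buffer size */

struct slot {
    uint16_t len;              /* valid bytes in buf */
    uint8_t  buf[SLOT_BYTES];  /* storage allocated once, with the ring */
};

struct ring {
    uint32_t head;             /* next slot the consumer will process */
    uint32_t tail;             /* next slot the producer will fill */
    struct slot slots[RING_SLOTS];
};

/* Producer: place a frame in the next free slot.
 * Returns 0 on success, -1 if the ring is full or the frame is too big. */
static int ring_put(struct ring *r, const void *frame, uint16_t len)
{
    if (len > SLOT_BYTES || r->tail - r->head == RING_SLOTS)
        return -1;
    struct slot *s = &r->slots[r->tail % RING_SLOTS];
    memcpy(s->buf, frame, len);
    s->len = len;
    r->tail++;
    return 0;
}

typedef void (*pkt_handler)(const uint8_t *buf, uint16_t len, void *arg);

/* Consumer: handle every pending packet inline, in its slot.
 * A slot is given back to the producer simply by advancing head. */
static unsigned ring_drain(struct ring *r, pkt_handler fn, void *arg)
{
    unsigned n = 0;
    while (r->head != r->tail) {
        struct slot *s = &r->slots[r->head % RING_SLOTS];
        fn(s->buf, s->len, arg);
        r->head++;
        n++;
    }
    return n;
}

/* Example handler: add the packet length to the counter behind arg. */
static void count_bytes(const uint8_t *buf, uint16_t len, void *arg)
{
    (void)buf;
    *(size_t *)arg += len;
}
```

A caller would zero a struct ring once, then alternate ring_put() and
ring_drain(&r, count_bytes, &total); the handler could just as well be
an ack processor or a firewall check.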

For high speed TCP (i.e. sessions trying to stream data) we have a
lot of issues, two of which are below:
- we still have linear lists of buffers, which means
  that the cost of out-of-order incoming segments is O(N) (with N large
  at 1..10Gbps). Fixing that is way more important than improving
  the locking.
- on the outgoing side, the code makes no assumption about what happens
  to the MTU and incoming acks, so every transmission recomputes
  the boundaries of the segment to be sent. Never mind that in the
  real world the MTU is normally stable, and it would be a lot more
  efficient to store (in the socket buffer) and manage (in the stack)
  data as an array of MTU-sized buffers, optimize the fast path for that,
  and trap to a slowpath if something changes.
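
The second point can be sketched as follows. This is only an
illustration under my own assumptions, not existing stack code: the
names (struct sndbuf, sndbuf_fill, sndbuf_recut, sndbuf_next) and the
constants are hypothetical. Data is cut into segment-sized buffers
once; the transmit path just indexes the array while the segment size
is unchanged, and re-cuts the data only when it changes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEG_MAX 1460   /* hypothetical largest payload per segment */
#define NSEGS   16     /* hypothetical number of segment buffers */

struct seg {
    uint16_t len;
    uint8_t  data[SEG_MAX];
};

struct sndbuf {
    uint16_t seg_size;         /* payload size the data was cut for */
    unsigned nsegs;            /* segments currently in use */
    struct seg segs[NSEGS];
};

/* Cut a byte stream into seg_size pieces. Returns 0, or -1 if the
 * size is invalid or the data does not fit (sb is then untouched). */
static int sndbuf_fill(struct sndbuf *sb, const uint8_t *data,
                       size_t len, uint16_t seg_size)
{
    if (seg_size == 0 || seg_size > SEG_MAX ||
        len > (size_t)seg_size * NSEGS)
        return -1;
    sb->seg_size = seg_size;
    sb->nsegs = 0;
    while (len > 0) {
        uint16_t n = len < seg_size ? (uint16_t)len : seg_size;
        memcpy(sb->segs[sb->nsegs].data, data, n);
        sb->segs[sb->nsegs].len = n;
        sb->nsegs++;
        data += n;
        len -= n;
    }
    return 0;
}

/* Slow path: gather the buffered bytes and cut them again.
 * Uses a static scratch area, so this sketch is not reentrant. */
static int sndbuf_recut(struct sndbuf *sb, uint16_t new_size)
{
    static uint8_t tmp[SEG_MAX * NSEGS];
    size_t total = 0;
    for (unsigned i = 0; i < sb->nsegs; i++) {
        memcpy(tmp + total, sb->segs[i].data, sb->segs[i].len);
        total += sb->segs[i].len;
    }
    return sndbuf_fill(sb, tmp, total, new_size);
}

/* Transmit path: O(1) array lookup while the segment size is stable.
 * After a re-cut, segment indices refer to the new layout. */
static const struct seg *sndbuf_next(struct sndbuf *sb, unsigned idx,
                                     uint16_t cur_size)
{
    if (cur_size != sb->seg_size &&          /* rare: size changed */
        sndbuf_recut(sb, cur_size) != 0)
        return NULL;
    return idx < sb->nsegs ? &sb->segs[idx] : NULL;
}
```

A real stack would track positions by sequence number rather than by
array index, and would also have to deal with partial acks; the sketch
only shows the fast path / slow path split.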

cheers
luigi