Date: Sat, 20 Aug 2011 15:45:30 +0200
From: Luigi Rizzo
To: Robert Watson
Cc: Lev Serebryakov, freebsd-arch@freebsd.org
Subject: Re: 10gbps scalability (was: Re: FreeBSD problems and preliminary ways to solve)
Message-ID: <20110820134530.GA42942@onelab2.iet.unipi.it>
References: <810527321.20110819123700@serebryakov.spb.ru>
 <201108191401.23083.pieter@degoeje.nl>
 <425884435.20110819175307@serebryakov.spb.ru>
 <20110819172252.GE88904@in-addr.com>
 <368496955.20110820101506@serebryakov.spb.ru>

On Sat, Aug 20, 2011 at 12:38:26PM +0100, Robert Watson wrote:
>
> On Sat, 20 Aug 2011, Lev Serebryakov wrote:
>
> >>Can you honestly say the same about handling line rate packet forwarding
> >>for multiple 10G cards?
> >
> >I agree with you. I've not say, that 10G routing is very important for
> >many users. My comment about 10G was answer to statement, that "The niche
> >for routers & traffic analysis is still ours.". I wanted to say, that it
> >is so may be now, but not for long.
>
> Part of the key here will be reworking things like ipfw(4) and pf(4) to
> scale better than they do currently. For pf(4), it's particularly ...
> These are closely related to the issue of userspace networking, which Luigi
> is starting to explore with netmap. Ideally, you could use the same NIC
> for both kernel network stack stuff and userspace applications, using
> hardware filters to decide whether individual packets go to a descriptor
> ring in the kernel or userspace. Solarflare's Open Onload is an ...

Thanks to Robert for changing the subject (because I believe that 10G
operation is at the bottom of the list of issues that Vadim brought up).

Regarding netmap I wanted to mention that, since the announcement at the
beginning of June, we now have a lot more stuff:

- an initial libpcap library, so a number of apps can run at much
  higher speed;
- Open vSwitch support, which means that you can do userspace bridging
  much faster than
- the Click modular router now runs (in userspace) at up to 4Mpps per
  core, which is faster than in-kernel Linux;

A userspace version of ipfw should be available in a short time, and I
have some work in progress to bring the forwarding tables into userspace
(but of course you can do the same with Click).

I also see people starting to use it, which is a good thing because I am
getting useful feedback on features and bugs, and patches for more
device drivers.

More (including a recently posted GoogleTechTalk) at

	http://info.iet.unipi.it/~luigi/netmap/
	http://www.youtube.com/watch?v=SPtoXNW9yEQ

I still think that it would have been nice (especially to compare
FreeBSD to Linux) to have netmap in 9.0, as it would have given us the
lead in sw packet processing solutions.
I understand that the timing of the netmap release was unfortunate
(due to the impending code freeze, summer and holidays), but we could
probably have given it a chance, since the code does not make a single
change to the kernel code except for device drivers, and even those
changes are small and #ifdef'ed out if you don't want a netmap-enabled
kernel. Let's hope we find a way to import it into RELENG_9, and I will
do my best to distribute patches compatible with recent OS versions.

On the general issue of improving performance of the network stack,
I feel that to achieve significant speed improvements we should really
reconsider the way things are done in the network stack. And that
comes before support for special HW features.

In netmap at least, a large performance improvement came from getting
rid of mbufs. Per-packet allocation and deallocation are a huge cost,
and really an unnecessary one if the consumer of the packet can do the
processing inline instead of storing the packet and then working on it
a week later. Think for instance of TCP acks, which could really be
processed inline. The same goes for firewalled traffic.

For high speed TCP (i.e. sessions trying to stream data) we have a lot
of issues, two of which are below:

- we still have linear lists of buffers, which means that the cost of
  handling out-of-order incoming segments is O(N) (with N large at
  1..10Gbps). Fixing that is way more important than improving the
  locking.

- on the outgoing side, the code makes no assumption about what happens
  to the MTU and incoming acks, so every transmission recomputes the
  boundaries of the segment to be sent. Never mind that in the real
  world the MTU is normally stable, and it would be a lot more
  efficient to store (in the socket buffer) and manage (in the stack)
  data as an array of MTU-sized buffers, optimize the fast path for
  that, and trap to a slowpath if something changes.

cheers
luigi
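
A minimal sketch of the allocation-free pattern from the mbuf paragraph
above: packet slots are preallocated once, and the consumer works on
each packet in place instead of allocating, queueing and freeing a
buffer per packet. It assumes a single producer and a single consumer.
All names and sizes (pkt_ring, ring_put, ring_drain, count_bytes,
SLOT_SIZE, NUM_SLOTS) are hypothetical, and this is not the netmap API;
the memcpy only stands in for the NIC filling a slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 2048  /* hypothetical: room for one full-size frame */
#define NUM_SLOTS 64    /* hypothetical ring size, must be a power of two */

struct pkt_slot {
    uint16_t len;               /* valid bytes in data[] */
    uint8_t  data[SLOT_SIZE];   /* preallocated packet storage */
};

struct pkt_ring {
    uint32_t head;              /* next slot the producer fills */
    uint32_t tail;              /* next slot the consumer reads */
    struct pkt_slot slots[NUM_SLOTS];
};

/* Number of filled slots; unsigned wraparound keeps this correct. */
static uint32_t ring_used(const struct pkt_ring *r)
{
    return r->head - r->tail;
}

/*
 * Producer side. Returns 0 on success, -1 if the frame does not fit
 * in a slot or the ring is full. No memory is allocated here.
 */
static int ring_put(struct pkt_ring *r, const void *frame, size_t len)
{
    assert(ring_used(r) <= NUM_SLOTS);
    if (len > SLOT_SIZE || ring_used(r) == NUM_SLOTS)
        return -1;
    struct pkt_slot *s = &r->slots[r->head % NUM_SLOTS];
    memcpy(s->data, frame, len);    /* stand-in for the NIC writing the slot */
    s->len = (uint16_t)len;
    r->head++;
    return 0;
}

typedef void (*pkt_handler)(const uint8_t *data, uint16_t len, void *arg);

/*
 * Consumer side: run fn on up to 'budget' pending packets, in place,
 * then hand the slots back to the producer. Returns packets processed.
 */
static uint32_t ring_drain(struct pkt_ring *r, uint32_t budget,
                           pkt_handler fn, void *arg)
{
    uint32_t done = 0;
    while (done < budget && ring_used(r) > 0) {
        const struct pkt_slot *s = &r->slots[r->tail % NUM_SLOTS];
        fn(s->data, s->len, arg);   /* inline processing, no copy, no free */
        r->tail++;
        done++;
    }
    return done;
}

/* Example handler: accumulate the byte count into a uint64_t. */
static void count_bytes(const uint8_t *data, uint16_t len, void *arg)
{
    (void)data;
    *(uint64_t *)arg += len;
}
```

A caller would keep one statically allocated struct pkt_ring, call
ring_put() as frames arrive, and periodically call ring_drain() with a
handler such as count_bytes(). The budget argument bounds the work done
per call, in the spirit of batched processing.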