From: Luigi Rizzo
To: Andre Oppermann
Cc: current@freebsd.org, net@freebsd.org
Date: Thu, 19 Apr 2012 22:46:22 +0200
Subject: Re: Some performance measurements on the FreeBSD network stack

On Thu, Apr 19, 2012 at 10:05:37PM +0200, Andre Oppermann wrote:
> On 19.04.2012 15:30, Luigi Rizzo wrote:
> >I have been running some performance tests on UDP sockets,
> >using the netsend program in tools/tools/netrate/netsend
> >and instrumenting the source code and the kernel to return in
> >various points of the path. Here are some results which
> >I hope you find interesting.
>
> Jumping over very interesting analysis...
>
> >- the next expensive operation, consuming another 100ns,
> >  is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> >  seems to scale decently at least with 4 cores.
> >  The copyin() is relatively inexpensive (not reported in the data
> >  below, but disabling it saves only 15-20ns for a short packet).
> >
> >  I have not followed the details, but the allocator calls the zone
> >  allocator and there is at least one critical_enter()/critical_exit()
> >  pair, and the highly modular architecture invokes long chains of
> >  indirect function calls both on allocation and release.
> >
> >  It might make sense to keep a small pool of mbufs attached to the
> >  socket buffer instead of going to the zone allocator.
> >  Or defer the actual encapsulation to the
> >  (*so->so_proto->pr_usrreqs->pru_send)() which is called inline, anyways.
>
> The UMA mbuf allocator is certainly not perfect but rather good.
> It has a per-CPU cache of mbuf's that are very fast to allocate
> from. Once it has used them it needs to refill from the global
> pool which may happen from time to time and show up in the averages.

Indeed, I was pleased to see no difference between 1 and 4 threads.
This also suggests that the global pool is accessed very seldom,
and only for short times; otherwise you'd see the effect with 4 threads.

What might be moderately expensive are the critical_enter()/critical_exit()
calls around individual allocations.

The allocation happens while the code already holds an exclusive
lock on so->snd_buf, so a pool of fresh buffers could be attached there.

But the other consideration is that one could defer the mbuf allocation
to a later time, when the packet is actually built (or in any case
right before the thread returns).

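The per-socket pool idea above can be sketched in userspace C roughly as
follows. This is only an illustration, not kernel code: struct buf,
struct sock_pool, POOL_MAX, pool_get() and pool_put() are all hypothetical
names, malloc() stands in for the zone allocator, and the caller is assumed
to already hold the socket buffer's exclusive lock, so the pool itself needs
no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an mbuf; NOT the kernel's struct mbuf. */
struct buf {
	struct buf *next;
	char data[256];
};

#define POOL_MAX 8	/* arbitrary cap on cached buffers per socket */

/*
 * Hypothetical per-socket cache. Assumption: it is only touched while
 * the caller holds the socket buffer's exclusive lock.
 */
struct sock_pool {
	struct buf *free_list;
	int nfree;
	unsigned long slow_allocs;	/* fallbacks to the general allocator */
};

/* Take a buffer from the local pool, else fall back to malloc(). */
static struct buf *
pool_get(struct sock_pool *p)
{
	struct buf *b = p->free_list;

	if (b != NULL) {
		/* Fast path: no trip to the general allocator. */
		p->free_list = b->next;
		p->nfree--;
	} else {
		/* Slow path: malloc() plays the role of the zone allocator. */
		b = malloc(sizeof(*b));
		if (b == NULL)
			return (NULL);
		p->slow_allocs++;
	}
	b->next = NULL;
	return (b);
}

/* Return a buffer to the pool, or free it if the pool is already full. */
static void
pool_put(struct sock_pool *p, struct buf *b)
{
	assert(b != NULL);
	if (p->nfree < POOL_MAX) {
		b->next = p->free_list;
		p->free_list = b;
		p->nfree++;
	} else {
		free(b);
	}
}
```

A send path would call pool_get() under the lock it already holds and
pool_put() on release; only when the pool runs dry does it pay for the
general allocator. Whether this actually beats the per-CPU UMA cache would
have to be measured.
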
What I envision (and this would fit nicely with netmap) is the following:
- have a (possibly read-only) template for the headers (MAC+IP+UDP)
  attached to the socket, built on demand, and cached and managed
  with invalidation rules similar to those used by fastforward;
- possibly extend the pru_send interface so one can pass down the uio
  instead of the mbuf;
- make an opportunistic buffer allocation in some place downstream,
  where the code already has an x-lock on some resource (could be
  the snd_buf, the interface, ...) so the allocation comes for free.

> >- another big bottleneck is the route lookup in ip_output()
> >  (between entries 51 and 56). Not only it eats another
> >  100ns+ on an empty routing table, but it also
> >  causes huge contentions when multiple cores
> >  are involved.
>
> This is indeed a big problem. I'm working (rough edges remain) on
> changing the routing table locking to an rmlock (read-mostly) which

I was wondering: is there a way (and/or any advantage) to use the
fastforward code to look up the route for locally sourced packets?

cheers
luigi
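
As an illustration of the cached header template mentioned above, here is a
rough userspace sketch in C. Everything in it is an assumption made for the
example: struct hdr_template, route_gen, template_build() and
template_apply() are hypothetical names, and a global generation counter is
used as one simple invalidation scheme; it is not necessarily the rule
fastforward applies. Checksums and real addresses are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HDR_LEN 42	/* 14 (Ethernet) + 20 (IPv4, no options) + 8 (UDP) */

/* Hypothetical counter, bumped whenever routing or ARP state changes. */
static uint32_t route_gen = 1;

/* Hypothetical per-socket cache of prebuilt MAC+IP+UDP headers. */
struct hdr_template {
	uint32_t gen;		/* route_gen when built; 0 = never built */
	unsigned long rebuilds;	/* how often the slow path ran */
	uint8_t bytes[HDR_LEN];
};

/* Slow path: stands in for route lookup, ARP resolution, header fill. */
static void
template_build(struct hdr_template *t)
{
	memset(t->bytes, 0, HDR_LEN);
	t->bytes[12] = 0x08;	/* EtherType 0x0800 = IPv4 */
	t->bytes[13] = 0x00;
	t->bytes[14] = 0x45;	/* IPv4, header length 5 words */
	t->bytes[14 + 9] = 17;	/* IP protocol = UDP */
	t->gen = route_gen;
	t->rebuilds++;
}

/*
 * Copy the cached headers into pkt, rebuilding the template only if it
 * is stale, then patch the per-packet length fields (big-endian).
 * Returns the number of header bytes written.
 */
static size_t
template_apply(struct hdr_template *t, uint8_t *pkt, size_t pktlen,
    uint16_t payload_len)
{
	uint16_t iplen = (uint16_t)(20 + 8 + payload_len);
	uint16_t udplen = (uint16_t)(8 + payload_len);

	assert(pktlen >= HDR_LEN);
	if (t->gen != route_gen)
		template_build(t);
	memcpy(pkt, t->bytes, HDR_LEN);
	pkt[16] = (uint8_t)(iplen >> 8);	/* IP total length */
	pkt[17] = (uint8_t)(iplen & 0xff);
	pkt[38] = (uint8_t)(udplen >> 8);	/* UDP length */
	pkt[39] = (uint8_t)(udplen & 0xff);
	return (HDR_LEN);
}
```

In the common case the per-packet cost is one comparison and a 42-byte
memcpy(); the expensive lookup runs only after route_gen changes.
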