From owner-freebsd-net@FreeBSD.ORG Thu Apr 19 20:11:30 2012
Date: Thu, 19 Apr 2012 22:05:37 +0200
From: Andre Oppermann <andre@freebsd.org>
To: Luigi Rizzo
Cc: current@freebsd.org, net@freebsd.org
Subject: Re: Some performance measurements on the FreeBSD network stack
Message-ID: <4F907011.9080602@freebsd.org>
In-Reply-To: <20120419133018.GA91364@onelab2.iet.unipi.it>
References: <20120419133018.GA91364@onelab2.iet.unipi.it>

On 19.04.2012 15:30, Luigi Rizzo wrote:
> I have been running some performance tests on UDP sockets,
> using the netsend program in tools/tools/netrate/netsend
> and instrumenting the source code and the kernel to return at
> various points of the path. Here are some results which
> I hope you find interesting.

Jumping over the very interesting analysis...

> - the next expensive operation, consuming another 100ns,
> is the mbuf allocation in m_uiotombuf(). Nevertheless, the allocator
> seems to scale decently, at least with 4 cores. The copyin() is
> relatively inexpensive (not reported in the data below, but
> disabling it saves only 15-20ns for a short packet).
>
> I have not followed the details, but the allocator calls the zone
> allocator, there is at least one critical_enter()/critical_exit()
> pair, and the highly modular architecture invokes long chains of
> indirect function calls on both allocation and release.
>
> It might make sense to keep a small pool of mbufs attached to the
> socket buffer instead of going to the zone allocator.
> Or defer the actual encapsulation to the
> (*so->so_proto->pr_usrreqs->pru_send)() call, which is made inline anyway.

The UMA mbuf allocator is certainly not perfect, but it is rather good.
It has a per-CPU cache of mbufs that are very fast to allocate from.
Once that cache is exhausted it has to refill from the global pool,
which happens from time to time and shows up in the averages.

> - another big bottleneck is the route lookup in ip_output()
> (between entries 51 and 56). Not only does it eat another
> 100ns+ on an empty routing table, it also causes huge
> contention when multiple cores are involved.

This is indeed a big problem. I'm working (rough edges remain) on
changing the routing table locking to an rmlock (read-mostly lock),
which doesn't produce any lock contention or cache pollution for
readers. Also, skipping the per-route lock while the table read-lock
is held should help some more.
All in all this should give a massive gain in high-pps situations, at
the expense of costlier routing table changes. However, such changes
are rare to essentially nonexistent with a single default route. After
that the ARP table will get the same treatment, and the lock contention
points in the lower stack should be gone for good.

--
Andre
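PS: For illustration only, here is a rough sketch of the rmlock(9)
pattern I mean; it is not the actual patch, and rt_table_lookup_locked()
and rt_table_change_locked() are made-up placeholders standing in for
the real radix-trie operations:

/*
 * Sketch only, not the actual patch.  Per-packet route lookups take
 * the read side of an rmlock(9), which touches only a per-CPU tracker
 * and never dirties a shared lock cache line, so lookups from many
 * cores scale.  The rare routing table changes pay for that on the
 * write side.  The *_locked() functions below are placeholders.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/socket.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

#include <net/route.h>

static struct rmlock rt_table_rmlock;

static struct rtentry *
rt_table_lookup_locked(struct sockaddr *dst)
{

	/* Placeholder for the real radix-trie walk. */
	return (NULL);
}

static void
rt_table_change_locked(struct rtentry *rt)
{

	/* Placeholder for the real insert/delete. */
}

static void
rt_table_locking_init(void)
{

	rm_init(&rt_table_rmlock, "rt_table");
}

/*
 * Hot path, run for every packet in ip_output().  Readers do not
 * contend with each other at all.
 */
static struct rtentry *
rt_table_lookup(struct sockaddr *dst)
{
	struct rm_priotracker tracker;
	struct rtentry *rt;

	rm_rlock(&rt_table_rmlock, &tracker);
	rt = rt_table_lookup_locked(dst);
	rm_runlock(&rt_table_rmlock, &tracker);

	return (rt);
}

/*
 * Cold path.  Table changes are seldom to never with a single default
 * route, so making them costlier is an acceptable trade-off.
 */
static void
rt_table_change(struct rtentry *rt)
{

	rm_wlock(&rt_table_rmlock);
	rt_table_change_locked(rt);
	rm_wunlock(&rt_table_rmlock);
}

The point of the pattern is that the read side only touches per-CPU
state, while the write side has to synchronize with all CPUs, which is
exactly where we want the cost with a mostly-static table.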