From owner-freebsd-net@FreeBSD.ORG Thu Mar 15 16:23:20 2007
Message-ID: <45F972F4.8070106@freebsd.org>
Date: Thu, 15 Mar 2007 17:23:16 +0100
From: Andre Oppermann <andre@freebsd.org>
To: Kris Kennaway
Cc: qingli@freebsd.org, net@FreeBSD.org
In-Reply-To: <20070315011511.GA55003@xor.obsecurity.org>
Subject: Re: Scalability problem from route refcounting

Kris Kennaway wrote:
> I have recently started looking at database performance over gigabit
> ethernet, and there seems to be a bottleneck coming from the way route
> reference counting is implemented.  On an 8-core system it looks like
> we spend a lot of time waiting for the rtentry mutex:
>
>  max     total  wait_total    count  avg  wait_avg  cnt_hold  cnt_lock  name
> [...]
>  408    950496     1135994   301418    3         3     24876     55936  net/if_ethersubr.c:397 (sleep mutex:bge1)
>  974    968617     1515169   253772    3         5     14741     60581  dev/bge/if_bge.c:2949 (sleep mutex:bge1)
> 2415  18255976     1607511   253841   71         6    125174      3131  netinet/tcp_input.c:770 (sleep mutex:inp)
>  233   1850252     2080506   141817   13        14         0    126897  netinet/tcp_usrreq.c:756 (sleep mutex:inp)
>  384   6895050     2737492   299002   23         9     92100     73942  dev/bge/if_bge.c:3506 (sleep mutex:bge1)
>  626   5342286     2760193   301477   17         9     47616     54158  net/route.c:147 (sleep mutex:radix node head)
>  326   3562050     3381510   301477   11        11    133968    110104  net/route.c:197 (sleep mutex:rtentry)
>  146    947173     5173813   301477    3        17     44578    120961  net/route.c:1290 (sleep mutex:rtentry)
>  146    953718     5501119   301476    3        18     63285    121819  netinet/ip_output.c:610 (sleep mutex:rtentry)
>   50   4530645     7885304  1423098    3         5    642391    788230  kern/subr_turnstile.c:489 (spin mutex:turnstile chain)
>
> i.e. during a 30 second sample we spend a total of >14 seconds (across
> all CPUs) waiting to acquire the rtentry mutex.
>
> This appears to be because (among other things) we increment and then
> decrement the route refcount for each packet we send, and each adjustment
> requires acquiring the rtentry mutex for that route.  So multiplexing
> traffic for lots of connections over a single route is being partly
> rate-limited by those mutex operations.

The rtentry locking itself actually isn't that much of a problem.  rtalloc1()
in net/route.c only gets the blame because it acquires the lock for the
routing table entry and returns the entry locked; it is the callers' job to
unlock it again as soon as possible.
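To illustrate that contract, here is a minimal sketch of what a well-behaved
rtalloc1() caller looks like.  example_output() is a hypothetical helper, not
code from the tree, and it assumes the RT_UNLOCK()/RTFREE() macros from
net/route.h:

/*
 * Sketch only, not actual FreeBSD source: a hypothetical caller of
 * rtalloc1() that follows the intended contract.  rtalloc1() hands back
 * the rtentry locked and with its reference count bumped; the caller
 * drops the RT_LOCK as soon as it has copied out what it needs and
 * returns the reference with RTFREE() once the packet is gone.
 */
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/socket.h>
#include <net/route.h>

static int
example_output(struct sockaddr *dst)	/* hypothetical helper */
{
	struct rtentry *rt;
	struct sockaddr *gw;

	rt = rtalloc1(dst, 1, 0UL);	/* returns rt locked and referenced */
	if (rt == NULL)
		return (EHOSTUNREACH);

	gw = (rt->rt_flags & RTF_GATEWAY) ? rt->rt_gateway : dst;
	RT_UNLOCK(rt);			/* unlock early, keep only the reference */

	/*
	 * ... L2 resolution and handing the packet to the driver would go
	 * here, without the rtentry mutex held ...
	 */
	(void)gw;

	RTFREE(rt);	/* drop the reference (takes the lock again briefly) */
	return (0);
}

Note that even this pattern still takes the rtentry mutex twice per packet
(once inside rtalloc1() and once inside RTFREE()), which is the per-packet
refcount cost visible in the profile above; keeping the lock held across ARP
resolution, as described below, only makes it worse.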
Here arpresolve() in netinet/if_ether.c is the offending function: it keeps
the lock over an extended period, which causes the contention and the long
wait times.  ARP is a horrible mess and I don't have a quick fix for this.
Work to replace the current ARP code with something more adequate has been
in progress for quite some time, but it isn't finished yet.

> This is not the end of the story though; the bge driver is a serious
> bottleneck on its own (e.g. I nulled out the route locking since it is
> not relevant in my environment, at least for the purposes of this
> test, and that exposed bge as the next problem -- but other drivers
> may not be so bad).

--
Andre