From: Kris Kennaway <kris@obsecurity.org>
To: net@FreeBSD.org
Date: Wed, 14 Mar 2007 21:15:11 -0400
Message-ID: <20070315011511.GA55003@xor.obsecurity.org>
Subject: Scalability problem from route refcounting

I have recently started looking at database performance over gigabit
ethernet, and there seems to be a bottleneck coming from the way route
reference counting is implemented.  On an 8-core system it looks like
we spend a lot of time waiting for the rtentry mutex:

   max     total  wait_total    count  avg  wait_avg  cnt_hold  cnt_lock  name
[...]
   408    950496     1135994   301418    3         3     24876     55936  net/if_ethersubr.c:397 (sleep mutex:bge1)
   974    968617     1515169   253772    3         5     14741     60581  dev/bge/if_bge.c:2949 (sleep mutex:bge1)
  2415  18255976     1607511   253841   71         6    125174      3131  netinet/tcp_input.c:770 (sleep mutex:inp)
   233   1850252     2080506   141817   13        14         0    126897  netinet/tcp_usrreq.c:756 (sleep mutex:inp)
   384   6895050     2737492   299002   23         9     92100     73942  dev/bge/if_bge.c:3506 (sleep mutex:bge1)
   626   5342286     2760193   301477   17         9     47616     54158  net/route.c:147 (sleep mutex:radix node head)
   326   3562050     3381510   301477   11        11    133968    110104  net/route.c:197 (sleep mutex:rtentry)
   146    947173     5173813   301477    3        17     44578    120961  net/route.c:1290 (sleep mutex:rtentry)
   146    953718     5501119   301476    3        18     63285    121819  netinet/ip_output.c:610 (sleep mutex:rtentry)
    50   4530645     7885304  1423098    3         5    642391    788230  kern/subr_turnstile.c:489 (spin mutex:turnstile chain)

That is, during a 30-second sample we spend a total of more than 14
seconds (across all CPUs) waiting to acquire the rtentry mutex.

This appears to be because, among other things, we increment and then
decrement the route refcount for each packet we send, and each of those
operations requires acquiring the rtentry mutex for that route before
adjusting the refcount.  So multiplexing traffic for lots of
connections over a single route is being partly rate-limited by those
mutex operations.

This is not the end of the story, though: the bge driver is a serious
bottleneck on its own.  I nulled out the route locking, since it is not
relevant in my environment (at least for the purposes of this test),
and that exposed bge as the next problem -- but other drivers may not
be so bad.
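To make the per-packet cost concrete, here is a minimal sketch of the
pattern involved.  The struct and macros are simplified stand-ins
modeled on net/route.h (using a pthread mutex so it compiles in
userland), and output_one_packet() is a hypothetical caller, not the
actual ip_output() path:

/*
 * Simplified sketch (not the literal FreeBSD code) of the per-packet
 * refcount dance on a cached route.
 */
#include <pthread.h>

struct rtentry {
	pthread_mutex_t rt_mtx;    /* stands in for the rtentry mutex */
	long            rt_refcnt; /* reference count, protected by rt_mtx */
};

#define RT_LOCK(rt)    pthread_mutex_lock(&(rt)->rt_mtx)
#define RT_UNLOCK(rt)  pthread_mutex_unlock(&(rt)->rt_mtx)
#define RT_ADDREF(rt)  (rt)->rt_refcnt++
#define RT_REMREF(rt)  (rt)->rt_refcnt--

/*
 * Every packet pays two lock/unlock round trips on the same rtentry
 * mutex: one to take a reference before transmit, one to drop it
 * afterwards.  With many connections multiplexed over one route, all
 * CPUs contend on this single lock.
 */
static void
output_one_packet(struct rtentry *rt)
{
	RT_LOCK(rt);		/* contended across all senders */
	RT_ADDREF(rt);		/* hold the route while we transmit */
	RT_UNLOCK(rt);

	/* ... hand the packet to the driver ... */

	RT_LOCK(rt);		/* contended again on the way out */
	RT_REMREF(rt);
	RT_UNLOCK(rt);
}

Since every sender hits the same rtentry, the two lock acquisitions per
packet serialize otherwise-independent connections, which matches the
wait times on net/route.c and ip_output.c in the profile above.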
Kris