From: Kris Kennaway <kris@obsecurity.org>
To: net@FreeBSD.org
Date: Wed, 14 Mar 2007 21:15:11 -0400
Message-ID: <20070315011511.GA55003@xor.obsecurity.org>
Subject: Scalability problem from route refcounting

I have recently started looking at database performance over gigabit
ethernet, and there seems to be a bottleneck coming from the way route
reference counting is implemented.  On an 8-core system it looks like
we spend a lot of time waiting for the rtentry mutex:

   max     total  wait_total    count  avg  wait_avg  cnt_hold  cnt_lock  name
[...]
   408    950496     1135994   301418    3         3     24876     55936  net/if_ethersubr.c:397 (sleep mutex:bge1)
   974    968617     1515169   253772    3         5     14741     60581  dev/bge/if_bge.c:2949 (sleep mutex:bge1)
  2415  18255976     1607511   253841   71         6    125174      3131  netinet/tcp_input.c:770 (sleep mutex:inp)
   233   1850252     2080506   141817   13        14         0    126897  netinet/tcp_usrreq.c:756 (sleep mutex:inp)
   384   6895050     2737492   299002   23         9     92100     73942  dev/bge/if_bge.c:3506 (sleep mutex:bge1)
   626   5342286     2760193   301477   17         9     47616     54158  net/route.c:147 (sleep mutex:radix node head)
   326   3562050     3381510   301477   11        11    133968    110104  net/route.c:197 (sleep mutex:rtentry)
   146    947173     5173813   301477    3        17     44578    120961  net/route.c:1290 (sleep mutex:rtentry)
   146    953718     5501119   301476    3        18     63285    121819  netinet/ip_output.c:610 (sleep mutex:rtentry)
    50   4530645     7885304  1423098    3         5    642391    788230  kern/subr_turnstile.c:489 (spin mutex:turnstile chain)

That is, during a 30-second sample we spend a total of more than 14
seconds (across all CPUs) waiting to acquire the rtentry mutex.

This appears to be because, among other things, we increment and then
decrement the route refcount for each packet we send, and each of those
operations requires acquiring the rtentry mutex for that route before
adjusting the refcount.  So multiplexing traffic for lots of
connections over a single route is being partly rate-limited by those
mutex operations.

This is not the end of the story, though: the bge driver is a serious
bottleneck on its own.  I nulled out the route locking, since it is not
relevant in my environment (at least for the purposes of this test),
and that exposed bge as the next problem -- but other drivers may not
be so bad.
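To make the per-packet cost concrete, here is a minimal sketch of the
pattern involved.  The struct and macros are simplified stand-ins
modeled on net/route.h (using a pthread mutex so it compiles in
userland), and output_one_packet() is a hypothetical caller, not the
actual ip_output() path:

/*
 * Simplified sketch (not the literal FreeBSD code) of the per-packet
 * refcount dance on a cached route.
 */
#include <pthread.h>

struct rtentry {
	pthread_mutex_t rt_mtx;    /* stands in for the rtentry mutex */
	long            rt_refcnt; /* reference count, protected by rt_mtx */
};

#define RT_LOCK(rt)    pthread_mutex_lock(&(rt)->rt_mtx)
#define RT_UNLOCK(rt)  pthread_mutex_unlock(&(rt)->rt_mtx)
#define RT_ADDREF(rt)  (rt)->rt_refcnt++
#define RT_REMREF(rt)  (rt)->rt_refcnt--

/*
 * Every packet pays two lock/unlock round trips on the same rtentry
 * mutex: one to take a reference before transmit, one to drop it
 * afterwards.  With many connections multiplexed over one route, all
 * CPUs contend on this single lock.
 */
static void
output_one_packet(struct rtentry *rt)
{
	RT_LOCK(rt);		/* contended across all senders */
	RT_ADDREF(rt);		/* hold the route while we transmit */
	RT_UNLOCK(rt);

	/* ... hand the packet to the driver ... */

	RT_LOCK(rt);		/* contended again on the way out */
	RT_REMREF(rt);
	RT_UNLOCK(rt);
}

Since every sender hits the same rtentry, the two lock acquisitions per
packet serialize otherwise-independent connections, which matches the
wait times on net/route.c and ip_output.c in the profile above.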
Kris