Date: Sat, 6 May 2006 18:19:08 -0400
From: Kris Kennaway <kris@obsecurity.org>
To: Robert Watson
Cc: performance@FreeBSD.org, current@FreeBSD.org
Subject: Re: Fine-grained locking for POSIX local sockets (UNIX domain sockets)
Message-ID: <20060506221908.GB51268@xor.obsecurity.org>
In-Reply-To: <20060506150622.C17611@fledge.watson.org>

On Sat, May 06, 2006 at 03:16:48PM +0100, Robert Watson wrote:
>
> Dear all,
>
> Attached, please find a patch implementing more fine-grained locking
> for the POSIX local socket subsystem (UNIX domain socket subsystem).

Dear Sir,

Per your request, please find attached the results of my measurements
using super-smack on a 12-CPU E4500.

super-smack queries/second with n worker threads:

norwatson = without your patch (but with some other local locking patches)
rwatson   = also with your patch

x norwatson-4
+ rwatson-4
[ministat distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x  10       3067.92       3098.05      3086.945      3084.402     8.8815574
+  10       3245.06        3287.8       3270.52      3270.475     13.241953
Difference at 95.0% confidence
        186.073 +/- 10.5935
        6.03271% +/- 0.343455%
        (Student's t, pooled s = 11.2746)

x norwatson-6
+ rwatson-6
[ministat distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x  10       3641.11       3693.89      3679.735      3677.083     14.648967
+  10       3672.23       3896.32      3869.415      3845.071     66.826543
Difference at 95.0% confidence
        167.988 +/- 45.4534
        4.56851% +/- 1.23613%
        (Student's t, pooled s = 48.3755)

i.e. in both cases there is a clear net gain in throughput with your
patch.

Without your patch, 6 clients is the optimum client load on this
12-CPU machine.  At higher loads performance drops, even though not
all CPUs are formally saturated.  This is due to rapidly escalating
lock contention (see below).
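(For anyone reading along who has not looked at the patch itself: the
general shape of the change, a single subsystem-wide lock versus a
per-unpcb lock plus a global lock taken only for linkage operations,
is roughly what the user-space sketch below illustrates.  This is
purely an illustration of the pattern, not the patch; every name in
it is made up.)

/*
 * Illustrative sketch only (user-space, pthreads): coarse subsystem-wide
 * locking versus per-object locking with a global lock reserved for
 * linkage changes.  All names (demo_pcb, g_subsystem_lock, ...) are
 * hypothetical.
 */
#include <pthread.h>
#include <stddef.h>

/* Coarse: one lock for the whole subsystem; every operation on any
 * socket serializes here (roughly the pre-patch situation). */
static pthread_mutex_t g_subsystem_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fine-grained: a global lock only for connect/disconnect and global
 * lists, plus one lock per control block for data-path work. */
static pthread_mutex_t g_linkage_lock = PTHREAD_MUTEX_INITIALIZER;

struct demo_pcb {
	pthread_mutex_t pcb_lock;	/* protects this PCB's own state */
	struct demo_pcb *peer;		/* protected by g_linkage_lock */
	size_t bytes_queued;		/* protected by pcb_lock */
};

/* Coarse: all senders on all sockets contend on one mutex. */
static void
send_coarse(struct demo_pcb *pcb, size_t len)
{
	pthread_mutex_lock(&g_subsystem_lock);
	pcb->bytes_queued += len;
	pthread_mutex_unlock(&g_subsystem_lock);
}

/* Fine-grained: senders on different sockets proceed in parallel. */
static void
send_fine(struct demo_pcb *pcb, size_t len)
{
	pthread_mutex_lock(&pcb->pcb_lock);
	pcb->bytes_queued += len;
	pthread_mutex_unlock(&pcb->pcb_lock);
}

/* Linkage changes still serialize on the global lock. */
static void
connect_fine(struct demo_pcb *a, struct demo_pcb *b)
{
	pthread_mutex_lock(&g_linkage_lock);
	a->peer = b;
	b->peer = a;
	pthread_mutex_unlock(&g_linkage_lock);
}

int
main(void)
{
	struct demo_pcb a = { .pcb_lock = PTHREAD_MUTEX_INITIALIZER };
	struct demo_pcb b = { .pcb_lock = PTHREAD_MUTEX_INITIALIZER };

	connect_fine(&a, &b);
	send_coarse(&a, 100);	/* serializes against every other socket */
	send_fine(&a, 100);	/* only a's own lock is taken */
	send_fine(&b, 100);	/* could run concurrently with the send on a */
	return (0);
}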
x norwatson-8
+ rwatson-8
[ministat distribution plot omitted]
    N           Min           Max        Median           Avg        Stddev
x  10       2601.46       2700.26       2650.52      2653.441     30.758034
+  10       2240.86       2516.87      2496.085      2468.468     81.868576
Difference at 95.0% confidence
        -184.973 +/- 58.1052
        -6.97106% +/- 2.1898%
        (Student's t, pooled s = 61.8406)

We see the drop in performance in both cases indicating that we are
in the "overloaded" regime.  The fact that your patch seems to give
worse performance is puzzling at first sight.  Running mutex profiling
(and only keeping the unp mutex entries and the 10 most contended for
clarity) shows the following:

norwatson, 8 clients:

debug.mutex.prof.stats:
   max        total   count  avg  cnt_hold  cnt_lock name
     5           40       9    4         0         3 kern/uipc_usrreq.c:170 (unp)
     8            8       1    8         0         0 vm/uma_core.c:2101 (unpcb)
    13          283      52    5         0         0 vm/uma_core.c:890 (unpcb)
    14         1075     200    5         0         0 vm/uma_core.c:1885 (unpcb)
     4           52      18    2         4         6 kern/uipc_usrreq.c:577 (unp)
     5           39       9    4         4         2 kern/uipc_usrreq.c:534 (unp)
     5           35      11    3         6         6 kern/uipc_usrreq.c:974 (unp)
     5           45      11    4         7         4 kern/uipc_usrreq.c:210 (unp)
   171         1164       9  129         7         2 kern/uipc_usrreq.c:917 (unp)
    14           78      20    3        11   2872481 kern/uipc_usrreq.c:709 (unp)
    70          156      11   14        13         4 kern/uipc_usrreq.c:895 (unp)
    43          581      20   29        24         6 kern/uipc_usrreq.c:239 (unp)
    44          429      18   23        26         8 kern/uipc_usrreq.c:518 (unp)
    55          491      12   40        30        10 kern/uipc_usrreq.c:251 (unp)
...
   449     20000519  320038   62     15158         0 kern/uipc_usrreq.c:431 (so_rcv)
   459     86616085 2880079   30     15699      4944 kern/uipc_usrreq.c:319 (so_snd)
   146      2273360  640315    3     27918     29789 kern/kern_sig.c:1002 (process lock)
   387      3325481  640099    5     38143     47670 kern/kern_descrip.c:420 (filedesc structure)
   150      1881990  640155    2     64111     49033 kern/kern_descrip.c:368 (filedesc structure)
   496     13792853 3685885    3    101692    132480 kern/kern_descrip.c:1988 (filedesc structure)
   207      4061793  551604    7    115427    118242 kern/kern_synch.c:220 (process lock)
   391     10332282 3685885    2    194387    129547 kern/kern_descrip.c:1967 (filedesc structure)
   465     25504709  320042   79   1632192    294498 kern/uipc_usrreq.c:364 (unp)
   470    124263922 2880084   43  13222757   2685853 kern/uipc_usrreq.c:309 (unp)

i.e. there is indeed heavy contention on the unp lock (column 5
counts the number of times we tried to acquire it and failed because
someone else had the lock) - in fact about 5 times as many contentions
as successful acquisitions.
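(Aside, for anyone who wants to collect the same data: the tables here
are the text report exported by the debug.mutex.prof.stats sysctl on a
kernel built with mutex profiling.  A minimal user-space reader along
the following lines should dump the same report, assuming the node
behaves like an ordinary string-valued sysctl; sysctl(8) works just as
well, of course.)

/*
 * Minimal sketch: dump the kernel's mutex-profiling report.  Assumes a
 * kernel with mutex profiling compiled in and that debug.mutex.prof.stats
 * is readable as a plain string sysctl.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	size_t len = 0;
	char *buf;

	/* First call sizes the buffer, second call fills it. */
	if (sysctlbyname("debug.mutex.prof.stats", NULL, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	if ((buf = malloc(len)) == NULL) {
		perror("malloc");
		return (1);
	}
	if (sysctlbyname("debug.mutex.prof.stats", buf, &len, NULL, 0) == -1) {
		perror("sysctlbyname");
		free(buf);
		return (1);
	}
	fwrite(buf, 1, len, stdout);
	free(buf);
	return (0);
}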
With your patch and the same load:

   max        total   count  avg  cnt_hold  cnt_lock name
     3           20       9    2         0         0 kern/uipc_usrreq.c:1028 (unp_mtx)
     3           22       9    2         0         0 kern/uipc_usrreq.c:1161 (unp_mtx)
     5           29       9    3         0         2 kern/uipc_usrreq.c:1065 (unp_global_mtx)
     5           53      18    2         0     76488 kern/uipc_usrreq.c:287 (unp_global_mtx)
     6           33       9    3         0         0 kern/uipc_usrreq.c:236 (unp_mtx)
     6           37       9    4         0         0 kern/uipc_usrreq.c:819 (unp_mtx)
     7            7       1    7         0         0 vm/uma_core.c:2101 (unpcb)
     8           49       9    5         0         0 kern/uipc_usrreq.c:1101 (unp_mtx)
    11          136      18    7         0         1 kern/uipc_usrreq.c:458 (unp_global_mtx)
    32          143       9   15         0         1 kern/uipc_usrreq.c:1160 (unp_global_mtx)
    44          472      18   26         0         0 kern/uipc_usrreq.c:801 (unp_mtx)
   123          310       9   34         0         0 kern/uipc_usrreq.c:1100 (unp_mtx)
   147          452       9   50         0         0 kern/uipc_usrreq.c:1099 (unp_mtx)
   172          748       9   83         0         0 kern/uipc_usrreq.c:473 (unp_mtx)
   337         1592       9  176         0         0 kern/uipc_usrreq.c:1147 (unp_mtx)
   350         1790       9  198         0         0 kern/uipc_usrreq.c:1146 (unp_mtx)
   780     39405928  320038  123         0         0 kern/uipc_usrreq.c:618 (unp_mtx)
    18          140       9   15         1         0 kern/uipc_usrreq.c:235 (unp_global_mtx)
    70          717      18   39         1         3 kern/uipc_usrreq.c:800 (unp_global_mtx)
   528         2444       9  271         1         1 kern/uipc_usrreq.c:1089 (unp_global_mtx)
   158          616       9   68         2         2 kern/uipc_usrreq.c:476 (unp_mtx)
   794    175382857 2880084   60         2      7686 kern/uipc_usrreq.c:574 (unp_mtx)
     4           25       9    2         3         2 kern/uipc_usrreq.c:422 (unp_global_mtx)
   186          874       9   97         3         3 kern/uipc_usrreq.c:472 (unp_global_mtx)
   768     33783759  320038  105      7442         0 kern/uipc_usrreq.c:696 (unp_mtx)
...
   465       913127  320045    2     43130     35046 kern/uipc_socket.c:1101 (so_snd)
   483      2453927  628737    3     44768     46177 kern/kern_sig.c:1002 (process lock)
   767    124298544 2880082   43     70037     59994 kern/uipc_usrreq.c:581 (so_snd)
   794     45176699  320038  141     83252     72140 kern/uipc_usrreq.c:617 (unp_global_mtx)
   549      9858281 3200210    3    579269    712643 kern/kern_resource.c:1172 (sleep mtxpool)
   554     17122245  631715   27    641888    268243 kern/kern_descrip.c:420 (filedesc structure)
   388      3009912  631753    4    653540    260590 kern/kern_descrip.c:368 (filedesc structure)
   642     49626755 3681446   13   1642954    682669 kern/kern_descrip.c:1988 (filedesc structure)
   530     13802687 3681446    3   1663244    616899 kern/kern_descrip.c:1967 (filedesc structure)
   477     23472709 2810986    8   5671248   1900047 kern/kern_synch.c:220 (process lock)

The top 10 heavily contended mutexes are very different (but note that
the number of mutex acquisitions, column 3, is about the same).  There
is not much contention on unp_global_mtx any longer, but there is a
lot more on some of the other mutexes, especially the process lock via
msleep().  Off-hand I don't know the cause of this bottleneck (note:
libthr is used as the threading library; libpthread is not ported to
sparc64).

Also, a lot of the contention that used to be on the unp lock seems to
have fallen through onto *two* of the filedesc locks (each with about
1.6 million contentions).  This may also help to explain the
performance drop.
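(One convenient way to compare rows between the two runs above is to
reduce each to a contention rate, i.e. column 5 divided by column 3.
A throwaway helper along these lines is enough; the hard-coded row
below is just the worst unp entry from the pre-patch profile, used
here for illustration.)

/*
 * Sketch: turn one row of the mutex-profiling report into a contention
 * rate (failed acquisition attempts, column 5, per successful
 * acquisition, column 3).  The embedded row is sample data from the
 * pre-patch profile above.
 */
#include <stdio.h>

int
main(void)
{
	const char *row =
	    "470 124263922 2880084 43 13222757 2685853 kern/uipc_usrreq.c:309 (unp)";
	unsigned long max, total, count, avg, cnt_hold, cnt_lock;
	char name[128];

	/* %127s keeps only the file:line token of the name column. */
	if (sscanf(row, "%lu %lu %lu %lu %lu %lu %127s",
	    &max, &total, &count, &avg, &cnt_hold, &cnt_lock, name) == 7 &&
	    count != 0)
		printf("%s: %.1f contentions per successful acquisition\n",
		    name, (double)cnt_hold / (double)count);
	return (0);
}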
With only 6 clients, the contention is about an order of magnitude
less on most of the top 10, even though the number of mutex calls is
only about 25% fewer than with 8 clients:

   max        total   count  avg  cnt_hold  cnt_lock name
   195       715786  240037    2     47462     48821 kern/uipc_socket.c:1101 (so_snd)
   524      3456427  480079    7     50257     53368 kern/kern_descrip.c:420 (filedesc structure)
   647     21650810  240030   90     50609         2 kern/uipc_usrreq.c:705 (so_rcv)
   710     37962453  240031  158     63743     57814 kern/uipc_usrreq.c:617 (unp_global_mtx)
   345      1624193  488866    3     80349     62950 kern/kern_descrip.c:368 (filedesc structure)
   595    108074003 2160067   50     83327     63451 kern/uipc_usrreq.c:581 (so_snd)
   453      3706420  519735    7    119947    181434 kern/kern_synch.c:220 (process lock)
   469     13085667 2800771    4    122344    132478 kern/kern_descrip.c:1988 (filedesc structure)
   320      8814736 2800771    3    200492    148967 kern/kern_descrip.c:1967 (filedesc structure)
   440      7591194 2400171    3    544692    507583 kern/kern_resource.c:1172 (sleep mtxpool)

In summary, this is a good test case since it shows both the benefits
of your patch and the areas of remaining concern.

Yours sincerely,

Kristian D. Kennaway