From owner-freebsd-net@FreeBSD.ORG Thu May 23 16:45:51 2013
From: "Bentkofsky, Michael" <MBentkofsky@verisign.com>
To: Jeff Roberson <jroberson@jroberson.net>, John Baldwin <jhb@freebsd.org>
Cc: freebsd-net@freebsd.org, jeff@freebsd.org, rwatson@freebsd.org,
 "Charbon, Julien"
Subject: RE: Followup from Verisign after last week's developer summit
Date: Thu, 23 May 2013 16:44:00 +0000

I am adding freebsd-net to this thread and will re-summarize to get
additional input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@: we are seeing significant
contention on the V_tcpinfo lock under moderately high connection
establishment and teardown rates (around 45-50k connections per second).
Our profiling suggests the contention on V_tcpinfo effectively
single-threads all TCP connections. Similar testing on Linux with
equivalent hardware shows no such contention and reaches a much higher
connection establishment rate. We can attach profiling and test details
if anyone would like.

JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we have
  confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup (a rough sketch of
  this idea follows below).

Jeff recommends:
- Changing the locking strategy so the hash lookup can be pushed
  further down into the stack.
- Making the [list] iterators more complex, like those now used in the
  hash lookup (see the P.S. at the end of this mail for a sketch).

We are starting down these paths to try to break the locking down and
will post some initial patch ideas soon. Meanwhile, any additional
suggestions are certainly welcome.
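To make the per-bucket idea concrete for the list, the rough shape we
have been experimenting with looks like the following. To be clear,
this is an illustration rather than a patch: struct pcbbucket and
bucket_lookup() are made-up names, not code from in_pcb.c, although
in_pcbref() and the inp_hash linkage are the real ones.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>

/*
 * Hypothetical: one mutex per inpcb hash chain instead of a single
 * global lock over the whole table.
 */
struct pcbbucket {
	struct mtx	 pb_mtx;	/* protects pb_head only */
	struct inpcbhead pb_head;	/* one hash chain of inpcbs */
};

/*
 * Lookup serializes on a single bucket, so setup/teardown of
 * connections hashing to different buckets can run concurrently.
 * "match" compares an inp against the wanted 4-tuple.
 */
static struct inpcb *
bucket_lookup(struct pcbbucket *tbl, u_long hashmask, uint32_t hash,
    int (*match)(struct inpcb *, void *), void *arg)
{
	struct pcbbucket *pb;
	struct inpcb *inp;

	pb = &tbl[hash & hashmask];
	mtx_lock(&pb->pb_mtx);
	LIST_FOREACH(inp, &pb->pb_head, inp_hash) {
		if (match(inp, arg)) {
			in_pcbref(inp);	/* keep inp alive after unlock */
			break;
		}
	}
	mtx_unlock(&pb->pb_mtx);
	return (inp);			/* NULL when nothing matched */
}

The win would be that connections hashing to different buckets no
longer serialize on one global lock, which is exactly the contention
our profiling shows; as John notes below, connections to the same
listening socket would still collide on one bucket.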
Finally, I will mention that we have enabled PCBGROUPS in some of our
testing with 9.1 and found no change for our particular workload with
high connection establishment rates.
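Since Jeff mentions the lookup/ref/re-validate dance below, here is how
we read that pattern, again as a sketch only (same headers as the
sketch above; hash_chain_search() is a hypothetical stand-in for the
real chain walk, while INP_HASH_RLOCK(), in_pcbref(), and
in_pcbrele_wlocked() are the existing primitives):

/*
 * Find the inp under the hash lock, pin it with a reference, drop the
 * hash lock, then take the inp lock and check the inp is still valid.
 */
struct inpcb *hash_chain_search(struct inpcbinfo *, uint32_t);

static struct inpcb *
lookup_ref_validate(struct inpcbinfo *pcbinfo, uint32_t hash)
{
	struct inpcb *inp;

	INP_HASH_RLOCK(pcbinfo);
	inp = hash_chain_search(pcbinfo, hash);
	if (inp == NULL) {
		INP_HASH_RUNLOCK(pcbinfo);
		return (NULL);
	}
	in_pcbref(inp);			/* pin across the unlock */
	INP_HASH_RUNLOCK(pcbinfo);

	INP_WLOCK(inp);
	if (in_pcbrele_wlocked(inp))	/* we held the last ref: freed */
		return (NULL);
	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
		/* It changed while we were unlocked; caller retries. */
		INP_WUNLOCK(inp);
		return (NULL);
	}
	return (inp);			/* locked and re-validated */
}

The reference guarantees the inp cannot be freed between the two lock
acquisitions, so the hash lock no longer has to be held across the
whole operation.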
Thanks,

Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net]
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
>
>> On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
>>> Greetings gentlemen,
>>>
>>> It was a pleasure to meet you all last week at the FreeBSD developer
>>> summit. I would like to thank you for spending the time to discuss
>>> all the wonderful internals of the network stack. We also thoroughly
>>> enjoyed the discussion on receive side scaling.
>>>
>>> I'm sure you will remember both Julien Charbon and me asking
>>> questions regarding the TCP stack implementation, specifically
>>> around the locking internals. I am hoping to follow up with a path
>>> forward so we might be able to enhance the connection rate
>>> performance. Our internal testing has found that the V_tcpinfo lock
>>> prevents TCP scaling under high connection setup and teardown rates.
>>> In fact, we surmise that a new "FIN flood" attack may be possible to
>>> degrade server connections significantly.
>>>
>>> In short, we are interested in changing this locking strategy and
>>> hope to get input from someone with more familiarity with the
>>> implementation. We're willing to be part of the coding effort and
>>> are willing to submit our suggestions to the community. I think we
>>> might just need some occasional input.
>>>
>>> Also, I will point out that our similar testing on Linux shows that
>>> the comparable performance between the two operating systems on the
>>> same multi-core hardware is significantly different. We're able to
>>> drive over 200,000 connections per second on a Linux server compared
>>> to fewer than 50,000 on the FreeBSD server. We have kernel profiling
>>> details that we can share if you'd like.
>>
>> I have seen similar results with a redis cluster at work (we ended up
>> deploying proxies to allow applications to reuse existing connections
>> to avoid this). I believe Linux uses RCU for this table. You could
>> perhaps use an rm lock instead of an rw lock. One idea I considered
>> was to split the pcbhash lock up further so you had one lock per hash
>> bucket, so that you could allow concurrent connection setup/teardown
>> so long as they were referencing different buckets. However, I did
>> not think this would have been useful for the case at work, since
>> those connections were insane (single-packet request followed by
>> single-packet reply, with all the setup/teardown overhead) and all
>> going to the same listening socket (so all the setups would hash to
>> the same bucket). Handling concurrent setup on the same listen socket
>> is a PITA but is in fact the common case.
>
> I don't think it's simply a synchronization primitive problem. It
> looks to me like the fundamental issue is that the lock order puts the
> table locks before the inp lock, which means we have to grab them very
> early. Presumably this is the classic sort of container ->
> datastructure, datastructure -> container lock order problem. This
> seems to be made more complex by protecting the list of all pcbs, the
> port allocation, and parts of the hash with the same lock.
>
> Have we tried to further decompose this lock? I would experiment with
> that as a first step. Is this grabbed in so many places just due to
> the complex lock order issue? That seems to be the case. There are
> only a handful of fields marked as protected by the inp info lock. Do
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head:
> change the lock order so the inp lock is acquired before the inp info
> lock. You can resolve the lookup problem by adding an atomic reference
> count that holds the datastructure while you drop the hash lock and
> before you acquire the inp lock. Then you could re-validate the inp
> after lookup. I suspect it's not that simple and there are
> higher-level races that you'll discover are being serialized by this
> big lock, but that's just a hunch.

I read some more. We have already done this lookup/ref/etc. dance for
the hash lock. It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc. It looks like the list
locks have been optimized to make the iterators simple. I think this
is backwards now. We should make the iterators complex and the normal
setup/teardown path simple. The iterators can follow a model like the
hash lock, using sentinels to hold their place. We have the same
pattern elsewhere. It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.

Jeff

> What do you think, Robert? If it would make improving the tcb locking
> simpler, it may fall under the umbrella of what Isilon needs, but I'm
> not sure that's the case. Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
>> The best forum for discussing this is probably on net@, as there are
>> likely other interested parties who might have additional ideas.
>> Also, it might be interesting to look at how connection groups try
>> to handle this. I believe they use an alternate method of decomposing
>> the global lock into smaller chunks, and I think they might do
>> something to help mitigate the listen socket problem (perhaps they
>> duplicate listen sockets in all groups)? Robert would be able to
>> chime in on that, but I believe he is not really back home until next
>> week.
>>
>> --
>> John Baldwin
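P.S. For list readers, a very rough illustration of the sentinel-style
iterator Jeff describes above (same headers as the first sketch;
INP_SENTINEL and inp_list_walk() are made-up names, and a real version
would also need to in_pcbref() the current inp before dropping the
list lock so it cannot be freed underneath the callback):

static void
inp_list_walk(struct inpcbinfo *pcbinfo, void (*cb)(struct inpcb *))
{
	struct inpcb *inp, marker;	/* on-stack sentinel element */

	bzero(&marker, sizeof(marker));
	marker.inp_flags = INP_SENTINEL;	/* other walkers skip it */

	INP_INFO_WLOCK(pcbinfo);
	inp = LIST_FIRST(pcbinfo->ipi_listhead);
	while (inp != NULL) {
		if (inp->inp_flags & INP_SENTINEL) {
			inp = LIST_NEXT(inp, inp_list);	/* skip markers */
			continue;
		}
		/* Park the marker to hold our place, drop the big lock. */
		LIST_INSERT_AFTER(inp, &marker, inp_list);
		INP_INFO_WUNLOCK(pcbinfo);

		cb(inp);		/* runs without INP_INFO held */

		INP_INFO_WLOCK(pcbinfo);
		inp = LIST_NEXT(&marker, inp_list);	/* resume here */
		LIST_REMOVE(&marker, inp_list);
	}
	INP_INFO_WUNLOCK(pcbinfo);
}

The cost moves to the (rare) full-list iterators, leaving the
per-connection setup/teardown path to take only its own inp lock,
which is the inversion Jeff is suggesting.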