Date: Thu, 23 May 2013 16:44:00 +0000
From: "Bentkofsky, Michael" <MBentkofsky@verisign.com>
To: Jeff Roberson <jroberson@jroberson.net>, John Baldwin <jhb@freebsd.org>
Cc: "freebsd-net@freebsd.org" <freebsd-net@freebsd.org>, "jeff@freebsd.org" <jeff@freebsd.org>, "rwatson@freebsd.org" <rwatson@freebsd.org>, "Charbon, Julien" <jcharbon@verisign.com>
Subject: RE: Followup from Verisign after last week's developer summit
Message-ID: <080FBD5B7A09F845842100A6DE79623321F703B5@BRN1WNEXMBX01.vcorp.ad.vrsn.com>
In-Reply-To: <alpine.BSF.2.00.1305211846470.2005@desktop>
References: <080FBD5B7A09F845842100A6DE79623321F6E70C@BRN1WNEXMBX01.vcorp.ad.vrsn.com> <201305211320.26818.jhb@freebsd.org> <alpine.BSF.2.00.1305211204360.2005@desktop> <alpine.BSF.2.00.1305211846470.2005@desktop>
I am adding freebsd-net to this and will re-summarize to get additional input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@, we are noticing significant locking contention on the V_tcpinfo lock under moderately high connection establishment and teardown rates (around 45-50k connections per second). Our profiling suggests the lock contention on V_tcpinfo effectively single-threads all TCP connections. Similar testing on Linux with equivalent hardware does not show this contention and achieves a much higher connection establishment rate. We can attach profiling and test details if anyone would like.

JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we've confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup.

Jeff recommends:
- Changing the lock strategy so the hash lookup can be effectively pushed further down into the stack.
- Making the [list] iterators more complex, like those in use in the hash lookup now.

We are starting down these paths to try to break the locking down. We'll post some initial patch ideas soon. Meanwhile, any additional suggestions are certainly welcome.

Finally, I will mention that we have enabled PCBGROUPS in some of our testing with 9.1 and found no change for our particular workload with high connection establishment rates.
Thanks,
Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net]
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
>
>> On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
>>> Greetings gentlemen,
>>>
>>> It was a pleasure to meet you all last week at the FreeBSD developer
>>> summit. I would like to thank you for spending the time to discuss
>>> all the wonderful internals of the network stack. We also thoroughly
>>> enjoyed the discussion on receive side scaling.
>>>
>>> I'm sure you will remember both Julien Charbon and me asking
>>> questions regarding the TCP stack implementation, specifically around
>>> the locking internals. I am hoping to follow up with a path forward
>>> so we might be able to enhance the connection rate performance. Our
>>> internal testing has found that the V_tcpinfo lock prevents TCP
>>> scaling under high connection setup and teardown rates. In fact, we
>>> surmise that a new "FIN flood" attack may be possible to degrade
>>> server connections significantly.
>>>
>>> In short, we are interested in changing this locking strategy and
>>> hope to get input from someone with more familiarity with the
>>> implementation. We're willing to be part of the coding effort and are
>>> willing to submit our suggestions to the community. I think we might
>>> just need some occasional input.
>>>
>>> Also, I will point out that our similar testing on Linux shows that
>>> the comparable performance between the two operating systems on the
>>> same multi-core hardware is significantly different.
>>> We're able to drive over 200,000 connections per second on a Linux
>>> server compared to fewer than 50,000 on the FreeBSD server. We have
>>> kernel profiling details that we can share if you'd like.
>>
>> I have seen similar results with a redis cluster at work (we ended up
>> deploying proxies to allow applications to reuse existing connections
>> to avoid this).  I believe Linux uses RCU for this table.  You could
>> perhaps use an rm lock instead of an rw lock.  One idea I considered
>> was to split the pcbhash lock up further so you had one lock per hash
>> bucket, so that you could allow concurrent connection setup/teardown
>> so long as they were referencing different buckets.  However, I did
>> not think this would have been useful for the case at work since those
>> connections were insane (single packet request followed by single
>> packet reply, with all the setup/teardown overhead) and all going to
>> the same listening socket (so all the setups would hash to the same
>> bucket).  Handling concurrent setup on the same listen socket is a
>> PITA but is in fact the common case.
>
> I don't think it's simply a synchronization primitive problem.  It
> looks to me like the fundamental issue is that the lock order for the
> tables is prior to the inp lock, which means we have to grab it very
> early.  Presumably this is the classic sort of container ->
> datastructure, datastructure -> container lock order problem.  This
> seems to be made more complex by protecting the list of all pcbs, the
> port allocation, and parts of the hash by the same lock.
>
> Have we tried to further decompose this lock?  I would experiment with
> that as a first step.  Is this grabbed in so many places just due to
> the complex lock order issue?  That seems to be the case.  There are
> only a handful of fields marked as protected by the inp info lock.  Do
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head.
> Change the lock order from inp lock to inp info lock.  You can resolve
> the lookup problem by adding an atomic reference count that holds the
> datastructure while you drop the hash lock and before you acquire the
> inp lock.  Then you could re-validate the inp after lookup.  I suspect
> it's not that simple and there are higher level races that you'll
> discover are being serialized by this big lock, but that's just a hunch.
>

I read some more.  We have already done this lookup/ref/etc. dance for
the hash lock.  It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc.  It looks like the list
locks have been optimized to make the iterators simple.  I think this
is backwards now.  We should make the iterators complex and the normal
setup/teardown path simple.  The iterators can follow a model like the
hash lock, using sentinels to hold their place.  We have the same
pattern elsewhere.  It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.

Jeff

> What do you think Robert?  If it would make improving the tcb locking
> simpler it may fall under the umbrella of what Isilon needs, but I'm
> not sure that's the case.  Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
>> The best forum for discussing this is probably on net@ as there are
>> likely other interested parties who might have additional ideas.
>> Also, it might be interesting to look at how connection groups try to
>> handle this.  I believe they use an alternate method of decomposing
>> the global lock into smaller chunks, and I think they might do
>> something to help mitigate the listen socket problem (perhaps they
>> duplicate listen sockets in all groups)?
>> Robert would be able to chime in on that, but I believe he is not
>> really back home until next week.
>>
>> --
>> John Baldwin
>