From owner-freebsd-net@FreeBSD.ORG Thu May 23 16:45:51 2013
From: "Bentkofsky, Michael" <MBentkofsky@verisign.com>
To: Jeff Roberson <jroberson@jroberson.net>, John Baldwin <jhb@freebsd.org>
Cc: freebsd-net@freebsd.org, jeff@freebsd.org, rwatson@freebsd.org,
 "Charbon, Julien"
Subject: RE: Followup from Verisign after last week's developer summit
Date: Thu, 23 May 2013 16:44:00 +0000

I am adding freebsd-net to this thread and will re-summarize to get
additional input. Thanks for all of the initial suggestions.

For the benefit of those on freebsd-net@: we are seeing significant
contention on the V_tcpinfo lock under moderately high connection
establishment and teardown rates (around 45-50k connections per second).
Our profiling suggests the contention on V_tcpinfo effectively
single-threads all TCP connections. Similar testing on Linux with
equivalent hardware shows no such contention and reaches a much higher
connection establishment rate. We can attach profiling and test details
if anyone would like.

JHB recommends:
- He has seen similar results in other kinds of testing.
- Linux uses RCU for the locking on the equivalent table (we have
  confirmed this to be the case).
- Looking into a lock per bucket on the PCB lookup (a rough sketch of
  this idea follows below).

Jeff recommends:
- Changing the locking strategy so the hash lookup can be pushed
  further down into the stack.
- Making the [list] iterators more complex, like those now used in the
  hash lookup (see the P.S. at the end of this mail for a sketch).

We are starting down these paths to try to break the locking down and
will post some initial patch ideas soon. Meanwhile, any additional
suggestions are certainly welcome.
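To make the per-bucket idea concrete for the list, the rough shape we
have been experimenting with looks like the following. To be clear,
this is an illustration rather than a patch: struct pcbbucket and
bucket_lookup() are made-up names, not code from in_pcb.c, although
in_pcbref() and the inp_hash linkage are the real ones.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <netinet/in.h>
#include <netinet/in_pcb.h>

/*
 * Hypothetical: one mutex per inpcb hash chain instead of a single
 * global lock over the whole table.
 */
struct pcbbucket {
	struct mtx	 pb_mtx;	/* protects pb_head only */
	struct inpcbhead pb_head;	/* one hash chain of inpcbs */
};

/*
 * Lookup serializes on a single bucket, so setup/teardown of
 * connections hashing to different buckets can run concurrently.
 * "match" compares an inp against the wanted 4-tuple.
 */
static struct inpcb *
bucket_lookup(struct pcbbucket *tbl, u_long hashmask, uint32_t hash,
    int (*match)(struct inpcb *, void *), void *arg)
{
	struct pcbbucket *pb;
	struct inpcb *inp;

	pb = &tbl[hash & hashmask];
	mtx_lock(&pb->pb_mtx);
	LIST_FOREACH(inp, &pb->pb_head, inp_hash) {
		if (match(inp, arg)) {
			in_pcbref(inp);	/* keep inp alive after unlock */
			break;
		}
	}
	mtx_unlock(&pb->pb_mtx);
	return (inp);			/* NULL when nothing matched */
}

The win would be that connections hashing to different buckets no
longer serialize on one global lock, which is exactly the contention
our profiling shows; as John notes below, connections to the same
listening socket would still collide on one bucket.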
Finally, I will mention that we have enabled PCBGROUPS in some of our
testing with 9.1 and found no change for our particular workload with
high connection establishment rates.
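Since Jeff mentions the lookup/ref/re-validate dance below, here is how
we read that pattern, again as a sketch only (same headers as the
sketch above; hash_chain_search() is a hypothetical stand-in for the
real chain walk, while INP_HASH_RLOCK(), in_pcbref(), and
in_pcbrele_wlocked() are the existing primitives):

/*
 * Find the inp under the hash lock, pin it with a reference, drop the
 * hash lock, then take the inp lock and check the inp is still valid.
 */
struct inpcb *hash_chain_search(struct inpcbinfo *, uint32_t);

static struct inpcb *
lookup_ref_validate(struct inpcbinfo *pcbinfo, uint32_t hash)
{
	struct inpcb *inp;

	INP_HASH_RLOCK(pcbinfo);
	inp = hash_chain_search(pcbinfo, hash);
	if (inp == NULL) {
		INP_HASH_RUNLOCK(pcbinfo);
		return (NULL);
	}
	in_pcbref(inp);			/* pin across the unlock */
	INP_HASH_RUNLOCK(pcbinfo);

	INP_WLOCK(inp);
	if (in_pcbrele_wlocked(inp))	/* we held the last ref: freed */
		return (NULL);
	if (inp->inp_flags & (INP_TIMEWAIT | INP_DROPPED)) {
		/* It changed while we were unlocked; caller retries. */
		INP_WUNLOCK(inp);
		return (NULL);
	}
	return (inp);			/* locked and re-validated */
}

The reference guarantees the inp cannot be freed between the two lock
acquisitions, so the hash lock no longer has to be held across the
whole operation.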
Thanks,

Mike

-----Original Message-----
From: Jeff Roberson [mailto:jroberson@jroberson.net]
Sent: Wednesday, May 22, 2013 12:50 AM
To: John Baldwin
Cc: Bentkofsky, Michael; rwatson@freebsd.org; jeff@freebsd.org; Charbon, Julien
Subject: Re: Followup from Verisign after last week's developer summit

On Tue, 21 May 2013, Jeff Roberson wrote:

> On Tue, 21 May 2013, John Baldwin wrote:
>
>> On Monday, May 20, 2013 9:48:02 am Bentkofsky, Michael wrote:
>>> Greetings gentlemen,
>>>
>>> It was a pleasure to meet you all last week at the FreeBSD developer
>>> summit. I would like to thank you for spending the time to discuss
>>> all the wonderful internals of the network stack. We also thoroughly
>>> enjoyed the discussion on receive side scaling.
>>>
>>> I'm sure you will remember both Julien Charbon and me asking
>>> questions regarding the TCP stack implementation, specifically
>>> around the locking internals. I am hoping to follow up with a path
>>> forward so we might be able to enhance the connection rate
>>> performance. Our internal testing has found that the V_tcpinfo lock
>>> prevents TCP scaling under high connection setup and teardown rates.
>>> In fact, we surmise that a new "FIN flood" attack may be possible to
>>> degrade server connections significantly.
>>>
>>> In short, we are interested in changing this locking strategy and
>>> hope to get input from someone with more familiarity with the
>>> implementation. We're willing to be part of the coding effort and
>>> are willing to submit our suggestions to the community. I think we
>>> might just need some occasional input.
>>>
>>> Also, I will point out that our similar testing on Linux shows that
>>> the comparable performance between the two operating systems on the
>>> same multi-core hardware is significantly different. We're able to
>>> drive over 200,000 connections per second on a Linux server compared
>>> to fewer than 50,000 on the FreeBSD server. We have kernel profiling
>>> details that we can share if you'd like.
>>
>> I have seen similar results with a redis cluster at work (we ended up
>> deploying proxies to allow applications to reuse existing connections
>> to avoid this). I believe Linux uses RCU for this table. You could
>> perhaps use an rm lock instead of an rw lock. One idea I considered
>> was to split the pcbhash lock up further so you had one lock per hash
>> bucket, so that you could allow concurrent connection setup/teardown
>> so long as they were referencing different buckets. However, I did
>> not think this would have been useful for the case at work, since
>> those connections were insane (single-packet request followed by
>> single-packet reply, with all the setup/teardown overhead) and all
>> going to the same listening socket (so all the setups would hash to
>> the same bucket). Handling concurrent setup on the same listen socket
>> is a PITA but is in fact the common case.
>
> I don't think it's simply a synchronization primitive problem. It
> looks to me like the fundamental issue is that the lock order puts the
> table locks before the inp lock, which means we have to grab them very
> early. Presumably this is the classic sort of container ->
> datastructure, datastructure -> container lock order problem. This
> seems to be made more complex by protecting the list of all pcbs, the
> port allocation, and parts of the hash with the same lock.
>
> Have we tried to further decompose this lock? I would experiment with
> that as a first step. Is this grabbed in so many places just due to
> the complex lock order issue? That seems to be the case. There are
> only a handful of fields marked as protected by the inp info lock. Do
> we know that this list is complete?
>
> My second step would be to attempt to turn the locking on its head:
> change the lock order so the inp lock is acquired before the inp info
> lock. You can resolve the lookup problem by adding an atomic reference
> count that holds the datastructure while you drop the hash lock and
> before you acquire the inp lock. Then you could re-validate the inp
> after lookup. I suspect it's not that simple and there are
> higher-level races that you'll discover are being serialized by this
> big lock, but that's just a hunch.

I read some more. We have already done this lookup/ref/etc. dance for
the hash lock. It handles the hard cases of multiple inp_* calls and
synchronizing the ports, bind, connect, etc. It looks like the list
locks have been optimized to make the iterators simple. I think this
is backwards now. We should make the iterators complex and the normal
setup/teardown path simple. The iterators can follow a model like the
hash lock, using sentinels to hold their place. We have the same
pattern elsewhere. It would allow you to acquire the INP_INFO lock
after the INP lock and push it much deeper into the stack.

Jeff

> What do you think, Robert? If it would make improving the tcb locking
> simpler, it may fall under the umbrella of what Isilon needs, but I'm
> not sure that's the case. Certainly my earlier attempts at deferred
> processing were made more complex by this arrangement.
>
> Thanks,
> Jeff
>
>> The best forum for discussing this is probably on net@, as there are
>> likely other interested parties who might have additional ideas.
>> Also, it might be interesting to look at how connection groups try
>> to handle this. I believe they use an alternate method of decomposing
>> the global lock into smaller chunks, and I think they might do
>> something to help mitigate the listen socket problem (perhaps they
>> duplicate listen sockets in all groups)? Robert would be able to
>> chime in on that, but I believe he is not really back home until next
>> week.
>>
>> --
>> John Baldwin
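P.S. For list readers, a very rough illustration of the sentinel-style
iterator Jeff describes above (same headers as the first sketch;
INP_SENTINEL and inp_list_walk() are made-up names, and a real version
would also need to in_pcbref() the current inp before dropping the
list lock so it cannot be freed underneath the callback):

static void
inp_list_walk(struct inpcbinfo *pcbinfo, void (*cb)(struct inpcb *))
{
	struct inpcb *inp, marker;	/* on-stack sentinel element */

	bzero(&marker, sizeof(marker));
	marker.inp_flags = INP_SENTINEL;	/* other walkers skip it */

	INP_INFO_WLOCK(pcbinfo);
	inp = LIST_FIRST(pcbinfo->ipi_listhead);
	while (inp != NULL) {
		if (inp->inp_flags & INP_SENTINEL) {
			inp = LIST_NEXT(inp, inp_list);	/* skip markers */
			continue;
		}
		/* Park the marker to hold our place, drop the big lock. */
		LIST_INSERT_AFTER(inp, &marker, inp_list);
		INP_INFO_WUNLOCK(pcbinfo);

		cb(inp);		/* runs without INP_INFO held */

		INP_INFO_WLOCK(pcbinfo);
		inp = LIST_NEXT(&marker, inp_list);	/* resume here */
		LIST_REMOVE(&marker, inp_list);
	}
	INP_INFO_WUNLOCK(pcbinfo);
}

The cost moves to the (rare) full-list iterators, leaving the
per-connection setup/teardown path to take only its own inp lock,
which is the inversion Jeff is suggesting.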