To: freebsd-net@freebsd.org
From: Julien Charbon
Subject: Re: TCP stack lock contention with short-lived connections
Date: Mon, 26 May 2014 15:36:40 +0200
Message-ID: <53834368.6070103@verisign.com>
References: <537F39DF.1090900@verisign.com> <537FB51D.2060401@verisign.com> <537FBFA4.1010902@FreeBSD.org>
In-Reply-To: <537FBFA4.1010902@FreeBSD.org>

 Hi Navdeep,

On 23/05/14 23:37, Navdeep Parhar wrote:
> On 05/23/14 13:52, Julien Charbon wrote:
>> On 23/05/14 14:06, Julien Charbon wrote:
>>> On 27/02/14 11:32, Julien Charbon wrote:
>>>> On 07/11/13 14:55, Julien Charbon wrote:
>>>>> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon wrote:
>>>>>> I have put the technical and how-to-repeat details in the PR
>>>>>> below:
>>>>>>
>>>>>> kern/183659: TCP stack lock contention with short-lived
>>>>>> connections
>>>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>>>>>
>>>>>> We are currently working on this performance improvement
>>>>>> effort; it will impact only the TCP locking strategy, not the
>>>>>> TCP stack logic itself.  We will share the patches on
>>>>>> freebsd-net for review and improvement proposals; in any case,
>>>>>> this change will need enough eyeballs to avoid introducing
>>>>>> tricky race conditions into the TCP stack.
>>
>> Attached are the two cumulative patches we discussed the most at
>> BSDCan 2014 (tcp-scale-inp-list-v1.patch and
>> tcp-scale-pcbinfo-rlock-v1.patch).
>>
>> [...]
>>
>> _However_ it would be a miracle if this change did not introduce
>> new race condition(s) (hence the 'alpha' tag in the commit
>> message), as most of the TCP stack locking strategy now rests on
>> the inpcb lock's shoulders.  That said, from our testing point of
>> view this change is completely stable: no kernel/lock assertions,
>> no unexpected TCP behavior, and stable performance results.
>> Moreover, before tagging this change as 'beta' we need to test
>> the following features more thoroughly:
>>
>> - VNET,
>> - PCBGROUP/RSS/TCP timer per CPU,
>> - TCP offloading (we need a NIC with good TCP offloading support)
>
> I can assess the impact (and fix any fallout) on the parts of the
> kernel that deal with TCP_OFFLOAD, and the TOE driver in
> dev/cxgbe/tom.  But I was hoping to do that only after there was
> general agreement on net@ that these locking changes are sound and
> should be taken into HEAD.  Lack of reviews seems to be holding
> this back, correct?

Correct, so far these changes have been reviewed and tested only
internally.  As we are not finding any more issues ourselves, we are
sharing them for wider testing, comments and reviews.

A word of advice for reviewers: when reading the code, do not be
fooled by the remaining ipi_lock read lock (INP_INFO_RLOCK); you
should consider ipi_lock as completely deleted from the TCP stack.
All TCP code that used to be under the ipi_lock umbrella is now
protected only by the inp_lock.  Just keep that in mind (a
simplified illustration is sketched at the end of this message).

Below, just for your information, is more detail on the context of
these changes:

o The rough consensus at BSDCan was that there is shared interest in
improving the scalability of TCP workloads with a potentially high
rate of connection establishment and tear-down.

o Our requirements for this task were:

  - Achieve more than 100k TCP connections per second without
    dropping a single packet on the receive side
  - Use a strategy that does not require changing all the network
    stacks at once (TCP, UDP, RAW, etc.)
  - Be able to introduce better performance progressively, leveraging
    the mutex strategy already in place
  - Keep the TCP stack stable (obviously)

o Solutions we tried to implement and that failed:

#1 Decompose ipi_lock into finer-grained locks on the ipi_hashbase
buckets (i.e. add a mutex in struct inpcbhead).  This did not work:
the induced change was quite big, and keeping the network stacks'
shared code (in in_pcb.{h,c}) clean was much more difficult than
expected.

#2 Reverse the lock ordering from:

   ipi_lock > inpcb lock > ipi_hash_lock, pcbgroup locks

   to:

   inpcb lock > ipi_lock, ipi_hash_lock, pcbgroup locks

The change turned into an all-or-nothing giant patch, it did not
handle the full inp list traversal functions, and having a different
lock ordering between the TCP stack and the other network stacks was
quite a challenge to maintain.

My 2 cents.

--
Julien
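P.S. To make the reviewer advice above more concrete, here is a
minimal userland sketch of the idea.  It is not the actual patch:
the names (info_lock, struct conn, teardown_old/teardown_new) are
made up, and pthread primitives merely stand in for the kernel
ipi_lock and the per-inpcb lock.

/*
 * Userland model of the locking change: "info_lock" plays the role
 * of the pcbinfo ipi_lock, "conn->lock" the role of the inpcb lock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t info_lock = PTHREAD_RWLOCK_INITIALIZER;

struct conn {				/* stand-in for struct inpcb */
	pthread_mutex_t lock;		/* stand-in for inp_lock */
	int state;
};

/*
 * Old-style teardown: the global lock is taken exclusively, so two
 * CPUs tearing down unrelated connections serialize on info_lock.
 */
static void
teardown_old(struct conn *c)
{
	pthread_rwlock_wrlock(&info_lock);	/* INP_INFO_WLOCK-like */
	pthread_mutex_lock(&c->lock);		/* INP_WLOCK-like */
	c->state = 0;				/* per-connection work */
	pthread_mutex_unlock(&c->lock);
	pthread_rwlock_unlock(&info_lock);
}

/*
 * Patched-style teardown: only the read side of the global lock is
 * held (kept in this model for paths that still walk the full
 * connection list); the real protection of the connection state is
 * the per-connection lock, so unrelated teardowns run in parallel.
 */
static void
teardown_new(struct conn *c)
{
	pthread_rwlock_rdlock(&info_lock);	/* INP_INFO_RLOCK-like */
	pthread_mutex_lock(&c->lock);
	c->state = 0;
	pthread_mutex_unlock(&c->lock);
	pthread_rwlock_unlock(&info_lock);
}

int
main(void)
{
	struct conn a = { .state = 1 };
	struct conn b = { .state = 1 };

	pthread_mutex_init(&a.lock, NULL);
	pthread_mutex_init(&b.lock, NULL);
	teardown_old(&a);
	teardown_new(&b);
	printf("a=%d b=%d\n", a.state, b.state);
	return (0);
}

Compile with something like 'cc -Wall -o lockmodel lockmodel.c
-lpthread' (the file name is arbitrary).  Both versions are correct;
the point is only that the second one lets connection setup and
teardown proceed in parallel on different inpcbs, which is where the
short-lived connection scalability comes from.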