To: freebsd-net@freebsd.org
From: Julien Charbon
Subject: Re: TCP stack lock contention with short-lived connections
Date: Mon, 26 May 2014 15:36:40 +0200
Message-ID: <53834368.6070103@verisign.com>
References: <537F39DF.1090900@verisign.com> <537FB51D.2060401@verisign.com> <537FBFA4.1010902@FreeBSD.org>
In-Reply-To: <537FBFA4.1010902@FreeBSD.org>

 Hi Navdeep,

On 23/05/14 23:37, Navdeep Parhar wrote:
> On 05/23/14 13:52, Julien Charbon wrote:
>> On 23/05/14 14:06, Julien Charbon wrote:
>>> On 27/02/14 11:32, Julien Charbon wrote:
>>>> On 07/11/13 14:55, Julien Charbon wrote:
>>>>> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon wrote:
>>>>>> I have put the technical and how-to-repeat details in the PR
>>>>>> below:
>>>>>>
>>>>>> kern/183659: TCP stack lock contention with short-lived
>>>>>> connections
>>>>>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
>>>>>>
>>>>>> We are currently working on this performance improvement
>>>>>> effort; it will impact only the TCP locking strategy, not the
>>>>>> TCP stack logic itself.  We will share the patches on
>>>>>> freebsd-net for review and improvement proposals; in any case,
>>>>>> this change will need enough eyeballs to avoid introducing
>>>>>> tricky race conditions into the TCP stack.
>>
>> Attached are the two cumulative patches we discussed the most at
>> BSDCan 2014 (tcp-scale-inp-list-v1.patch and
>> tcp-scale-pcbinfo-rlock-v1.patch).
>>
>> [...]
>>
>> _However_ it would be a miracle if this change did not introduce
>> new race condition(s) (hence the 'alpha' tag in the commit
>> message), as most of the TCP stack locking strategy now rests on
>> the inpcb lock's shoulders.  That said, from our testing point of
>> view this change is completely stable: no kernel/lock assertions,
>> no unexpected TCP behavior, and stable performance results.
>> Moreover, before tagging this change as 'beta' we need to test
>> the following features more thoroughly:
>>
>> - VNET,
>> - PCBGROUP/RSS/TCP timer per CPU,
>> - TCP offloading (we need a NIC with good TCP offloading support)
>
> I can assess the impact (and fix any fallout) on the parts of the
> kernel that deal with TCP_OFFLOAD, and the TOE driver in
> dev/cxgbe/tom.  But I was hoping to do that only after there was
> general agreement on net@ that these locking changes are sound and
> should be taken into HEAD.  Lack of reviews seems to be holding
> this back, correct?

Correct, so far these changes have been reviewed and tested only
internally.  As we are not finding any more issues ourselves, we are
sharing them for wider testing, comments and reviews.

A word of advice for reviewers: when reading the code, do not be
fooled by the remaining ipi_lock read lock (INP_INFO_RLOCK); you
should consider ipi_lock as completely deleted from the TCP stack.
All TCP code that used to be under the ipi_lock umbrella is now
protected only by the inp_lock.  Just keep that in mind (a
simplified illustration is sketched at the end of this message).

Below, just for your information, is more detail on the context of
these changes:

o The rough consensus at BSDCan was that there is shared interest in
improving the scalability of TCP workloads with a potentially high
rate of connection establishment and tear-down.

o Our requirements for this task were:

  - Achieve more than 100k TCP connections per second without
    dropping a single packet on the receive side
  - Use a strategy that does not require changing all the network
    stacks at once (TCP, UDP, RAW, etc.)
  - Be able to introduce better performance progressively, leveraging
    the mutex strategy already in place
  - Keep the TCP stack stable (obviously)

o Solutions we tried to implement and that failed:

#1 Decompose ipi_lock into finer-grained locks on the ipi_hashbase
buckets (i.e. add a mutex in struct inpcbhead).  This did not work:
the induced change was quite big, and keeping the network stacks'
shared code (in in_pcb.{h,c}) clean was much more difficult than
expected.

#2 Reverse the lock ordering from:

   ipi_lock > inpcb lock > ipi_hash_lock, pcbgroup locks

   to:

   inpcb lock > ipi_lock, ipi_hash_lock, pcbgroup locks

The change turned into an all-or-nothing giant patch, it did not
handle the full inp list traversal functions, and having a different
lock ordering between the TCP stack and the other network stacks was
quite a challenge to maintain.

My 2 cents.

--
Julien
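P.S. To make the reviewer advice above more concrete, here is a
minimal userland sketch of the idea.  It is not the actual patch:
the names (info_lock, struct conn, teardown_old/teardown_new) are
made up, and pthread primitives merely stand in for the kernel
ipi_lock and the per-inpcb lock.

/*
 * Userland model of the locking change: "info_lock" plays the role
 * of the pcbinfo ipi_lock, "conn->lock" the role of the inpcb lock.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t info_lock = PTHREAD_RWLOCK_INITIALIZER;

struct conn {				/* stand-in for struct inpcb */
	pthread_mutex_t lock;		/* stand-in for inp_lock */
	int state;
};

/*
 * Old-style teardown: the global lock is taken exclusively, so two
 * CPUs tearing down unrelated connections serialize on info_lock.
 */
static void
teardown_old(struct conn *c)
{
	pthread_rwlock_wrlock(&info_lock);	/* INP_INFO_WLOCK-like */
	pthread_mutex_lock(&c->lock);		/* INP_WLOCK-like */
	c->state = 0;				/* per-connection work */
	pthread_mutex_unlock(&c->lock);
	pthread_rwlock_unlock(&info_lock);
}

/*
 * Patched-style teardown: only the read side of the global lock is
 * held (kept in this model for paths that still walk the full
 * connection list); the real protection of the connection state is
 * the per-connection lock, so unrelated teardowns run in parallel.
 */
static void
teardown_new(struct conn *c)
{
	pthread_rwlock_rdlock(&info_lock);	/* INP_INFO_RLOCK-like */
	pthread_mutex_lock(&c->lock);
	c->state = 0;
	pthread_mutex_unlock(&c->lock);
	pthread_rwlock_unlock(&info_lock);
}

int
main(void)
{
	struct conn a = { .state = 1 };
	struct conn b = { .state = 1 };

	pthread_mutex_init(&a.lock, NULL);
	pthread_mutex_init(&b.lock, NULL);
	teardown_old(&a);
	teardown_new(&b);
	printf("a=%d b=%d\n", a.state, b.state);
	return (0);
}

Compile with something like 'cc -Wall -o lockmodel lockmodel.c
-lpthread' (the file name is arbitrary).  Both versions are correct;
the point is only that the second one lets connection setup and
teardown proceed in parallel on different inpcbs, which is where the
short-lived connection scalability comes from.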