From owner-freebsd-net@FreeBSD.ORG Fri May 23 12:07:17 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 23CB6356 for ; Fri, 23 May 2014 12:07:17 +0000 (UTC) Received: from plane.gmane.org (plane.gmane.org [80.91.229.3]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A24A924E9 for ; Fri, 23 May 2014 12:07:16 +0000 (UTC) Received: from list by plane.gmane.org with local (Exim 4.69) (envelope-from ) id 1WnoFg-0003Hf-4p for freebsd-net@freebsd.org; Fri, 23 May 2014 14:07:08 +0200 Received: from h87.s239.verisign.com ([216.168.239.87]) by main.gmane.org with esmtp (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 23 May 2014 14:07:08 +0200 Received: from jcharbon by h87.s239.verisign.com with local (Gmexim 0.1 (Debian)) id 1AlnuQ-0007hv-00 for ; Fri, 23 May 2014 14:07:08 +0200 X-Injected-Via-Gmane: http://gmane.org/ To: freebsd-net@freebsd.org From: Julien Charbon Subject: Re: TCP stack lock contention with short-lived connections Date: Fri, 23 May 2014 14:06:55 +0200 Lines: 232 Message-ID: <537F39DF.1090900@verisign.com> References: Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="------------030706050106030100090504" X-Complaints-To: usenet@ger.gmane.org X-Gmane-NNTP-Posting-Host: h87.s239.verisign.com User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:24.0) Gecko/20100101 Thunderbird/24.5.0 In-Reply-To: Cc: George Neville-Neil X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 23 May 2014 12:07:17 -0000 This is a multi-part message in MIME format. --------------030706050106030100090504 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Hi, On 27/02/14 11:32, Julien Charbon wrote: > On 07/11/13 14:55, Julien Charbon wrote: >> On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon >> wrote: >>> I have put technical and how-to-repeat details in below PR: >>> >>> kern/183659: TCP stack lock contention with short-lived connections >>> http://www.freebsd.org/cgi/query-pr.cgi?pr=183659 >>> >>> We are currently working on this performance improvement effort; it >>> will impact only the TCP locking strategy not the TCP stack logic >>> itself. We will share on freebsd-net the patches we made for >>> reviewing and improvement propositions; anyway this change might also >>> require enough eyeballs to avoid tricky race conditions introduction >>> in TCP stack. > > [...] > > 2. We studied an another lock contention related to INP_INFO when TCP > connections in TIME_WAIT state are cleaned-up. [...] First a follow-up on TCP performance improvements and BSDCan 2014 discussions: - First, changes related to TCP lock connection and TIME_WAIT list have been reviewed, fixed, improved and accepted in HEAD: http://svnweb.freebsd.org/base?view=revision&revision=264321 http://svnweb.freebsd.org/base?view=revision&revision=264342 http://svnweb.freebsd.org/base?view=revision&revision=264351 http://svnweb.freebsd.org/base?view=revision&revision=264356 Thanks to testers, reviewers and committers (jhb, rwatson, rrs, adrian, glebius, gnn, bde, jmg, rrs, etc.). - Second, it has been discussed in BSDCan that it would nice to also share Verisign patches in public github. This is done here: https://github.com/verisign/freebsd We currently maintain two branches: - One that follows 10.0-RELENG branch: https://github.com/verisign/freebsd/commits/share/10.0-next - One that follows HEAD: https://github.com/verisign/freebsd/commits/share/head-next These branches are cloned from on the handy github FreeBSD repository mirror: https://github.com/freebsd/freebsd We pushed our stable enough TCP related patches in this repo. - Third, joined a patch (tcp-scale-syn-v1.patch) we would propose for review. Obviously, it is also available here: https://github.com/verisign/freebsd/commit/38679f5f66b470d069d635bfc8d9ec6c39eb787b This patch is based (like most of our TCP changes) on Robert Watson major change to decompose inpcbinfo lock (aka ipi_lock or INP_INFO:) Decompose the current single inpcbinfo lock into two locks: http://svnweb.freebsd.org/base?view=revision&revision=222488 https://github.com/freebsd/freebsd/commit/fdfdadb612a4b077d62094c7d4aa65de3524cf62 In particular on this comment: " - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. " Basically this change prevents to take the TCP ipi_lock lock when a SYN targeting a listening socket is received. We expected a need to add another syncache mutex to take over ipi_lock role, but surprisingly we don't find any requirements for a new mutex (and it doesn't mean that this requirement doesn't exist). This patch is quite stable under our workload, improves only a bit the short-lived TCP connection rate (~+2%), however it can become useful under SYN flood attack pressure. We still have to test more thoroughly this patch in two contexts: - VNET and, - syncache.hashsize sets very low relative to TCP workload As usual, tests, reviews and comments more than welcome. Thanks. -- Julien --------------030706050106030100090504 Content-Type: text/plain; charset=UTF-8; name="tcp-scale-syn-v1.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="tcp-scale-syn-v1.patch" diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c index f9a751c..4b6f41f 100644 --- a/sys/netinet/tcp_input.c +++ b/sys/netinet/tcp_input.c @@ -588,6 +588,7 @@ tcp_input(struct mbuf *m, int off0) #endif /* INET6 */ struct tcpopt to; /* options in this segment */ char *s = NULL; /* address and port logging */ + int needlock; int ti_locked; #define TI_UNLOCKED 1 #define TI_WLOCKED 2 @@ -770,12 +771,12 @@ tcp_input(struct mbuf *m, int off0) /* * Locate pcb for segment; if we're likely to add or remove a - * connection then first acquire pcbinfo lock. There are two cases + * connection then first acquire pcbinfo lock. There are three cases * where we might discover later we need a write lock despite the * flags: ACKs moving a connection out of the syncache, and ACKs for - * a connection in TIMEWAIT. + * a connection in TIMEWAIT, SYNs for a non-listening socket. */ - if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) { + if ((thflags & (TH_FIN | TH_RST)) != 0) { INP_INFO_WLOCK(&V_tcbinfo); ti_locked = TI_WLOCKED; } else @@ -1004,10 +1005,18 @@ tcp_input(struct mbuf *m, int off0) * now be in TIMEWAIT. */ #ifdef INVARIANTS - if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) + if ((thflags & (TH_FIN | TH_RST)) != 0) INP_INFO_WLOCK_ASSERT(&V_tcbinfo); #endif - if (tp->t_state != TCPS_ESTABLISHED) { + needlock = 1; + if (tp->t_state == TCPS_ESTABLISHED) { + if ((thflags & TH_SYN) == 0) + needlock = 0; + } else { + if (tp->t_state == TCPS_LISTEN && (thflags & TH_SYN)) + needlock = 0; + } + if (needlock) { if (ti_locked == TI_UNLOCKED) { if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) { in_pcbref(inp); @@ -1048,17 +1057,13 @@ tcp_input(struct mbuf *m, int off0) /* * When the socket is accepting connections (the INPCB is in LISTEN * state) we look into the SYN cache if this is a new connection - * attempt or the completion of a previous one. Because listen - * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be - * held in this case. + * attempt or the completion of a previous one. */ if (so->so_options & SO_ACCEPTCONN) { struct in_conninfo inc; KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but " "tp not listening", __func__)); - INP_INFO_WLOCK_ASSERT(&V_tcbinfo); - bzero(&inc, sizeof(inc)); #ifdef INET6 if (isipv6) { @@ -1081,6 +1086,8 @@ tcp_input(struct mbuf *m, int off0) * socket appended to the listen queue in SYN_RECEIVED state. */ if ((thflags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK) { + + INP_INFO_WLOCK_ASSERT(&V_tcbinfo); /* * Parse the TCP options here because * syncookies need access to the reflected @@ -1361,8 +1368,12 @@ tcp_input(struct mbuf *m, int off0) syncache_add(&inc, &to, th, inp, &so, m, NULL, NULL); /* * Entry added to syncache and mbuf consumed. - * Everything already unlocked by syncache_add(). + * Nothing is unlocked by syncache_add(). */ + if (ti_locked == TI_WLOCKED) { + INP_INFO_WUNLOCK(&V_tcbinfo); + ti_locked = TI_UNLOCKED; + } INP_INFO_UNLOCK_ASSERT(&V_tcbinfo); return; } else if (tp->t_state == TCPS_LISTEN) { diff --git a/sys/netinet/tcp_syncache.c b/sys/netinet/tcp_syncache.c index beb6749..9b981d3 100644 --- a/sys/netinet/tcp_syncache.c +++ b/sys/netinet/tcp_syncache.c @@ -1119,7 +1119,6 @@ syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct syncache scs; struct ucred *cred; - INP_INFO_WLOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); /* listen socket */ KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN, ("%s: unexpected tcp flags", __func__)); @@ -1150,13 +1149,11 @@ syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, #ifdef MAC if (mac_syncache_init(&maclabel) != 0) { INP_WUNLOCK(inp); - INP_INFO_WUNLOCK(&V_tcbinfo); goto done; } else mac_syncache_create(maclabel, inp); #endif INP_WUNLOCK(inp); - INP_INFO_WUNLOCK(&V_tcbinfo); /* * Remember the IP options, if any. --------------030706050106030100090504--