Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 14 Mar 2014 13:18:06 +0100
From:      Julien Charbon <jcharbon@verisign.com>
To:        freebsd-net@freebsd.org
Subject:   Re: TCP stack lock contention with short-lived connections
Message-ID:  <5322F37E.5040700@verisign.com>
In-Reply-To: <5319BEE7.607@verisign.com>
References:  <op.w51mxed6ak5tgc@fri2jcharbon-m1.local> <op.w56mamc0ak5tgc@dul1rjacobso-l3.vcorp.ad.vrsn.com> <len481$sfv$2@ger.gmane.org> <20140306215743.GB47921@funkthat.com> <5319BEE7.607@verisign.com>

next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format.
--------------090708080906090808070008
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


  Hi John,

On 07/03/14 13:43, Julien Charbon wrote:
> On 06/03/14 22:57, John-Mark Gurney wrote:
>> Julien Charbon wrote this message on Thu, Feb 27, 2014 at 11:32 +0100:
>>> [...]
>>>   Any thoughts on this particular behavior?
>>
>> One thing that I noticed is that you now lock/unlock the tw and inp
>> lock a
>> lot...  Have you thought about grabing the TW lock once, grabbing
>> some/all
>> of the ones necessary to process and then process them in a second step?
>> If the bulk processing is still an issue, then you could process them in
>> blocks of 50 or so, that way the number of lock/unlock cycles is
>> reduced...
>
>   First thanks, feedback are highly valuable to us.  In first place, I
> indeed tried a kind of bulk processing enforcement with something like:
>
>   tcp_tw_2msl_scan() {
>
>          struct tcptw *tw;
>          int i, end = 0, count = 100;
>
>          for (;;) {
>                  INP_INFO_WLOCK(&V_tcbinfo);
>                  for (i = 0; i < count; ++i) {
>                          tw = TAILQ_FIRST(&V_twq_2msl);
>                          if (tw == NULL || (tw->tw_time - ticks) > 0)) {
>                                  end = 1;
>                                  break;
>                          }
>                          INP_WLOCK(tw->tw_inpcb);
>                          tcp_twclose(tw, 0);
>                  }
>                  if (end)
>                          break;
>                 INP_INFO_WUNLOCK(&V_tcbinfo);
>          }
>          return (NULL);
> }
>
>   And I got best result with 'count' set to 1, this led us to current
> proposed patch. [...]

  I would like to continue the discussion about tcp_tw_2msl_scan().

  The proposed patch makes tcp_tw_2msl_scan() less disturbing for the 
TCP stack (another way to say it makes tcp_tw_2msl_scan() less efficient 
to prioritize other TCP related tasks), and continuing on this way I 
have tested below change on top of previous patch:

diff --git a/sys/netinet/tcp_timewait.c b/sys/netinet/tcp_timewait.c
index 0230b88..da79120 100644
--- a/sys/netinet/tcp_timewait.c
+++ b/sys/netinet/tcp_timewait.c
@@ -728,7 +728,7 @@ tcp_tw_2msl_scan(void)
                 TW_RUNLOCK(V_tw_lock);

                 /* Close timewait state */
-               INP_INFO_WLOCK(&V_tcbinfo);
+               if(INP_INFO_TRY_WLOCK(&V_tcbinfo)) {

                         TW_WLOCK(V_tw_lock);
                         if(tw_pcbrele(tw))
@@ -740,5 +740,9 @@ tcp_tw_2msl_scan(void)
                         INP_WLOCK(tw->tw_inpcb);
                         tcp_twclose(tw, 0);
                         INP_INFO_WUNLOCK(&V_tcbinfo);
+               } else {
+                       /* INP_INFO lock is busy, continue later */
+                       break;
+               }
         }
  }

  Basically, it changes tcp_tw_2msl_scan() purpose from:

  - Let's clean up all expired tw objects in a row.

  to:

  - Let's clean up expired tw objects as long as the INP_INFO lock is 
not busy.

  (See joined tw-lock-v2.patch).  And corresponding performance results 
being:  Maximum short-lived TCP connection rate without dropping a 
single packet:

  - FreeBSD 10.0-RELEASE:        40.0k
  - FreeBSD 10.0 + tw patch-v1:  52.8k (+32%)
  - FreeBSD 10.0 + tw patch-v2:  56.8k (+42%)

  Thus, a significant improvement and it makes the tcp_tw_2msl_scan() 
purpose clear.

  Moreover, it seems that two members of the struct tcptw are never 
used:  'tw_pspare' and 'tw_spare':

struct tcptw {
         [...]
         u_int           t_starttime;
         int             tw_time;
         TAILQ_ENTRY(tcptw) tw_2msl;
         void            *tw_pspare;     /* TCP_SIGNATURE */
         u_int           *tw_spare;      /* TCP_SIGNATURE */
         u_int           tw_refcount;    /* refcount */
  };

  I found no references of these members anywhere, even when setting the 
TCP_SIGNATURE kernel option the kernel builds and runs just fine.  As 
the tcptw struct is private to kernel and its size should be as small as 
possible, I would propose to remove them.

  Joined tw-clock-v2.patch which includes:

  - Use TRY_WLOCK                  (Performance improvements)
  - Fix comment                    (After John-Mark Gurney's review)
  - Two unused fields removed      (After Mike Bentkofsky's review)

--
Julien

--------------090708080906090808070008
Content-Type: text/plain; charset=UTF-8; x-mac-type="0"; x-mac-creator="0";
 name="tw-lock-v2.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="tw-lock-v2.patch"

diff --git a/sys/netinet/tcp_timer.c b/sys/netinet/tcp_timer.c
index bde7503..b45a9ea 100644
--- a/sys/netinet/tcp_timer.c
+++ b/sys/netinet/tcp_timer.c
@@ -144,9 +144,7 @@ tcp_slowtimo(void)
 	VNET_LIST_RLOCK_NOSLEEP();
 	VNET_FOREACH(vnet_iter) {
 		CURVNET_SET(vnet_iter);
-		INP_INFO_WLOCK(&V_tcbinfo);
-		(void) tcp_tw_2msl_scan(0);
-		INP_INFO_WUNLOCK(&V_tcbinfo);
+		tcp_tw_2msl_scan();
 		CURVNET_RESTORE();
 	}
 	VNET_LIST_RUNLOCK_NOSLEEP();
diff --git a/sys/netinet/tcp_timer.h b/sys/netinet/tcp_timer.h
index 3115fb3..c04723a 100644
--- a/sys/netinet/tcp_timer.h
+++ b/sys/netinet/tcp_timer.h
@@ -178,7 +178,8 @@ extern int tcp_fast_finwait2_recycle;
 void	tcp_timer_init(void);
 void	tcp_timer_2msl(void *xtp);
 struct tcptw *
-	tcp_tw_2msl_scan(int _reuse);		/* XXX temporary */
+	tcp_tw_2msl_reuse(void);	/* XXX temporary? */
+void	tcp_tw_2msl_scan(void);
 void	tcp_timer_keep(void *xtp);
 void	tcp_timer_persist(void *xtp);
 void	tcp_timer_rexmt(void *xtp);
diff --git a/sys/netinet/tcp_timewait.c b/sys/netinet/tcp_timewait.c
index 7e6128b..afaf5f9 100644
--- a/sys/netinet/tcp_timewait.c
+++ b/sys/netinet/tcp_timewait.c
@@ -49,6 +49,7 @@ __FBSDID("$FreeBSD$");
 #include <sys/socketvar.h>
 #include <sys/protosw.h>
 #include <sys/random.h>
+#include <sys/refcount.h>
 
 #include <vm/uma.h>
 
@@ -98,13 +99,59 @@ static int	maxtcptw;
  * The timed wait queue contains references to each of the TCP sessions
  * currently in the TIME_WAIT state.  The queue pointers, including the
  * queue pointers in each tcptw structure, are protected using the global
- * tcbinfo lock, which must be held over queue iteration and modification.
+ * timewait lock, which must be held over queue iteration and modification.
  */
 static VNET_DEFINE(TAILQ_HEAD(, tcptw), twq_2msl);
 #define	V_twq_2msl			VNET(twq_2msl)
 
-static void	tcp_tw_2msl_reset(struct tcptw *, int);
-static void	tcp_tw_2msl_stop(struct tcptw *);
+/* Global timewait lock */
+static VNET_DEFINE(struct rwlock, tw_lock);
+#define	V_tw_lock			VNET(tw_lock)
+
+#define TW_LOCK_INIT(tw, d)	rw_init_flags(&(tw), (d), 0)
+#define TW_LOCK_DESTROY(tw)	rw_destroy(&(tw))
+#define TW_RLOCK(tw)		rw_rlock(&(tw))
+#define TW_WLOCK(tw)		rw_wlock(&(tw))
+#define TW_RUNLOCK(tw)		rw_runlock(&(tw))
+#define TW_WUNLOCK(tw)		rw_wunlock(&(tw))
+#define TW_LOCK_ASSERT(tw)	rw_assert(&(tw), RA_LOCKED)
+#define TW_RLOCK_ASSERT(tw)	rw_assert(&(tw), RA_RLOCKED)
+#define TW_WLOCK_ASSERT(tw)	rw_assert(&(tw), RA_WLOCKED)
+#define TW_UNLOCK_ASSERT(tw)	rw_assert(&(tw), RA_UNLOCKED)
+
+/*
+ * tw_pcbref() bumps the reference count on an tw in order to maintain
+ * stability of an tw pointer despite the tw lock being released.
+ */
+static void
+tw_pcbref(struct tcptw *tw)
+{
+	KASSERT(tw->tw_refcount > 0, ("%s: refcount 0", __func__));
+	refcount_acquire(&tw->tw_refcount);
+}
+
+/*
+ * Drop a refcount on an tw elevated using tw_pcbref().  Return
+ * the tw lock released.
+ */
+static int
+tw_pcbrele(struct tcptw *tw)
+{
+	TW_WLOCK_ASSERT(V_tw_lock);
+	KASSERT(tw->tw_refcount > 0, ("%s: refcount 0", __func__));
+
+	if (!refcount_release(&tw->tw_refcount)) {
+		TW_WUNLOCK(V_tw_lock);
+		return (0);
+	}
+
+	uma_zfree(V_tcptw_zone, tw);
+	TW_WUNLOCK(V_tw_lock);
+	return (1);
+}
+
+static void	tcp_tw_2msl_reset(struct tcptw *, int ream);
+static void	tcp_tw_2msl_stop(struct tcptw *, int reuse);
 
 static int
 tcptw_auto_size(void)
@@ -171,6 +218,7 @@ tcp_tw_init(void)
 	else
 		uma_zone_set_max(V_tcptw_zone, maxtcptw);
 	TAILQ_INIT(&V_twq_2msl);
+	TW_LOCK_INIT(V_tw_lock, "tcptw");
 }
 
 #ifdef VIMAGE
@@ -179,11 +227,14 @@ tcp_tw_destroy(void)
 {
 	struct tcptw *tw;
 
-	INP_INFO_WLOCK(&V_tcbinfo);
-	while((tw = TAILQ_FIRST(&V_twq_2msl)) != NULL)
-		tcp_twclose(tw, 0);
-	INP_INFO_WUNLOCK(&V_tcbinfo);
+	TW_WLOCK(V_tw_lock);
+	while((tw = TAILQ_FIRST(&V_twq_2msl)) != NULL) {
+		tcp_twclose(tw, 0, 1);
+		TW_WLOCK(V_tw_lock);
+	}
+	TW_WUNLOCK(V_tw_lock);
 
+	TW_LOCK_DESTROY(V_tw_lock);
 	uma_zdestroy(V_tcptw_zone);
 }
 #endif
@@ -204,7 +255,7 @@ tcp_twstart(struct tcpcb *tp)
 	int isipv6 = inp->inp_inc.inc_flags & INC_ISIPV6;
 #endif
 
-	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);	/* tcp_tw_2msl_reset(). */
+	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	INP_WLOCK_ASSERT(inp);
 
 	if (V_nolocaltimewait) {
@@ -229,7 +280,7 @@ tcp_twstart(struct tcpcb *tp)
 
 	tw = uma_zalloc(V_tcptw_zone, M_NOWAIT);
 	if (tw == NULL) {
-		tw = tcp_tw_2msl_scan(1);
+		tw = tcp_tw_2msl_reuse();
 		if (tw == NULL) {
 			tp = tcp_close(tp);
 			if (tp != NULL)
@@ -238,6 +289,7 @@ tcp_twstart(struct tcpcb *tp)
 		}
 	}
 	tw->tw_inpcb = inp;
+	refcount_init(&tw->tw_refcount, 1);
 
 	/*
 	 * Recover last window size sent.
@@ -356,7 +408,6 @@ tcp_twcheck(struct inpcb *inp, struct tcpopt *to, struct tcphdr *th,
 	int thflags;
 	tcp_seq seq;
 
-	/* tcbinfo lock required for tcp_twclose(), tcp_tw_2msl_reset(). */
 	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	INP_WLOCK_ASSERT(inp);
 
@@ -458,11 +509,11 @@ tcp_twclose(struct tcptw *tw, int reuse)
 	inp = tw->tw_inpcb;
 	KASSERT((inp->inp_flags & INP_TIMEWAIT), ("tcp_twclose: !timewait"));
 	KASSERT(intotw(inp) == tw, ("tcp_twclose: inp_ppcb != tw"));
-	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);	/* tcp_tw_2msl_stop(). */
+	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	INP_WLOCK_ASSERT(inp);
 
 	tw->tw_inpcb = NULL;
-	tcp_tw_2msl_stop(tw);
+	tcp_tw_2msl_stop(tw, reuse);
 	inp->inp_ppcb = NULL;
 	in_pcbdrop(inp);
 
@@ -494,11 +545,6 @@ tcp_twclose(struct tcptw *tw, int reuse)
 	} else
 		in_pcbfree(inp);
 	TCPSTAT_INC(tcps_closed);
-	crfree(tw->tw_cred);
-	tw->tw_cred = NULL;
-	if (reuse)
-		return;
-	uma_zfree(V_tcptw_zone, tw);
 }
 
 int
@@ -616,34 +662,87 @@ tcp_tw_2msl_reset(struct tcptw *tw, int rearm)
 
 	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
 	INP_WLOCK_ASSERT(tw->tw_inpcb);
+
+	TW_WLOCK(V_tw_lock);
 	if (rearm)
 		TAILQ_REMOVE(&V_twq_2msl, tw, tw_2msl);
 	tw->tw_time = ticks + 2 * tcp_msl;
 	TAILQ_INSERT_TAIL(&V_twq_2msl, tw, tw_2msl);
+	TW_WUNLOCK(V_tw_lock);
 }
 
 static void
-tcp_tw_2msl_stop(struct tcptw *tw)
+tcp_tw_2msl_stop(struct tcptw *tw, int reuse)
 {
 
 	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+
+	TW_WLOCK(V_tw_lock);
 	TAILQ_REMOVE(&V_twq_2msl, tw, tw_2msl);
+	crfree(tw->tw_cred);
+	tw->tw_cred = NULL;
+
+	if (!reuse) {
+		tw_pcbrele(tw);
+		return;
+	}
+
+	TW_WUNLOCK(V_tw_lock);
 }
 
 struct tcptw *
-tcp_tw_2msl_scan(int reuse)
+tcp_tw_2msl_reuse(void)
 {
-	struct tcptw *tw;
 
 	INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+
+	struct tcptw *tw;
+
+	TW_WLOCK(V_tw_lock);
+	tw = TAILQ_FIRST(&V_twq_2msl);
+	if (tw == NULL) {
+		TW_WUNLOCK(V_tw_lock);
+		return NULL;
+	}
+	TW_WUNLOCK(V_tw_lock);
+
+	INP_WLOCK(tw->tw_inpcb);
+	tcp_twclose(tw, 1);
+
+	return (tw);
+}
+
+void
+tcp_tw_2msl_scan(void)
+{
+
+	struct tcptw *tw;
 	for (;;) {
+		TW_RLOCK(V_tw_lock);
 		tw = TAILQ_FIRST(&V_twq_2msl);
-		if (tw == NULL || (!reuse && (tw->tw_time - ticks) > 0))
+		if (tw == NULL || ((tw->tw_time - ticks) > 0)) {
+			TW_RUNLOCK(V_tw_lock);
+			break;
+		}
+		tw_pcbref(tw);
+		TW_RUNLOCK(V_tw_lock);
+
+		/* Close timewait state */
+		if(INP_INFO_TRY_WLOCK(&V_tcbinfo)) {
+
+			TW_WLOCK(V_tw_lock);
+			if(tw_pcbrele(tw))
+				continue;
+
+			KASSERT(tw->tw_inpcb != NULL,
+			    ("%s: tw->tw_inpcb == NULL", __func__));
+
+			INP_WLOCK(tw->tw_inpcb);
+			tcp_twclose(tw, 0);
+			INP_INFO_WUNLOCK(&V_tcbinfo);
+		} else {
+			/* INP_INFO lock is busy, continue later */
 			break;
-		INP_WLOCK(tw->tw_inpcb);
-		tcp_twclose(tw, reuse);
-		if (reuse)
-			return (tw);
+		}
 	}
-	return (NULL);
 }
diff --git a/sys/netinet/tcp_var.h b/sys/netinet/tcp_var.h
index e3197e5..f13c1f4 100644
--- a/sys/netinet/tcp_var.h
+++ b/sys/netinet/tcp_var.h
@@ -353,8 +353,7 @@ struct tcptw {
 	u_int		t_starttime;
 	int		tw_time;
 	TAILQ_ENTRY(tcptw) tw_2msl;
-	void		*tw_pspare;	/* TCP_SIGNATURE */
-	u_int		*tw_spare;	/* TCP_SIGNATURE */
+	u_int		tw_refcount;	/* refcount */
 };
 
 #define	intotcpcb(ip)	((struct tcpcb *)(ip)->inp_ppcb)

--------------090708080906090808070008--




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?5322F37E.5040700>