From: John Baldwin
To: net@freebsd.org
Subject: [PATCH] Add a new TCP_IGNOREIDLE socket option
Date: Tue, 22 Jan 2013 15:11:02 -0500
Message-Id: <201301221511.02496.jhb@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

As I mentioned in an earlier thread, I recently had to debug an issue we
were seeing across a link with a high bandwidth-delay product (both high
bandwidth and high RTT).  Our specific use case was to use a TCP
connection to reliably forward a latency-sensitive datagram stream across
a WAN connection.  We would often see spikes in the latency of individual
datagrams.  I eventually tracked this down to the connection entering
slow start when it would transmit data after being idle.
The data stream was quite bursty and would often attempt to transmit a
burst of data after being idle for far longer than a retransmit timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and
jacking the slow start window size up via a sysctl.  On 8.x this no
longer worked.  The solution I came up with was to add a new socket
option to disable idle handling completely.  That is, when an idle
connection restarts with this new option enabled, it keeps its current
congestion window and doesn't enter slow start.

There are only a few cases where such an option is useful, but if anyone
else thinks this might be useful I'd be happy to add the option to
FreeBSD.

Index: share/man/man4/tcp.4
===================================================================
--- share/man/man4/tcp.4 (revision 245742)
+++ share/man/man4/tcp.4 (working copy)
@@ -205,6 +205,18 @@
 in the
 .Sx MIB Variables
 section further down.
+.It Dv TCP_IGNOREIDLE
+If a TCP connection is idle for more than one retransmit timeout,
+it enters slow start when new data is available to transmit.
+This avoids flooding the network with a full window of traffic at line rate.
+It also allows the connection to adjust to changes to network conditions
+that occurred while the connection was idle.  A connection that sends
+bursts of data separated by large idle periods can be permanently stuck in
+slow start as a result.
+The boolean option
+.Dv TCP_IGNOREIDLE
+disables the idle connection handling allowing connections to maintain the
+existing congestion window when restarting after an idle period.
 .It Dv TCP_NODELAY
 Under most circumstances,
 .Tn TCP
Index: sys/netinet/tcp_var.h
===================================================================
--- sys/netinet/tcp_var.h (revision 245742)
+++ sys/netinet/tcp_var.h (working copy)
@@ -230,6 +230,7 @@
 #define TF_NEEDFIN 0x000800 /* send FIN (implicit state) */
 #define TF_NOPUSH 0x001000 /* don't push */
 #define TF_PREVVALID 0x002000 /* saved values for bad rxmit valid */
+#define TF_IGNOREIDLE 0x004000 /* connection is never idle */
 #define TF_MORETOCOME 0x010000 /* More data to be appended to sock */
 #define TF_LQ_OVERFLOW 0x020000 /* listen queue overflow */
 #define TF_LASTIDLE 0x040000 /* connection was previously idle */
Index: sys/netinet/tcp_output.c
===================================================================
--- sys/netinet/tcp_output.c (revision 245742)
+++ sys/netinet/tcp_output.c (working copy)
@@ -206,7 +206,8 @@
 	 * to send, then transmit; otherwise, investigate further.
 	 */
 	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
-	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
+	if (!(tp->t_flags & TF_IGNOREIDLE) &&
+	    idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 		cc_after_idle(tp);
 	tp->t_flags &= ~TF_LASTIDLE;
 	if (idle) {
Index: sys/netinet/tcp.h
===================================================================
--- sys/netinet/tcp.h (revision 245823)
+++ sys/netinet/tcp.h (working copy)
@@ -156,6 +156,7 @@
 #define TCP_NODELAY 1 /* don't delay send to coalesce packets */
 #if __BSD_VISIBLE
 #define TCP_MAXSEG 2 /* set maximum segment size */
+#define TCP_IGNOREIDLE 3 /* disable idle connection handling */
 #define TCP_NOPUSH 4 /* don't push last block of write */
 #define TCP_NOOPT 8 /* don't use TCP options */
 #define TCP_MD5SIG 16 /* use MD5 digests (RFC2385) */
Index: sys/netinet/tcp_usrreq.c
===================================================================
--- sys/netinet/tcp_usrreq.c (revision 245742)
+++ sys/netinet/tcp_usrreq.c (working copy)
@@ -1354,6 +1354,7 @@
 
 		case TCP_NODELAY:
 		case TCP_NOOPT:
+		case TCP_IGNOREIDLE:
 			INP_WUNLOCK(inp);
 			error = sooptcopyin(sopt, &optval, sizeof optval,
 			    sizeof optval);
@@ -1368,6 +1369,9 @@
 			case TCP_NOOPT:
 				opt = TF_NOOPT;
 				break;
+			case TCP_IGNOREIDLE:
+				opt = TF_IGNOREIDLE;
+				break;
 			default:
 				opt = 0; /* dead code to fool gcc */
 				break;
@@ -1578,6 +1582,11 @@
 			INP_WUNLOCK(inp);
 			error = sooptcopyout(sopt, buf, TCP_CA_NAME_MAX);
 			break;
+		case TCP_IGNOREIDLE:
+			optval = tp->t_flags & TF_IGNOREIDLE;
+			INP_WUNLOCK(inp);
+			error = sooptcopyout(sopt, &optval, sizeof optval);
+			break;
 		default:
 			INP_WUNLOCK(inp);
 			error = ENOPROTOOPT;

-- 
John Baldwin
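
For anyone who wants to try the option from user space, here is a minimal
sketch of how an application might set and query it.  It assumes a kernel
and <netinet/tcp.h> built with the patch above; TCP_IGNOREIDLE is not
defined in unpatched headers, so the sketch falls back to reporting
ENOPROTOOPT there.  The helper names enable_ignore_idle() and
ignore_idle_enabled() are made up for illustration and are not part of
the patch.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Hypothetical helper: ask the kernel to keep the current congestion
 * window across idle periods on an existing TCP socket.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
enable_ignore_idle(int fd)
{
#ifdef TCP_IGNOREIDLE
	int on = 1;

	/* Boolean option, set the same way as TCP_NODELAY. */
	return (setsockopt(fd, IPPROTO_TCP, TCP_IGNOREIDLE, &on, sizeof(on)));
#else
	/* Headers without the patch: report the option as unsupported. */
	(void)fd;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}

/*
 * Hypothetical helper: returns 1 if the option is set, 0 if it is clear,
 * and -1 with errno set on failure.
 */
int
ignore_idle_enabled(int fd)
{
#ifdef TCP_IGNOREIDLE
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_IGNOREIDLE, &val, &len) == -1)
		return (-1);
	/*
	 * The getsockopt path in the patch copies out the raw
	 * TF_IGNOREIDLE flag bit, so test for nonzero rather than 1.
	 */
	return (val != 0);
#else
	(void)fd;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

A caller would typically invoke enable_ignore_idle(fd) once, right after
socket() or connect(), and treat a -1 return as "option not available"
rather than as a fatal error, since the connection still works with the
default idle handling.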