From owner-freebsd-net@FreeBSD.ORG  Thu Jan 24 21:10:58 2013
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: freebsd-net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by hub.freebsd.org (Postfix) with ESMTP id 9B1C0F34;
 Thu, 24 Jan 2013 21:10:58 +0000 (UTC) (envelope-from bright@mu.org)
Received: from elvis.mu.org (elvis.mu.org [192.203.228.196])
 by mx1.freebsd.org (Postfix) with ESMTP id 7C1E7743;
 Thu, 24 Jan 2013 21:10:58 +0000 (UTC)
Received: from Alfreds-MacBook-Pro-9.local (207.110.29.135.ptr.us.xo.net
 [207.110.29.135])
 by elvis.mu.org (Postfix) with ESMTPSA id EAF521A3C77;
 Thu, 24 Jan 2013 13:10:51 -0800 (PST)
Message-ID: <5101A35B.2060104@mu.org>
Date: Thu, 24 Jan 2013 16:10:51 -0500
From: Alfred Perlstein <bright@mu.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7;
 rv:17.0) Gecko/20130107 Thunderbird/17.0.2
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
Subject: Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
References: <201301221511.02496.jhb@freebsd.org>
 <CAMOc5cwhEEpZn0AM2hiXjpQYujLu+nZAb+p+=USaE5JsQs6LLQ@mail.gmail.com>
 <5100EAD3.2090006@networx.ch> <201301241114.40734.jhb@freebsd.org>
In-Reply-To: <201301241114.40734.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Sepherosa Ziehau <sepherosa@gmail.com>, freebsd-net@freebsd.org,
 Bjoern Zeeb <bz@freebsd.org>
X-BeenThere: freebsd-net@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Networking and TCP/IP with FreeBSD <freebsd-net.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-net>
List-Post: <mailto:freebsd-net@freebsd.org>
List-Help: <mailto:freebsd-net-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-net>,
 <mailto:freebsd-net-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 24 Jan 2013 21:10:58 -0000

On 1/24/13 11:14 AM, John Baldwin wrote:
> On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:
>> On 24.01.2013 03:31, Sepherosa Ziehau wrote:
>>> On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin <jhb@freebsd.org> wrote:
>>>> On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
>>>>> On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin <jhb@freebsd.org> wrote:
>>>>>> As I mentioned in an earlier thread, I recently had to debug an issue we were
>>>>>> seeing across a link with a high bandwidth-delay product (both high bandwidth
>>>>>> and high RTT).  Our specific use case was to use a TCP connection to reliably
>>>>>> forward a latency-sensitive datagram stream across a WAN connection.  We would
>>>>>> often see spikes in the latency of individual datagrams.  I eventually tracked
>>>>>> this down to the connection entering slow start when it would transmit data
>>>>>> after being idle.  The data stream was quite bursty and would often attempt to
>>>>>> transmit a burst of data after being idle for far longer than a retransmit
>>>>>> timeout.
>>>>>>
>>>>>> In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
>>>>>> the slow start window size up via a sysctl.  On 8.x this no longer worked.
>>>>>> The solution I came up with was to add a new socket option to disable idle
>>>>>> handling completely.  That is, when an idle connection restarts with this new
>>>>>> option enabled, it keeps its current congestion window and doesn't enter slow
>>>>>> start.
>>>>>>
>>>>>> There are only a few cases where such an option is useful, but if anyone else
>>>>>> thinks this might be useful I'd be happy to add the option to FreeBSD.
>>>>> I think what you need is the RFC2861, however, you probably should
>>>>> ignore the "application-limited period" part of RFC2861.
>>>> Hummm.  It appears btw, that Linux uses RFC 2861, but has a global knob to
>>>> disable it due to applications having problems.  When it is disabled,
>>>> it doesn't decay the congestion window at all during idle handling.  That is,
>>>> it appears to act the same as if TCP_IGNOREIDLE were enabled.
>>>>
>>>>   From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:
>>>>
>>>>          tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
>>>>                 If enabled, provide RFC 2861 behavior and time out the congestion
>>>>                 window after an idle period.  An idle period is defined as the current
>>>>                 RTO (retransmission timeout).  If disabled, the congestion window will
>>>>                 not be timed out after an idle period.
>>>>
>>>> Also, in this thread on tcpm it appears no one on that list realizes that
>>>> there are any implementations which follow the "SHOULD" in RFC 2581 for idle
>>>> handling (which is what we do currently):
>>> Nah, I don't think the idle detection in FreeBSD follows the
>>> RFC2581/RFC5681 4.1 (the paragraph before the "SHOULD").  IMHO, that's
>>> probably why the author in the following email asked about the
>>> implementation of "SHOULD" in RFC2581/RFC5681.
>>>
>>>> http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html
>>>>
>>>> So if we were to implement RFC 2861, the new socket option would be equivalent
>>>> to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
>>>> basis rather than globally.
>>> Agreed, a per-socket option could be more useful than global sysctls in
>>> certain situations.  However, in addition to the per-socket option,
>>> could global sysctl nodes to disable idle_restart/idle_cwv help too?
>> No.  This is far too dangerous once it makes it into some tuning guide.
>> The threat of congestion breakdown is real.  The Internet, or any packet
>> network, can only survive in the long term if almost all follow the rules
>> and self-constrain to remain fair to the others.  What would happen if
>> nobody would respect the traffic lights anymore?
> The problem with this argument is Linux has already had this as a tunable
> option for years and the Internet hasn't melted as a result.
>   
>> Besides that bursting into unknown network conditions is very likely to
>> result in burst losses as well.  TCP isn't good at recovering from it.
>> In the end you most likely come out ahead if you decay the restartCWND.
>>
>> We have two cases primarily: a) long distance, medium to high RTT, and
>> wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
>> RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
>> absolutely definitely requires a decayed restartCWND.  The latter less
>> so but even there bursting at 10Gig TSO assisted wirespeed isn't going
>> to end too happy more often than not.
> You forgot my case: c) dedicated long distance links with high bandwidth.
>
>> Since this seems to be a burning issue I'll come up with a patch in the
>> next few days to add a decaying restartCWND that'll be fair and allow a very
>> quick ramp up if no loss occurs.
> I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
> is useful both with and without a decaying restartCWND?
>
Linux seems to have been doing just fine with it for a long while.
Can we get this committed?

-Alfred
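
For readers comparing the three idle-restart behaviors discussed in this
thread, here is a rough model, not kernel code. It is a sketch under
stated assumptions: the function name restart_cwnd, the policy labels, and
the use of whole segments are all made up for illustration, and the
"rfc2861" branch floors the decay at the restart window, which is a
simplification rather than a faithful copy of any implementation.

```python
def restart_cwnd(cwnd, idle, rto, restart_window, policy):
    """Return the congestion window (in segments) to use when a
    connection resumes sending after `idle` seconds of silence.

    All names here are hypothetical; this only models the policies
    discussed in the thread.
    """
    # Not idle for longer than one RTO: no policy touches cwnd.
    if idle <= rto:
        return cwnd

    if policy == "ignore_idle":
        # What the proposed TCP_IGNOREIDLE option is described as doing:
        # keep the current window and skip slow start.
        return cwnd

    if policy == "rfc5681":
        # The "SHOULD" in RFC 2581/5681 section 4.1: cwnd is set to
        # no more than the restart window before sending resumes.
        return min(cwnd, restart_window)

    if policy == "rfc2861":
        # Decay: halve cwnd once per RTO of idle time.
        # Assumption: never decay below the restart window.
        decayed = cwnd
        for _ in range(int(idle // rto)):
            decayed = max(decayed // 2, restart_window)
        return min(cwnd, decayed)

    raise ValueError("unknown policy: %r" % (policy,))


if __name__ == "__main__":
    for policy in ("ignore_idle", "rfc5681", "rfc2861"):
        print(policy, restart_cwnd(64, 3.0, 1.0, 4, policy))
```

With a 64-segment window, a 1 second RTO, and 3 seconds of idle time, the
model keeps all 64 segments under "ignore_idle", drops to the 4-segment
restart window under "rfc5681", and lands in between under "rfc2861".
That gap is what the bursty-sender case in this thread is about.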