From: Randall Stewart
Date: Wed, 6 Feb 2013 06:32:28 -0500
To: John Baldwin
Cc: Sepherosa Ziehau, freebsd-net@freebsd.org, Bjoern Zeeb
Subject: Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
In-Reply-To: <201301241114.40734.jhb@freebsd.org>
References: <201301221511.02496.jhb@freebsd.org> <5100EAD3.2090006@networx.ch> <201301241114.40734.jhb@freebsd.org>

John:

In-line

On Jan 24, 2013, at 11:14 AM, John Baldwin wrote:

> On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:
>> On 24.01.2013 03:31, Sepherosa Ziehau wrote:
>>> On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin wrote:
>>>> On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
>>>>> On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin wrote:
>>>>>> As I mentioned in an earlier thread, I recently had to debug an issue we were
>>>>>> seeing across a
>>>>>> link with a high bandwidth-delay product (both high bandwidth
>>>>>> and high RTT). Our specific use case was to use a TCP connection to
>>>>>> reliably forward a latency-sensitive datagram stream across a WAN
>>>>>> connection. We would often see spikes in the latency of individual
>>>>>> datagrams. I eventually tracked this down to the connection entering
>>>>>> slow start when it would transmit data after being idle. The data
>>>>>> stream was quite bursty and would often attempt to transmit a burst
>>>>>> of data after being idle for far longer than a retransmit timeout.
>>>>>>
>>>>>> In 7.x we had worked around this in the past by disabling RFC 3390
>>>>>> and jacking the slow start window size up via a sysctl. On 8.x this
>>>>>> no longer worked. The solution I came up with was to add a new socket
>>>>>> option to disable idle handling completely. That is, when an idle
>>>>>> connection restarts with this new option enabled, it keeps its
>>>>>> current congestion window and doesn't enter slow start.
>>>>>>
>>>>>> There are only a few cases where such an option is useful, but if
>>>>>> anyone else thinks this might be useful I'd be happy to add the
>>>>>> option to FreeBSD.
>>>>>
>>>>> I think what you need is RFC 2861; however, you should probably
>>>>> ignore the "application-limited period" part of RFC 2861.
>>>>
>>>> Hummm. It appears btw, that Linux uses RFC 2861, but has a global
>>>> knob to disable it due to applications having problems. When it is
>>>> disabled, it doesn't decay the congestion window at all during idle
>>>> handling. That is, it appears to act the same as if TCP_IGNOREIDLE
>>>> were enabled.
>>>>
>>>> From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:
>>>>
>>>>     tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
>>>>         If enabled, provide RFC 2861 behavior and time out the
>>>>         congestion window after an idle period.
>>>>         An idle period is defined as the current RTO (retransmission
>>>>         timeout). If disabled, the congestion window will not be
>>>>         timed out after an idle period.
>>>>
>>>> Also, in this thread on tcpm it appears no one on that list realizes
>>>> that there are any implementations which follow the "SHOULD" in
>>>> RFC 2581 for idle handling (which is what we do currently):
>>>
>>> Nah, I don't think the idle detection in FreeBSD follows
>>> RFC 2581/RFC 5681 section 4.1 (the paragraph before the "SHOULD").
>>> IMHO, that's probably why the author of the following email asked
>>> about the implementation of the "SHOULD" in RFC 2581/RFC 5681.
>>>
>>>>
>>>> http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html
>>>>
>>>> So if we were to implement RFC 2861, the new socket option would be
>>>> equivalent to setting Linux's 'tcp_slow_start_after_idle' to false,
>>>> but on a per-socket basis rather than globally.
>>>
>>> Agreed, a per-socket option could be more useful than global sysctls
>>> in certain situations. However, in addition to the per-socket option,
>>> could global sysctl nodes to disable idle_restart/idle_cwv help too?
>>
>> No. This is far too dangerous once it makes it into some tuning guide.
>> The threat of congestion breakdown is real. The Internet, or any packet
>> network, can only survive in the long term if almost all follow the
>> rules and self-constrain to remain fair to the others. What would
>> happen if nobody would respect the traffic lights anymore?
>
> The problem with this argument is Linux has already had this as a
> tunable option for years and the Internet hasn't melted as a result.

Just because Linux behaves badly does *not* mean that we have to. They
also made BIC CC the default, and that makes things even worse for users
than RFC 2581 does in the buffer-bloat sense.
The buffer-bloat problems reported by Jim Gettys would not have been
nearly as bad (they still would have existed) if he had been using
standard RFC 2581 CC.

There are much better (and safer) ways to handle this type of network.

Putting this in is not a good idea IMO.

>
>> Besides that, bursting into unknown network conditions is very likely
>> to result in burst losses as well. TCP isn't good at recovering from
>> it. In the end you most likely come out ahead if you decay the
>> restartCWND.
>>
>> We have two cases primarily: a) long distance, medium to high RTT, and
>> wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
>> RTT and mostly plenty of bandwidth (a.k.a. Datacenter). The former
>> absolutely definitely requires a decayed restartCWND. The latter less
>> so, but even there bursting at 10Gig TSO-assisted wirespeed isn't
>> going to end too happily more often than not.
>
> You forgot my case: c) dedicated long distance links with high
> bandwidth.

And it may help a little, but you are *far* more likely, depending on
what is going on in that link, to overflow your router queues, hurting
that flow even more.

R

>
>> Since this seems to be a burning issue I'll come up with a patch in
>> the next days to add a decaying restartCWND that'll be fair and allow
>> a very quick ramp up if no loss occurs.
>
> I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE
> option is useful both with and without a decaying restartCWND?
>
> --
> John Baldwin
>

------------------------------
Randall Stewart
803-317-4952 (cell)
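
For anyone following the thread, here is a rough sketch of the three idle-restart behaviours being argued about. It is only an illustration under stated assumptions, not the FreeBSD or Linux code. The function names and policy labels are made up for this example. The "decay" branch halves cwnd once per full RTO of idle time in the spirit of RFC 2861, with an assumed floor of one MSS. The initial window uses the RFC 3390 formula.

```python
def initial_window(mss):
    """Initial window in bytes per RFC 3390: min(4*MSS, max(2*MSS, 4380))."""
    return min(4 * mss, max(2 * mss, 4380))


def restart_cwnd(cwnd, idle_ms, rto_ms, mss, policy):
    """Congestion window (bytes) to use for the first send after an idle period.

    `policy` is a hypothetical label for this sketch, not a kernel identifier:
      "rfc5681" - collapse to the restart window min(IW, cwnd)
      "decay"   - halve cwnd once per full RTO spent idle (RFC 2861 style)
      "ignore"  - keep cwnd untouched (what TCP_IGNOREIDLE proposes)
    """
    # Not idle for more than one RTO, or idle handling disabled: no change.
    if idle_ms <= rto_ms or policy == "ignore":
        return cwnd
    if policy == "rfc5681":
        return min(initial_window(mss), cwnd)
    if policy == "decay":
        for _ in range(idle_ms // rto_ms):
            cwnd = max(cwnd // 2, mss)  # assumed floor of one MSS
        return cwnd
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    # 100 segments of cwnd, idle for 1 s with a 200 ms RTO.
    for policy in ("rfc5681", "decay", "ignore"):
        print(policy, restart_cwnd(146000, 1000, 200, 1460, policy))
```

With these example numbers the "ignore" policy sends the full 146000-byte window into the path at once, while the other two restart from a few segments. That gap is the burst-loss and queue-overflow risk discussed above.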