From owner-freebsd-net@FreeBSD.ORG  Wed Aug 14 07:23:06 2013
Message-ID: <520B3056.1000804@freebsd.org>
Date: Wed, 14 Aug 2013 17:23:02 +1000
From: Lawrence Stewart <lstewart@freebsd.org>
To: Julian Elischer <julian@freebsd.org>
Subject: Re: TSO and FreeBSD vs Linux
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
 <520B24A0.4000706@freebsd.org>
In-Reply-To: <520B24A0.4000706@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>

On 08/14/13 16:33, Julian Elischer wrote:
> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>> On 08/14/13 03:29, Julian Elischer wrote:
>>> I have been tracking down a performance embarrassment on AMAZON EC2 and
>>> have found it I think.
>> Let us please avoid conflating performance with throughput. The
>> behaviour you go on to describe as a performance embarrassment is
>> actually a throughput difference, and the FreeBSD behaviour you're
>> describing is essentially sacrificing throughput and CPU cycles for
>> lower latency. That may not be a trade-off you like, but it is an
>> important factor in this discussion.
> It was an embarrassment in that in one class of test we performed very
> poorly. It was not a disaster or a show-stopper, but for our product it
> is a critical number.

Sure, there's nothing wrong with holding throughput up as a key
performance metric for your use case.

I'm just trying to pre-empt a discussion that focuses on one metric and
fails to consider the bigger picture.

> It is a throughput difference, as you say, but that is a very important
> part of performance...
> The latency of Linux didn't seem to be any worse
> than FreeBSD, just the throughput was a lot higher in the same scenario.

The latency must increase when you delay packets in order to coalesce
them. Whether you were able to perceive that or not with the bulk send
type testing and measurement that you're doing is a separate issue.
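
To put a rough number on that, here is a minimal sketch. It assumes a
hypothetical 10 Gb/s link, 1500-byte frames (ignoring preamble and
inter-frame gap) and a coalescing scheme that holds the first segment
until the whole batch has arrived. None of these figures come from the
traces in this thread, and the function name is made up for illustration.

```python
def added_latency_ns(segments: int, frame_bytes: int = 1500,
                     link_bps: int = 10_000_000_000) -> int:
    """Extra wait (in ns) imposed on the first segment of a coalesced batch.

    Assumption: segments arrive back-to-back at line rate and the first one
    is held until the last one of the batch has arrived.
    """
    # Time to serialise one frame on the wire, in nanoseconds.
    frame_time_ns = frame_bytes * 8 * 1_000_000_000 // link_bps
    # The first segment waits for the remaining (segments - 1) frames.
    return (segments - 1) * frame_time_ns


for n in (1, 2, 44):
    print(n, "segments ->", added_latency_ns(n), "ns added to the first segment")
```

Under those assumptions a roughly 64k batch (44 segments) adds tens of
microseconds, which a bulk-transfer measurement is unlikely to show.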

>> Don't fall into the trap of labelling Linux's propensity for maximising
>> throughput as superior to an alternative approach which strikes a
>> different balance. It all depends on the use case.
> well the Linux balance seems to be "be better all around" at this moment
> so that is embarrassing. :-)

Better all round for you? Seems to be the case. Better all round for
everyone? No. Linux's choice of CUBIC as the default congestion control
algorithm is also a choice that maximises throughput but has side
effects which IMO are quite unfortunate.

> I could see no latency reversion.

You wouldn't because it would be practically invisible in the sorts of
tests/measurements you're doing. Our good friends over at HRT on the
other hand would be far more likely to care about latency on the order
of microseconds. Again, the use case matters a lot.

[snip]
>>> Notice that this behaviour in Linux seems to be modal.. it seems to
>>> 'switch on' a little bit into the 'starting' trace.
>>>
>>> In addition, you can see also that Linux gets going faster even in the
>>> beginning where TSO isn't in play, by sending a lot more packets
>>> up-front. (of course the wisdom of this can be argued).
>> They switched to using an initial window of 10 segments some time ago.
>> FreeBSD starts with 3 or, more recently, 10 if you're running recent
>> 9-STABLE or 10-CURRENT.
> I tried setting initial values as shown:
>   net.inet.tcp.local_slowstart_flightsize: 10
>   net.inet.tcp.slowstart_flightsize: 10
> it didn't seem to make too much difference but I will redo the test.

Assuming this is still FreeBSD 8.0 as you mentioned out-of-band,
changing those variables will have no effect unless you also disable
RFC 3390 support (net.inet.tcp.rfc3390=0).
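
As a rough illustration of why, here is a sketch of the initial window
selection as I understand it. The RFC 3390 formula is taken from the RFC;
the branch structure and the function names are my simplification and are
not copied from the stack.

```python
def rfc3390_initial_window(mss: int) -> int:
    """Initial congestion window in bytes, per the RFC 3390 formula."""
    return min(4 * mss, max(2 * mss, 4380))


def initial_window(mss: int, rfc3390_enabled: bool, flightsize: int) -> int:
    """Simplified model (an assumption, not the kernel code): the
    slowstart_flightsize value is consulted only when RFC 3390 is off."""
    if rfc3390_enabled:
        return rfc3390_initial_window(mss)
    return flightsize * mss


mss = 1460  # typical Ethernet MSS, assumed for the example
for enabled in (True, False):
    iw = initial_window(mss, enabled, flightsize=10)
    print("rfc3390 on" if enabled else "rfc3390 off", iw, "bytes =", iw // mss, "segments")
```

With a 1460-byte MSS the RFC 3390 branch yields 3 segments regardless of
the flightsize setting, which matches the "starts with 3" behaviour
mentioned above.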

>>> Has anyone done any work on aggregating ACKs, or delaying responding to
>>> them?
>> As noted by Navdeep, we already have the code to aggregate ACKs in our
>> software LRO implementation. The bigger problem is that appropriate byte
>> counting places a default 2*MSS limit on the amount of ACKed data the
>> window can grow by i.e. if an ACK for 64k of data comes up the stack,
>> we'll grow the window by 2 segments worth of data in response. That
>> needs to be addressed - we could send the ACK count up with the
>> aggregated single ACK or just ignore abc_l_var when LRO is in use for a
>> connection.

BTW, as a workaround for the appropriate byte counting issue, you can
crank net.inet.tcp.abc_l_var up to the number of MSS-sized segments you
wish to allow the window to grow by per ACK. If LRO is aggregating
64 on-wire ACKs into 1 mega ACK, you should set abc_l_var=64.
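
A minimal sketch of the effect, using the slow-start rule from RFC 3465
(cwnd grows by min(bytes ACKed, L*MSS) per ACK). It assumes a 1460-byte
MSS and that each on-wire ACK covers exactly one segment; the function
name is hypothetical and this is not the stack's code.

```python
def abc_slow_start_increase(bytes_acked: int, mss: int, abc_l_var: int) -> int:
    """Bytes added to cwnd for one ACK during slow start (RFC 3465 rule)."""
    return min(bytes_acked, abc_l_var * mss)


MSS = 1460          # assumed MSS
ACKS = 64           # on-wire ACKs, each assumed to cover one MSS

# Without LRO: 64 separate ACKs, each processed on its own.
separate = sum(abc_slow_start_increase(MSS, MSS, 2) for _ in range(ACKS))

# With LRO: one aggregated ACK covering all 64 segments.
aggregated_l2 = abc_slow_start_increase(ACKS * MSS, MSS, 2)
aggregated_l64 = abc_slow_start_increase(ACKS * MSS, MSS, 64)

print("64 separate ACKs, L=2: ", separate, "bytes")
print("1 aggregated ACK, L=2: ", aggregated_l2, "bytes")
print("1 aggregated ACK, L=64:", aggregated_l64, "bytes")
```

Under these assumptions the aggregated ACK with the default limit grows
the window by 2 segments where the 64 separate ACKs would have grown it by
64, and raising the limit to 64 restores the same growth.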

> so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
> see this?

I think (check the driver code in question as I'm not sure) that if you
"ifconfig <if> lro" and the driver has hardware support or has been made
aware of our software implementation, it should DTRT.

Cheers,
Lawrence