From owner-freebsd-net@FreeBSD.ORG Wed Aug 14 06:33:15 2013
Message-ID: <520B24A0.4000706@freebsd.org>
Date: Wed, 14 Aug 2013 14:33:04 +0800
From: Julian Elischer <julian@freebsd.org>
To: Lawrence Stewart
Cc: FreeBSD Net
Subject: Re: TSO and FreeBSD vs Linux
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
In-Reply-To: <520AFBE8.1090109@freebsd.org>

On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> On 08/14/13 03:29, Julian Elischer wrote:
>> I have been tracking down a performance embarrassment on Amazon EC2
>> and I think I have found it.
>
> Let us please avoid conflating performance with throughput. The
> behaviour you go on to describe as a performance embarrassment is
> actually a throughput difference, and the FreeBSD behaviour you're
> describing is essentially sacrificing throughput and CPU cycles for
> lower latency. That may not be a trade-off you like, but it is an
> important factor in this discussion.

It was an embarrassment in that in one class of test we performed very
poorly. It was not a disaster or a show-stopper, but for our product it
is a critical number. It is a throughput difference, as you say, but
throughput is a very important part of performance. Linux's latency
didn't seem to be any worse than FreeBSD's; its throughput was just a
lot higher in the same scenario.

> Don't fall into the trap of labelling Linux's propensity for
> maximising throughput as superior to an alternative approach which
> strikes a different balance. It all depends on the use case.

Well, the Linux balance seems to be "better all around" at the moment,
so that is embarrassing. :-) I could see no latency regression.

>> Our OS cousins over in Linux land have implemented some interesting
>> behaviour when TSO is in use.
>>
>> They seem to aggregate ACKs when there is a lot of traffic so that
>> they can create the largest possible TSO packet. We, on the other
>> hand, respond to each and every returning ACK as it arrives, and
>> thus generally fall into the behaviour of sending a bunch of small
>> packets, each sized to match a single incoming ACK.

> There's a thing controlled by ethtool called GRO (generic receive
> offload) which appears to be enabled by default on at least Ubuntu
> and I guess other Linuxes too. It's responsible for aggregating ACKs
> and data to batch them up the stack if the driver doesn't provide a
> hardware offload implementation. Try rerunning your experiments with
> the ACK batching disabled on the Linux host to get an additional
> comparison point.

I will try that as soon as I get back to the machines in question.
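If I understand ethtool correctly, something like the following should
show and then disable GRO for that test (eth0 here is just a stand-in
for whatever the real interface on those machines is called):

  # ethtool -k eth0 | grep generic-receive-offload   # query current state
  # ethtool -K eth0 gro off                          # disable GRO
  # ethtool -K eth0 gro on                           # re-enable afterwards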
>> For two examples look at:
>>
>> http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
>> and
>> http://www.freebsd.org/~julian/LvsF-tcp.tiff
>>
>> In each case, FreeBSD is on the left and Linux is on the right.
>>
>> The first trace shows the sessions as they start, and the second
>> shows a point some distance later (where the sequence numbers wrap
>> around; no particular reason to use that point, it was just fun to
>> see). In both traces you can see that each Linux packet (white),
>> once things have got going, is responding to multiple bumps in the
>> send window sequence number (green and yellow lines, representing
>> the arrival of several ACKs), while FreeBSD produces a whole bunch
>> of smaller packets, slavishly following exactly the size of each
>> incoming ACK. This gives us quite a performance debt.
>
> Again, please s/performance/what-you-really-mean/ here.

OK: in my tests this makes FreeBSD data transfers much slower, by as
much as 60%.

>> Notice that this behaviour in Linux seems to be modal: it seems to
>> 'switch on' a little way into the 'starting' trace.
>>
>> In addition, you can see that Linux also gets going faster at the
>> very beginning, where TSO isn't yet in play, by sending a lot more
>> packets up-front (the wisdom of which can of course be argued).
>
> They switched to using an initial window of 10 segments some time
> ago. FreeBSD starts with 3, or more recently 10 if you're running
> recent 9-STABLE or 10-CURRENT.

I tried setting the initial values as shown:

  net.inet.tcp.local_slowstart_flightsize: 10
  net.inet.tcp.slowstart_flightsize: 10

It didn't seem to make much difference, but I will redo the test.

>> Has anyone done any work on aggregating ACKs, or delaying responding
>> to them?
>
> As noted by Navdeep, we already have the code to aggregate ACKs in
> our software LRO implementation. The bigger problem is that
> appropriate byte counting places a default 2*MSS limit on the amount
> of ACKed data the window can grow by, i.e. if an ACK for 64k of data
> comes up the stack, we'll grow the window by 2 segments' worth of
> data in response. That needs to be addressed - we could send the ACK
> count up with the aggregated single ACK, or just ignore abc_l_var
> when LRO is in use for a connection.

So, does "software LRO" mean that LRO on the NIC should be ON or OFF
to see this?

> Cheers,
> Lawrence
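For anyone else who wants to poke at this while I re-run the tests: if
I have the knobs right, LRO can be toggled per interface with ifconfig
(em0 below is just an example; the driver has to advertise the
capability), and the appropriate-byte-counting limit Lawrence refers
to is the net.inet.tcp.abc_l_var sysctl:

  # ifconfig em0 -lro               # disable LRO on the interface
  # ifconfig em0 lro                # re-enable it
  # sysctl net.inet.tcp.abc_l_var   # show the current ABC limit (default 2)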