From owner-freebsd-net@FreeBSD.ORG  Fri Aug 16 08:54:43 2013
Message-ID: <520DE8C5.8070508@freebsd.org>
Date: Fri, 16 Aug 2013 16:54:29 +0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8;
 rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: Luigi Rizzo <rizzo@iet.unipi.it>
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
 <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org>
 <20130814102109.GA63246@onelab2.iet.unipi.it>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Lawrence Stewart <lstewart@freebsd.org>, FreeBSD Net <net@freebsd.org>

On 8/14/13 6:21 PM, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
>> On 08/14/13 16:33, Julian Elischer wrote:
>>> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>>>> On 08/14/13 03:29, Julian Elischer wrote:
>>>>> I have been tracking down a performance embarrassment on Amazon EC2 and
>>>>> have found it, I think.
>>>> Let us please avoid conflating performance with throughput. The
>>>> behaviour you go on to describe as a performance embarrassment is
>>>> actually a throughput difference, and the FreeBSD behaviour you're
>>>> describing is essentially sacrificing throughput and CPU cycles for
>>>> lower latency. That may not be a trade-off you like, but it is an
>>>> important factor in this discussion.
> ...
>> Sure, there's nothing wrong with holding throughput up as a key
>> performance metric for your use case.
>>
>> I'm just trying to pre-empt a discussion that focuses on one metric and
>> fails to consider the bigger picture.
> ...
>>> I could see no latency reversion.
>> You wouldn't because it would be practically invisible in the sorts of
>> tests/measurements you're doing. Our good friends over at HRT on the
>> other hand would be far more likely to care about latency on the order
>> of microseconds. Again, the use case matters a lot.
> ...
>>> so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
>>> see this?
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig <if> lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
> The "lower throughput than linux" that julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.

If we send bigger packets, then we do fewer lookups, do we not?
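
A rough sketch of that arithmetic, assuming each trip through the tx path costs one route lookup and one ARP lookup regardless of how many bytes it carries. The function name and the segment sizes below are illustrative assumptions, not taken from the stack:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of trips through the per-packet tx path needed to send
 * `payload` bytes when each trip carries at most `seg_bytes`.
 * Assumption: every trip pays for one route and one ARP lookup.
 */
static size_t
tx_path_trips(size_t payload, size_t seg_bytes)
{
    assert(seg_bytes > 0);
    return (payload + seg_bytes - 1) / seg_bytes;   /* ceiling division */
}
```

For a 1 MiB write, tx_path_trips(1048576, 1448) gives 725 trips, while tx_path_trips(1048576, 65536) gives 16, so handing the NIC TSO-sized chunks should cut the lookup count by well over an order of magnitude if the per-trip cost really is fixed.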

>
> Ack coalescing, LRO, GRO are limited to the set of packets that you
> receive in the same batch, which in turn is upper bounded by the
> interrupt moderation delay. Apart from simple benchmarks with only
> a few flows, it is very hard for ack/lro/gro to coalesce more
> than a few segments for the same flow.
>
> 	But the real fix is in tcp_output.
>
> In fact, it has never been the case that an ack (single or coalesced)
> triggers an immediate transmission in the output path.  We had this
> in the past (Silly Window Syndrome) and there is code that avoids
> sending less than 1-mtu under appropriate conditions (there is more
> data to push out anyways, no NODELAY, there are outstanding acks,
> the window can open further).  In all these cases there is no
> reasonable way to experience the difference in terms of latency.
>
> If one really cares, e.g. the High Speed Trading example, this is
> a non-issue because any reasonable person would run with TCP_NODELAY
> (and possibly disable interrupt moderation), and optimize for latency
> even on a per flow basis.
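
For reference, the per-flow switch mentioned here is a socket option, so it can be set on just the connections that care about latency. A minimal sketch using the standard sockets API; the helper name set_nodelay is made up for this example:

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/*
 * Turn on TCP_NODELAY for a single TCP socket.
 * Returns 0 on success, -1 on failure with errno set by setsockopt().
 */
static int
set_nodelay(int fd)
{
    int one = 1;

    assert(fd >= 0);    /* caller must pass an open descriptor */
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```

Other sockets on the same host keep the default behaviour, which is why a change that favours throughput by default need not hurt the latency-sensitive users.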
>
> In terms of coding effort, i suspect that by replacing the 1-mtu
> limit (t_maxseg i believe is the variable that we use in the SWS
> avoidance code) with 1-max-tso-segment we can probably achieve good
> results with little programming effort.
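
To make the proposal concrete, here is a simplified model of the "send now or wait" decision with the threshold pulled out as a parameter. This is only a sketch of the conditions listed above, not the code in tcp_output; struct tx_state, its fields and should_send_now are all hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-flow state; this is NOT struct tcpcb. */
struct tx_state {
    size_t sendable;    /* bytes the window currently allows */
    size_t queued;      /* bytes waiting in the send buffer */
    size_t unacked;     /* bytes in flight, so more acks are coming */
    bool   nodelay;     /* TCP_NODELAY is set on this flow */
};

/*
 * Decide whether to transmit now or wait for the window to open further.
 * `threshold` is the smallest burst worth sending while waiting is allowed:
 * one MSS today, one max-size TSO segment under the proposal.
 */
static bool
should_send_now(const struct tx_state *s, size_t threshold)
{
    size_t len = s->sendable < s->queued ? s->sendable : s->queued;

    assert(threshold > 0);
    if (len == 0)
        return false;   /* nothing queued, or the window is closed */
    if (len >= threshold)
        return true;    /* a full-sized burst is ready */
    if (s->nodelay)
        return true;    /* the application asked for lowest latency */
    if (len == s->queued)
        return true;    /* this drains the buffer, nothing to wait for */
    if (s->unacked == 0)
        return true;    /* no acks pending, so waiting could stall */
    return false;       /* more data queued and acks pending: wait */
}
```

With 4344 bytes sendable, 100000 queued and 20000 in flight, a threshold of 1448 sends three small segments right away, while a threshold of 65536 waits for further acks so that a larger burst can go out in one pass. With nodelay set, both thresholds send immediately.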
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.
>
> 	cheers
> 	luigi
>
>