From: Andre Oppermann <andre@freebsd.org>
Date: Fri, 09 Dec 2011 01:33:04 +0100
To: Luigi Rizzo
Cc: Lawrence Stewart, Daniel Kalchev, Jack Vogel, current@freebsd.org, np@freebsd.org
Subject: Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP?)
Message-ID: <4EE15740.9030505@freebsd.org>
In-Reply-To: <20111208153454.GA80979@onelab2.iet.unipi.it>

On 08.12.2011 16:34, Luigi Rizzo wrote:
> On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
>> On 12/08/11 05:08, Luigi Rizzo wrote:
> ...
>>> I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
>>> seems slightly faster than HEAD), with MTU=1500 and various
>>> combinations of card capabilities (hwcsum, tso, lro), different
>>> window sizes and interrupt mitigation configurations.
>>>
>>> The default latency is 16us; l=0 means no interrupt mitigation.
>>> "lro" is the software implementation of LRO (tcp_lro.c);
>>> "hwlro" is the hardware one (on the 82599). Using a window of
>>> 100 Kbytes seems to give the best results.
>>>
>>> Summary:
>>
>> [snip]
>>
>>> - Enabling software LRO on the transmit side actually slows
>>>   down the throughput (4-5 Gbit/s instead of 8.0).
>>>   I am not sure why (perhaps ACKs are delayed too much?).
>>>   Adding a couple of lines in tcp_lro to reject pure ACKs
>>>   seems to have a much better effect.
>>>
>>> The tcp_lro patch below might actually be useful for
>>> other cards as well.
>>>
>>> --- tcp_lro.c	(revision 228284)
>>> +++ tcp_lro.c	(working copy)
>>> @@ -245,6 +250,8 @@
>>>
>>>  	ip_len = ntohs(ip->ip_len);
>>>  	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
>>> +	if (tcp_data_len == 0)
>>> +		return -1;	/* do not coalesce pure ACKs */
>>>
>>>
>>>  	/*
>>
>> There is a bug in our LRO implementation (first noticed by Jeff
>> Roberson) that I started fixing some time back but dropped the ball
>> on. The crux of the problem is that we currently only send an ACK
>> for the entire LRO chunk instead of for all the segments contained
>> therein. Given that most stacks rely on the ACK clock to keep things
>> ticking over, the current behaviour kills performance. It may well
>> be the cause of the performance loss you have observed.
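The idea behind the fix Lawrence describes, ACKing each MSS-sized piece
of the aggregate instead of the chunk as a whole, can be sketched
independently of the stack internals. Everything here is illustrative:
ack_lro_aggregate() and the ack_send() callback are made-up names for
this sketch, not FreeBSD code and not the contents of the WIP patch.

#include <stdint.h>
#include <stdio.h>

typedef void (*ack_send_t)(uint32_t ack_seq);

/*
 * Emit one cumulative ACK per MSS worth of data in an LRO aggregate,
 * plus a final ACK covering any trailing partial segment, instead of
 * a single stretch ACK for the whole aggregate.
 */
static void
ack_lro_aggregate(uint32_t rcv_nxt, uint32_t aggregate_len, uint32_t mss,
    ack_send_t ack_send)
{
	uint32_t acked;

	for (acked = mss; acked < aggregate_len; acked += mss)
		ack_send(rcv_nxt + acked);
	ack_send(rcv_nxt + aggregate_len);
}

static void
print_ack(uint32_t ack_seq)
{
	printf("ACK %u\n", ack_seq);
}

int
main(void)
{
	/* A 4-segment aggregate: 3 x 1448 bytes plus a 500-byte tail. */
	ack_lro_aggregate(100000, 3 * 1448 + 500, 1448, print_ack);
	return (0);
}

Whether a burst of back-to-back ACKs like this actually restores the
ACK clock is exactly what Luigi questions further down.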
> I should clarify better.
> First of all, I tested two different LRO implementations: our
> "software LRO" (tcp_lro.c), and the "hardware LRO" implemented by
> the 82599 (called RSC, receive side coalescing, in the 82599 data
> sheets). Jack Vogel and Navdeep Parhar (both in Cc) can probably
> comment on the logic of both.
>
> In my tests, either SW or HW LRO on the receive side HELPED A LOT,
> not just in terms of raw throughput but also in terms of system
> load on the receiver. On the receive side, LRO packs multiple data
> segments into one that is passed up the stack.
>
> As you mentioned, this also reduces the number of ACKs generated,
> but not dramatically (consider that LRO is bounded by the number
> of segments received during the mitigation interval).
> In my tests the number of read()s on the receiver was reduced by
> approximately a factor of 3 compared to the !LRO case, meaning 4-5
> segments merged per LRO aggregate. Navdeep reported similar numbers
> for cxgbe.
>
> Using hardware LRO on the transmit side had no ill effect.
> Being done in hardware, I have no idea how it is implemented.
>
> Using software LRO on the transmit side did give a significant
> throughput reduction. I can't explain the exact cause, though it is
> possible that between reducing the number of segments to the
> receiver and collapsing the ACKs that it generates, the sender
> starves. But it could well be that it is the extra delay in passing
> up the ACKs that limits performance.
> Either way, since the HW LRO did a fine job, I was trying to figure
> out whether avoiding LRO on pure ACKs could help, and the two-line
> patch above did help.
>
> Note, my patch was just a proof of concept, and may cause
> reordering if a data segment is followed by a pure ACK.
> But this can be fixed easily, by handling a pure ACK as an
> out-of-sequence packet in tcp_lro_rx().
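Concretely, the reordering-safe variant of the two-line patch would
flush any pending aggregate for the same connection before letting the
pure ACK through, mirroring what tcp_lro_rx() already does for
non-contiguous data. A sketch against the same tcp_lro.c follows; the
field and helper names are recalled from that era's code and should be
treated as approximate, not as a tested patch:

	/* In tcp_lro_rx(), after tcp_data_len has been computed: */
	if (tcp_data_len == 0) {
		/*
		 * Pure ACK: flush any aggregate queued for this
		 * connection first, so the ACK cannot overtake data
		 * segments that arrived before it.
		 */
		SLIST_FOREACH(le, &cntl->lro_active, next) {
			if (le->source_port == tcp->th_sport &&
			    le->dest_port == tcp->th_dport &&
			    le->source_ip == ip->ip_src.s_addr &&
			    le->dest_ip == ip->ip_dst.s_addr) {
				SLIST_REMOVE(&cntl->lro_active, le,
				    lro_entry, next);
				tcp_lro_flush(cntl, le);
				break;
			}
		}
		return (-1);	/* hand the pure ACK up unmodified */
	}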
>> WIP patch is at:
>> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
>>
>> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't
>> have LRO-capable hardware set up locally to figure out what I've
>> missed. Most of the machines in my lab are running em(4) NICs which
>> don't support LRO, but I'll see if I can find something which does
>> and perhaps resurrect this patch.

LRO can always be done in software. You can do it at the driver,
ether_input() or ip_input() level.

> A few comments:
> 1. I don't think it makes sense to send multiple ACKs on coalesced
>    segments (and the 82599 does not seem to do that). First of all,
>    the ACKs would go out with minimal spacing (ideally less than
>    100ns), so chances are that the remote end will see them in a
>    single burst anyway. Secondly, the remote end can easily tell
>    that a single ACK covers multiple MSS and behave as if an
>    equivalent number of ACKs had arrived.

ABC (appropriate byte counting, RFC 3465) gets in the way though: it
caps the cwnd increase allowed per ACK, so a single stretch ACK grows
the window less than the equivalent train of per-segment ACKs would.

> 2. I am a big fan of LRO (and similar solutions), because it can save
>    a lot of repeated work when passing packets up the stack, and the
>    mechanism becomes more and more effective as the system load
>    increases, which is a wonderful property in terms of system
>    stability.
>
>    For this reason, I think it would be useful to add support for
>    software LRO to the generic code (sys/net/if.c) so that drivers
>    can directly use the software implementation even without
>    hardware support.

It hurts on higher-RTT links in the general case. For LAN RTTs it's
good.

> 3. Similar to LRO, it would make sense to implement a "software TSO"
>    mechanism where the TCP sender pushes a large segment down to
>    ether_output() and code in if_ethersubr.c does the segmentation
>    and checksum computation. This would save multiple traversals of
>    the various stack layers that currently recompute essentially the
>    same information for every segment. [A sketch of the per-segment
>    fixups follows at the end of this message.]

All modern NICs support hardware TSO, so there's little benefit in
having a parallel software implementation. And then you run into the
mbuf chain copying issue further down the layers. The win won't be
much.

-- 
Andre
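To make point 3 concrete, here is a standalone sketch of the
per-segment header fixups such a software TSO layer would perform. A
real implementation in if_ethersubr.c would operate on mbuf chains and
also compute the IP and TCP checksums; all names below are
illustrative, none of this is existing FreeBSD code.

#include <stdint.h>
#include <stdio.h>

#define TH_FIN		0x01
#define TH_PSH		0x08
#define IP_TCP_HDRLEN	40	/* 20-byte IP + 20-byte TCP, no options */

/* The handful of header fields that change from segment to segment. */
struct seg_hdr {
	uint16_t ip_len;	/* total IP datagram length (host order) */
	uint16_t ip_id;		/* IP identification */
	uint32_t th_seq;	/* TCP sequence number */
	uint8_t  th_flags;	/* TCP flags */
};

/*
 * Split one large TCP segment into MSS-sized frames: advance the
 * sequence number, give each frame a unique IP id, set the per-frame
 * length, and keep FIN/PSH only on the last frame.
 */
static void
soft_tso_split(const struct seg_hdr *big, uint32_t payload_len,
    uint32_t mss)
{
	uint32_t off;

	for (off = 0; off < payload_len; off += mss) {
		struct seg_hdr s = *big;
		uint32_t chunk = payload_len - off;

		if (chunk > mss)
			chunk = mss;
		s.th_seq = big->th_seq + off;
		s.ip_id = big->ip_id + off / mss;
		s.ip_len = IP_TCP_HDRLEN + chunk;
		if (off + chunk < payload_len)
			s.th_flags &= ~(TH_FIN | TH_PSH);
		printf("seg: seq=%u len=%u flags=%#x\n", s.th_seq, chunk,
		    s.th_flags);
	}
}

int
main(void)
{
	struct seg_hdr big = { 0, 1000, 100000, TH_PSH };

	soft_tso_split(&big, 4500, 1460);
	return (0);
}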