From: Andre Oppermann <andre@freebsd.org>
Date: Fri, 09 Dec 2011 01:33:04 +0100
To: Luigi Rizzo
Cc: Lawrence Stewart, Daniel Kalchev, Jack Vogel, current@freebsd.org, np@freebsd.org
Subject: Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP?)
Message-ID: <4EE15740.9030505@freebsd.org>
In-Reply-To: <20111208153454.GA80979@onelab2.iet.unipi.it>

On 08.12.2011 16:34, Luigi Rizzo wrote:
> On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
>> On 12/08/11 05:08, Luigi Rizzo wrote:
> ...
>>> I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
>>> seems slightly faster than HEAD), with MTU=1500 and various
>>> combinations of card capabilities (hwcsum, tso, lro), different
>>> window sizes and interrupt mitigation configurations.
>>>
>>> The default latency is 16us; l=0 means no interrupt mitigation.
>>> "lro" is the software implementation of LRO (tcp_lro.c);
>>> "hwlro" is the hardware one (on the 82599). Using a window of
>>> 100 Kbytes seems to give the best results.
>>>
>>> Summary:
>>
>> [snip]
>>
>>> - Enabling software LRO on the transmit side actually slows
>>>   down the throughput (4-5 Gbit/s instead of 8.0).
>>>   I am not sure why (perhaps ACKs are delayed too much?).
>>>   Adding a couple of lines in tcp_lro to reject pure ACKs
>>>   seems to have a much better effect.
>>>
>>> The tcp_lro patch below might actually be useful for
>>> other cards as well.
>>>
>>> --- tcp_lro.c	(revision 228284)
>>> +++ tcp_lro.c	(working copy)
>>> @@ -245,6 +250,8 @@
>>>
>>>  	ip_len = ntohs(ip->ip_len);
>>>  	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
>>> +	if (tcp_data_len == 0)
>>> +		return -1;	/* do not coalesce pure ACKs */
>>>
>>>
>>>  	/*
>>
>> There is a bug in our LRO implementation (first noticed by Jeff
>> Roberson) that I started fixing some time back but dropped the ball
>> on. The crux of the problem is that we currently only send an ACK
>> for the entire LRO chunk instead of for all the segments contained
>> therein. Given that most stacks rely on the ACK clock to keep things
>> ticking over, the current behaviour kills performance. It may well
>> be the cause of the performance loss you have observed.
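The idea behind the fix Lawrence describes, ACKing each MSS-sized piece
of the aggregate instead of the chunk as a whole, can be sketched
independently of the stack internals. Everything here is illustrative:
ack_lro_aggregate() and the ack_send() callback are made-up names for
this sketch, not FreeBSD code and not the contents of the WIP patch.

#include <stdint.h>
#include <stdio.h>

typedef void (*ack_send_t)(uint32_t ack_seq);

/*
 * Emit one cumulative ACK per MSS worth of data in an LRO aggregate,
 * plus a final ACK covering any trailing partial segment, instead of
 * a single stretch ACK for the whole aggregate.
 */
static void
ack_lro_aggregate(uint32_t rcv_nxt, uint32_t aggregate_len, uint32_t mss,
    ack_send_t ack_send)
{
	uint32_t acked;

	for (acked = mss; acked < aggregate_len; acked += mss)
		ack_send(rcv_nxt + acked);
	ack_send(rcv_nxt + aggregate_len);
}

static void
print_ack(uint32_t ack_seq)
{
	printf("ACK %u\n", ack_seq);
}

int
main(void)
{
	/* A 4-segment aggregate: 3 x 1448 bytes plus a 500-byte tail. */
	ack_lro_aggregate(100000, 3 * 1448 + 500, 1448, print_ack);
	return (0);
}

Whether a burst of back-to-back ACKs like this actually restores the
ACK clock is exactly what Luigi questions further down.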
> I should clarify better.
> First of all, I tested two different LRO implementations: our
> "software LRO" (tcp_lro.c), and the "hardware LRO" implemented by
> the 82599 (called RSC, receive side coalescing, in the 82599 data
> sheets). Jack Vogel and Navdeep Parhar (both in Cc) can probably
> comment on the logic of both.
>
> In my tests, either SW or HW LRO on the receive side HELPED A LOT,
> not just in terms of raw throughput but also in terms of system
> load on the receiver. On the receive side, LRO packs multiple data
> segments into one that is passed up the stack.
>
> As you mentioned, this also reduces the number of ACKs generated,
> but not dramatically (consider that LRO is bounded by the number
> of segments received during the mitigation interval).
> In my tests the number of read()s on the receiver was reduced by
> approximately a factor of 3 compared to the !LRO case, meaning 4-5
> segments merged per LRO aggregate. Navdeep reported similar numbers
> for cxgbe.
>
> Using hardware LRO on the transmit side had no ill effect.
> Being done in hardware, I have no idea how it is implemented.
>
> Using software LRO on the transmit side did give a significant
> throughput reduction. I can't explain the exact cause, though it is
> possible that between reducing the number of segments to the
> receiver and collapsing the ACKs that it generates, the sender
> starves. But it could well be that it is the extra delay in passing
> up the ACKs that limits performance.
> Either way, since the HW LRO did a fine job, I was trying to figure
> out whether avoiding LRO on pure ACKs could help, and the two-line
> patch above did help.
>
> Note, my patch was just a proof of concept, and may cause
> reordering if a data segment is followed by a pure ACK.
> But this can be fixed easily, by handling a pure ACK as an
> out-of-sequence packet in tcp_lro_rx().
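Concretely, the reordering-safe variant of the two-line patch would
flush any pending aggregate for the same connection before letting the
pure ACK through, mirroring what tcp_lro_rx() already does for
non-contiguous data. A sketch against the same tcp_lro.c follows; the
field and helper names are recalled from that era's code and should be
treated as approximate, not as a tested patch:

	/* In tcp_lro_rx(), after tcp_data_len has been computed: */
	if (tcp_data_len == 0) {
		/*
		 * Pure ACK: flush any aggregate queued for this
		 * connection first, so the ACK cannot overtake data
		 * segments that arrived before it.
		 */
		SLIST_FOREACH(le, &cntl->lro_active, next) {
			if (le->source_port == tcp->th_sport &&
			    le->dest_port == tcp->th_dport &&
			    le->source_ip == ip->ip_src.s_addr &&
			    le->dest_ip == ip->ip_dst.s_addr) {
				SLIST_REMOVE(&cntl->lro_active, le,
				    lro_entry, next);
				tcp_lro_flush(cntl, le);
				break;
			}
		}
		return (-1);	/* hand the pure ACK up unmodified */
	}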
>> WIP patch is at:
>> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
>>
>> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't
>> have LRO-capable hardware set up locally to figure out what I've
>> missed. Most of the machines in my lab are running em(4) NICs which
>> don't support LRO, but I'll see if I can find something which does
>> and perhaps resurrect this patch.

LRO can always be done in software. You can do it at the driver,
ether_input() or ip_input() level.

> A few comments:
> 1. I don't think it makes sense to send multiple ACKs on coalesced
>    segments (and the 82599 does not seem to do that). First of all,
>    the ACKs would go out with minimal spacing (ideally less than
>    100ns), so chances are that the remote end will see them in a
>    single burst anyway. Secondly, the remote end can easily tell
>    that a single ACK covers multiple MSS and behave as if an
>    equivalent number of ACKs had arrived.

ABC (appropriate byte counting, RFC 3465) gets in the way though: it
caps the cwnd increase allowed per ACK, so a single stretch ACK grows
the window less than the equivalent train of per-segment ACKs would.

> 2. I am a big fan of LRO (and similar solutions), because it can save
>    a lot of repeated work when passing packets up the stack, and the
>    mechanism becomes more and more effective as the system load
>    increases, which is a wonderful property in terms of system
>    stability.
>
>    For this reason, I think it would be useful to add support for
>    software LRO to the generic code (sys/net/if.c) so that drivers
>    can directly use the software implementation even without
>    hardware support.

It hurts on higher-RTT links in the general case. For LAN RTTs it's
good.

> 3. Similar to LRO, it would make sense to implement a "software TSO"
>    mechanism where the TCP sender pushes a large segment down to
>    ether_output() and code in if_ethersubr.c does the segmentation
>    and checksum computation. This would save multiple traversals of
>    the various stack layers that currently recompute essentially the
>    same information for every segment. [A sketch of the per-segment
>    fixups follows at the end of this message.]

All modern NICs support hardware TSO, so there's little benefit in
having a parallel software implementation. And then you run into the
mbuf chain copying issue further down the layers. The win won't be
much.

-- 
Andre
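To make point 3 concrete, here is a standalone sketch of the
per-segment header fixups such a software TSO layer would perform. A
real implementation in if_ethersubr.c would operate on mbuf chains and
also compute the IP and TCP checksums; all names below are
illustrative, none of this is existing FreeBSD code.

#include <stdint.h>
#include <stdio.h>

#define TH_FIN		0x01
#define TH_PSH		0x08
#define IP_TCP_HDRLEN	40	/* 20-byte IP + 20-byte TCP, no options */

/* The handful of header fields that change from segment to segment. */
struct seg_hdr {
	uint16_t ip_len;	/* total IP datagram length (host order) */
	uint16_t ip_id;		/* IP identification */
	uint32_t th_seq;	/* TCP sequence number */
	uint8_t  th_flags;	/* TCP flags */
};

/*
 * Split one large TCP segment into MSS-sized frames: advance the
 * sequence number, give each frame a unique IP id, set the per-frame
 * length, and keep FIN/PSH only on the last frame.
 */
static void
soft_tso_split(const struct seg_hdr *big, uint32_t payload_len,
    uint32_t mss)
{
	uint32_t off;

	for (off = 0; off < payload_len; off += mss) {
		struct seg_hdr s = *big;
		uint32_t chunk = payload_len - off;

		if (chunk > mss)
			chunk = mss;
		s.th_seq = big->th_seq + off;
		s.ip_id = big->ip_id + off / mss;
		s.ip_len = IP_TCP_HDRLEN + chunk;
		if (off + chunk < payload_len)
			s.th_flags &= ~(TH_FIN | TH_PSH);
		printf("seg: seq=%u len=%u flags=%#x\n", s.th_seq, chunk,
		    s.th_flags);
	}
}

int
main(void)
{
	struct seg_hdr big = { 0, 1000, 100000, TH_PSH };

	soft_tso_split(&big, 4500, 1460);
	return (0);
}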