Date: Thu, 8 Dec 2011 16:34:54 +0100
From: Luigi Rizzo
To: Lawrence Stewart
Cc: Daniel Kalchev, Jack Vogel, Andre Oppermann, current@freebsd.org, np@freebsd.org
Subject: Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?)

On Fri, Dec 09, 2011 at 12:11:50AM +1100, Lawrence Stewart wrote:
> On 12/08/11 05:08, Luigi Rizzo wrote:
...
> > I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
> > seems slightly faster than HEAD), with MTU=1500 and various
> > combinations of card capabilities (hwcsum, tso, lro), different
> > window sizes and interrupt mitigation configurations.
> >
> > The default interrupt mitigation latency is 16us; l=0 means no
> > interrupt mitigation. "lro" is the software implementation of LRO
> > (tcp_lro.c); "hwlro" is the hardware one (on the 82599). Using a
> > window of 100 Kbytes seems to give the best results.
> >
> > Summary:
> [snip]
> > - enabling software lro on the transmit side actually slows down
> >   the throughput (4-5 Gbit/s instead of 8.0). I am not sure why
> >   (perhaps acks are delayed too much)? Adding a couple of lines in
> >   tcp_lro to reject pure acks seems to have a much better effect.
> >
> > The tcp_lro patch below might actually be useful also for other
> > cards.
> >
> > --- tcp_lro.c	(revision 228284)
> > +++ tcp_lro.c	(working copy)
> > @@ -245,6 +250,8 @@
> >
> >  	ip_len = ntohs(ip->ip_len);
> >  	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
> > +	if (tcp_data_len == 0)
> > +		return -1;	/* pure ack, do not coalesce */
> >
> >
> >  	/*
>
> There is a bug in our LRO implementation (first noticed by Jeff
> Roberson) that I started fixing some time back but dropped the ball
> on. The crux of the problem is that we currently send only one ACK
> for the entire LRO chunk instead of one for each segment contained
> therein. Given that most stacks rely on the ACK clock to keep things
> ticking over, the current behaviour kills performance. It may well be
> the cause of the performance loss you have observed.

I should clarify.

First of all, I tested two different LRO implementations: our
"software LRO" (tcp_lro.c), and the "hardware LRO" implemented by the
82599 (called RSC, receive side coalescing, in the 82599 data sheets).
Jack Vogel and Navdeep Parhar (both in Cc) can probably comment on the
logic of both.

In my tests, either SW or HW LRO on the receive side HELPED A LOT, not
just in terms of raw throughput but also in terms of system load on
the receiver. On the receive side, LRO packs multiple data segments
into one that is passed up the stack. As you mentioned, this also
reduces the number of acks generated, but not dramatically (the amount
of coalescing is bounded by the number of segments that arrive within
one interrupt mitigation interval). In my tests the number of read()s
on the receiver was reduced by roughly a factor of 3 compared to the
!LRO case, meaning 4-5 segments merged per LRO event. Navdeep reported
similar numbers for cxgbe.

Using hardware LRO on the transmit side (i.e. on the sender, where LRO
mostly sees the incoming acks) had no ill effect. Being done in
hardware, I have no idea how it is implemented.

Using software LRO on the transmit side did give a significant
throughput reduction. I can't explain the exact cause: it may be that,
between the receiver generating fewer acks and LRO collapsing them,
the sender starves; but it could equally be the extra delay in passing
the acks up that limits performance. Either way, since the HW LRO did
a fine job, I tried to see whether avoiding LRO on pure acks could
help, and the two-line patch above did.

Note, my patch was just a proof of concept, and may cause reordering
if a data segment is followed by a pure ack. But this can be fixed
easily by handling a pure ack as an out-of-sequence packet in
tcp_lro_rx(), as in the sketch below.
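Something along these lines; completely untested, and I am quoting the
r228284 structure ("cntl", struct lro_entry and its fields) from
memory, so take it as a sketch rather than a patch. The lookup simply
duplicates the one tcp_lro_rx() already does a few lines further down:

	ip_len = ntohs(ip->ip_len);
	tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
	if (tcp_data_len == 0) {
		struct lro_entry *le;

		/*
		 * Pure ack: never coalesce it, but first flush any
		 * chain pending for the same connection, so the ack
		 * cannot overtake data segments queued in the entry.
		 */
		SLIST_FOREACH(le, &cntl->lro_active, next) {
			if (le->source_port == tcp->th_sport &&
			    le->dest_port == tcp->th_dport &&
			    le->source_ip == ip->ip_src.s_addr &&
			    le->dest_ip == ip->ip_dst.s_addr) {
				SLIST_REMOVE(&cntl->lro_active, le,
				    lro_entry, next);
				tcp_lro_flush(cntl, le);
				break;
			}
		}
		return -1;	/* pass the pure ack up unmodified */
	}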
> WIP patch is at:
> http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch
>
> Jeff tested the WIP patch and it *doesn't* fix the issue. I don't
> have LRO capable hardware set up locally to figure out what I've
> missed. Most of the machines in my lab are running em(4) NICs which
> don't support LRO, but I'll see if I can find something which does
> and perhaps resurrect this patch.

A few comments:

1. I don't think it makes sense to send multiple acks for coalesced
segments (and the 82599 does not seem to do that). First, the acks
would go out with minimal spacing (ideally less than 100ns), so
chances are that the remote end would see them as a single burst
anyway. Second, the remote end can easily tell that a single ACK
covers multiple MSS and behave as if an equivalent number of acks had
arrived.

2. I am a big fan of LRO (and similar solutions), because it can save
a lot of repeated work when passing packets up the stack, and the
mechanism becomes more and more effective as the system load
increases, which is a wonderful property in terms of system stability.
For this reason, I think it would be useful to add support for
software LRO to the generic code (sys/net/if.c) so that drivers can
use the software implementation even without hardware support.

3. Similar to LRO, it would make sense to implement a "software TSO"
mechanism: the TCP sender pushes one large segment down to
ether_output(), and code in if_ethersubr.c does the segmentation and
checksum computation. This would save multiple traversals of the
various layers of the stack, which currently recompute essentially the
same information for every segment. A rough sketch of what I have in
mind is below.
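Hypothetical code, nothing like this exists in the tree: the name
ether_sw_tso(), its placement and the checksum strategy are all made
up, and error handling is minimal. It assumes a contiguous
ethernet+IPv4+TCP header that fits in one mbuf (hlen <= MHLEN), a
non-empty payload, and a card that can still do TCP checksum offload:

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/mbuf.h>
	#include <net/if.h>
	#include <net/ethernet.h>
	#include <netinet/in.h>
	#include <netinet/in_systm.h>
	#include <netinet/ip.h>
	#include <netinet/tcp.h>
	#include <machine/in_cksum.h>

	static int
	ether_sw_tso(struct ifnet *ifp, struct mbuf *m0, int mss)
	{
		struct ip *ip;
		struct tcphdr *th;
		struct mbuf *m, *payload;
		int hlen, iphlen, thlen, datalen, off, n, error = 0;
		uint32_t seq;
		uint16_t id;

		ip = (struct ip *)(mtod(m0, caddr_t) + ETHER_HDR_LEN);
		iphlen = ip->ip_hl << 2;
		th = (struct tcphdr *)((caddr_t)ip + iphlen);
		thlen = th->th_off << 2;
		hlen = ETHER_HDR_LEN + iphlen + thlen;
		datalen = m0->m_pkthdr.len - hlen;
		seq = ntohl(th->th_seq);
		id = ntohs(ip->ip_id);

		for (off = 0; off < datalen; off += n) {
			n = imin(mss, datalen - off);

			/* writable copy of the headers ... */
			MGETHDR(m, M_DONTWAIT, MT_DATA);
			if (m == NULL) {
				error = ENOBUFS;
				break;
			}
			m->m_len = hlen;	/* assume hlen <= MHLEN */
			m_copydata(m0, 0, hlen, mtod(m, caddr_t));

			/* ... plus a read-only copy of this payload slice */
			payload = m_copym(m0, hlen + off, n, M_DONTWAIT);
			if (payload == NULL) {
				m_freem(m);
				error = ENOBUFS;
				break;
			}
			m->m_next = payload;
			m->m_pkthdr.len = hlen + n;
			m->m_pkthdr.rcvif = NULL;

			/* patch the per-segment header fields */
			ip = (struct ip *)(mtod(m, caddr_t) + ETHER_HDR_LEN);
			th = (struct tcphdr *)((caddr_t)ip + iphlen);
			ip->ip_len = htons(iphlen + thlen + n);
			ip->ip_id = htons(id++);
			th->th_seq = htonl(seq + off);
			if (off + n < datalen)	/* last segment keeps these */
				th->th_flags &= ~(TH_FIN | TH_PUSH);

			/* recompute the IP header checksum in software ... */
			ip->ip_sum = 0;
			ip->ip_sum = in_cksum_skip(m,
			    ETHER_HDR_LEN + iphlen, ETHER_HDR_LEN);
			/* ... and leave the TCP checksum to the card,
			 * with the usual pseudo-header seed in th_sum */
			th->th_sum = in_pseudo(ip->ip_src.s_addr,
			    ip->ip_dst.s_addr,
			    htons((u_short)(thlen + n + IPPROTO_TCP)));
			m->m_pkthdr.csum_flags = CSUM_TCP;
			m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum);

			if ((error = (*ifp->if_transmit)(ifp, m)) != 0)
				break;
		}
		m_freem(m0);	/* clusters are refcounted, copies survive */
		return (error);
	}

cheers
luigi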