From owner-freebsd-current@FreeBSD.ORG Thu Dec 8 13:30:52 2011
Delivered-To: current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AA920106564A;
	Thu, 8 Dec 2011 13:30:51 +0000 (UTC)
	(envelope-from lstewart@freebsd.org)
Received: from lauren.room52.net (lauren.room52.net [210.50.193.198])
	by mx1.freebsd.org (Postfix) with ESMTP id 449838FC14;
	Thu, 8 Dec 2011 13:30:51 +0000 (UTC)
Received: from lstewart1.loshell.room52.net
	(ppp59-167-184-191.static.internode.on.net [59.167.184.191])
	by lauren.room52.net (Postfix) with ESMTPSA id E413D7E824;
	Fri, 9 Dec 2011 00:11:50 +1100 (EST)
Message-ID: <4EE0B796.3050800@freebsd.org>
Date: Fri, 09 Dec 2011 00:11:50 +1100
From: Lawrence Stewart
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:7.0.1) Gecko/20111016
	Thunderbird/7.0.1
MIME-Version: 1.0
To: Luigi Rizzo
References: <20111205192703.GA49118@onelab2.iet.unipi.it>
	<2D87D847-A2B7-4E77-B6C1-61D73C9F582F@digsys.bg>
	<20111205222834.GA50285@onelab2.iet.unipi.it>
	<4EDDF9F4.9070508@digsys.bg>
	<4EDE259B.4010502@digsys.bg>
	<20111206210625.GB62605@onelab2.iet.unipi.it>
	<4EDF471F.1030202@freebsd.org>
	<20111207180807.GA71878@onelab2.iet.unipi.it>
In-Reply-To: <20111207180807.GA71878@onelab2.iet.unipi.it>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Status: No, score=0.0 required=5.0 tests=UNPARSEABLE_RELAY
	autolearn=unavailable version=3.3.2
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on lauren.room52.net
Cc: Daniel Kalchev, Andre Oppermann, current@freebsd.org, Jack Vogel
Subject: Re: quick summary results with ixgbe (was Re: datapoints on 10G throughput with TCP ?
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Thu, 08 Dec 2011 13:30:52 -0000

On 12/08/11 05:08, Luigi Rizzo wrote:
> On Wed, Dec 07, 2011 at 11:59:43AM +0100, Andre Oppermann wrote:
>> On 06.12.2011 22:06, Luigi Rizzo wrote:
> ...
>>> Even in my experiments there is a lot of instability in the results.
>>> I don't know exactly where the problem is, but the high number of
>>> read syscalls, and the huge impact of setting interrupt_rate=0
>>> (defaults at 16us on the ixgbe) makes me think that there is something
>>> that needs investigation in the protocol stack.
>>>
>>> Of course we don't want to optimize specifically for the one-flow-at-10G
>>> case, but devising something that makes the system less affected
>>> by short timing variations, and can pass upstream interrupt mitigation
>>> delays would help.
>>
>> I'm not sure the variance is only coming from the network card and
>> driver side of things. The TCP processing and interactions with
>> scheduler and locking probably play a big role as well. There have
>> been many changes to TCP recently and maybe an inefficiency that
>> affects high-speed single sessions throughput has crept in. That's
>> difficult to debug though.
>
> I ran a bunch of tests on the ixgbe (82599) using RELENG_8 (which
> seems slightly faster than HEAD) using MTU=1500 and various
> combinations of card capabilities (hwcsum,tso,lro), different window
> sizes and interrupt mitigation configurations.
>
> default latency is 16us, l=0 means no interrupt mitigation.
> lro is the software implementation of lro (tcp_lro.c)
> hwlro is the hardware one (on 82599). Using a window of 100 Kbytes
> seems to give the best results.
>
> Summary:

[snip]

> - enabling software lro on the transmit side actually slows
>   down the throughput (4-5Gbit/s instead of 8.0).
>   I am not sure why (perhaps acks are delayed too much) ?
>   Adding a couple of lines in tcp_lro to reject
>   pure acks seems to have much better effect.
>
> The tcp_lro patch below might actually be useful also for
> other cards.
>
> --- tcp_lro.c (revision 228284)
> +++ tcp_lro.c (working copy)
> @@ -245,6 +250,8 @@
>
>         ip_len = ntohs(ip->ip_len);
>         tcp_data_len = ip_len - (tcp->th_off << 2) - sizeof (*ip);
> +       if (tcp_data_len == 0)
> +               return -1;      /* not on ack */
>
>
>         /*

There is a bug with our LRO implementation (first noticed by Jeff
Roberson) that I started fixing some time back but dropped the ball on.
The crux of the problem is that we currently only send an ACK for the
entire LRO chunk instead of all the segments contained therein. Given
that most stacks rely on the ACK clock to keep things ticking over, the
current behaviour kills performance. It may well be the cause of the
performance loss you have observed. WIP patch is at:

http://people.freebsd.org/~lstewart/patches/misctcp/tcplro_multiack_9.x.r219723.patch

Jeff tested the WIP patch and it *doesn't* fix the issue. I don't have
LRO capable hardware setup locally to figure out what I've missed. Most
of the machines in my lab are running em(4) NICs which don't support
LRO, but I'll see if I can find something which does and perhaps
resurrect this patch.

If anyone has any ideas what I'm missing in the patch to make it work,
please let me know.

Cheers,
Lawrence
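
For anyone following along without the kernel source at hand, the
pure-ACK check in the quoted tcp_lro.c hunk can be sketched in userland
roughly as below. This is only an illustration under stated assumptions,
not the kernel code: the function names sketch_tcp_data_len() and
sketch_lro_accepts() are made up for this example, and the IP header
length is passed in as a word count (ip_hl), whereas the quoted hunk
subtracts sizeof (*ip), which is 20 bytes for a header without options.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* ntohs(), htons() */

/*
 * Hypothetical helper, not part of tcp_lro.c: payload length of an
 * IPv4/TCP segment computed from its header fields.
 *   ip_len_net  IP total length in network byte order (ip->ip_len)
 *   ip_hl       IP header length in 32-bit words (assumption: the
 *               quoted hunk uses sizeof (*ip) instead)
 *   th_off      TCP data offset in 32-bit words (tcp->th_off)
 */
static int
sketch_tcp_data_len(uint16_t ip_len_net, unsigned int ip_hl,
    unsigned int th_off)
{
	int ip_len = ntohs(ip_len_net);

	/* "<< 2" converts a count of 32-bit words into bytes. */
	return (ip_len - (int)(th_off << 2) - (int)(ip_hl << 2));
}

/*
 * Hypothetical wrapper mirroring the two added lines of the patch:
 * return -1 for a segment with no payload (a pure ACK) so that it is
 * not queued for aggregation, and 0 otherwise.
 */
static int
sketch_lro_accepts(uint16_t ip_len_net, unsigned int ip_hl,
    unsigned int th_off)
{
	if (sketch_tcp_data_len(ip_len_net, ip_hl, th_off) == 0)
		return (-1);	/* pure ACK: leave it alone */
	return (0);
}
```

With 20-byte IP and TCP headers (ip_hl = 5, th_off = 5), a 40-byte
packet carries no payload and is rejected, while a full 1500-byte
packet carries 1460 bytes and is accepted. In the quoted patch the -1
return presumably makes the driver hand the ACK to the stack
immediately instead of holding it in the LRO queue.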