From owner-freebsd-net@FreeBSD.ORG Wed Aug 14 06:33:15 2013
Message-ID: <520B24A0.4000706@freebsd.org>
Date: Wed, 14 Aug 2013 14:33:04 +0800
From: Julian Elischer <julian@freebsd.org>
To: Lawrence Stewart
Cc: FreeBSD Net
Subject: Re: TSO and FreeBSD vs Linux
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
In-Reply-To: <520AFBE8.1090109@freebsd.org>

On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> On 08/14/13 03:29, Julian Elischer wrote:
>> I have been tracking down a performance embarrassment on Amazon EC2
>> and I think I have found it.
>
> Let us please avoid conflating performance with throughput. The
> behaviour you go on to describe as a performance embarrassment is
> actually a throughput difference, and the FreeBSD behaviour you're
> describing is essentially sacrificing throughput and CPU cycles for
> lower latency. That may not be a trade-off you like, but it is an
> important factor in this discussion.

It was an embarrassment in that in one class of test we performed very
poorly. It was not a disaster or a show-stopper, but for our product it
is a critical number. It is a throughput difference, as you say, but
throughput is a very important part of performance. Linux's latency
didn't seem to be any worse than FreeBSD's; its throughput was just a
lot higher in the same scenario.

> Don't fall into the trap of labelling Linux's propensity for
> maximising throughput as superior to an alternative approach which
> strikes a different balance. It all depends on the use case.

Well, the Linux balance seems to be "better all around" at the moment,
so that is embarrassing. :-) I could see no latency regression.

>> Our OS cousins over in Linux land have implemented some interesting
>> behaviour when TSO is in use.
>>
>> They seem to aggregate ACKs when there is a lot of traffic so that
>> they can create the largest possible TSO packet. We, on the other
>> hand, respond to each and every returning ACK as it arrives, and
>> thus generally fall into the behaviour of sending a bunch of small
>> packets, each sized to match a single incoming ACK.

> There's a thing controlled by ethtool called GRO (generic receive
> offload) which appears to be enabled by default on at least Ubuntu
> and I guess other Linuxes too. It's responsible for aggregating ACKs
> and data to batch them up the stack if the driver doesn't provide a
> hardware offload implementation. Try rerunning your experiments with
> the ACK batching disabled on the Linux host to get an additional
> comparison point.

I will try that as soon as I get back to the machines in question.
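If I understand ethtool correctly, something like the following should
show and then disable GRO for that test (eth0 here is just a stand-in
for whatever the real interface on those machines is called):

  # ethtool -k eth0 | grep generic-receive-offload   # query current state
  # ethtool -K eth0 gro off                          # disable GRO
  # ethtool -K eth0 gro on                           # re-enable afterwards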
>> For two examples look at:
>>
>> http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
>> and
>> http://www.freebsd.org/~julian/LvsF-tcp.tiff
>>
>> In each case, FreeBSD is on the left and Linux is on the right.
>>
>> The first trace shows the sessions as they start, and the second
>> shows a point some distance later (where the sequence numbers wrap
>> around; no particular reason to use that point, it was just fun to
>> see). In both traces you can see that each Linux packet (white),
>> once things have got going, is responding to multiple bumps in the
>> send window sequence number (green and yellow lines, representing
>> the arrival of several ACKs), while FreeBSD produces a whole bunch
>> of smaller packets, slavishly following exactly the size of each
>> incoming ACK. This gives us quite a performance debt.
>
> Again, please s/performance/what-you-really-mean/ here.

OK: in my tests this makes FreeBSD data transfers much slower, by as
much as 60%.

>> Notice that this behaviour in Linux seems to be modal: it seems to
>> 'switch on' a little way into the 'starting' trace.
>>
>> In addition, you can see that Linux also gets going faster at the
>> very beginning, where TSO isn't yet in play, by sending a lot more
>> packets up-front (the wisdom of which can of course be argued).
>
> They switched to using an initial window of 10 segments some time
> ago. FreeBSD starts with 3, or more recently 10 if you're running
> recent 9-STABLE or 10-CURRENT.

I tried setting the initial values as shown:

  net.inet.tcp.local_slowstart_flightsize: 10
  net.inet.tcp.slowstart_flightsize: 10

It didn't seem to make much difference, but I will redo the test.

>> Has anyone done any work on aggregating ACKs, or delaying responding
>> to them?
>
> As noted by Navdeep, we already have the code to aggregate ACKs in
> our software LRO implementation. The bigger problem is that
> appropriate byte counting places a default 2*MSS limit on the amount
> of ACKed data the window can grow by, i.e. if an ACK for 64k of data
> comes up the stack, we'll grow the window by 2 segments' worth of
> data in response. That needs to be addressed - we could send the ACK
> count up with the aggregated single ACK, or just ignore abc_l_var
> when LRO is in use for a connection.

So, does "software LRO" mean that LRO on the NIC should be ON or OFF
to see this?

> Cheers,
> Lawrence
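For anyone else who wants to poke at this while I re-run the tests: if
I have the knobs right, LRO can be toggled per interface with ifconfig
(em0 below is just an example; the driver has to advertise the
capability), and the appropriate-byte-counting limit Lawrence refers
to is the net.inet.tcp.abc_l_var sysctl:

  # ifconfig em0 -lro               # disable LRO on the interface
  # ifconfig em0 lro                # re-enable it
  # sysctl net.inet.tcp.abc_l_var   # show the current ABC limit (default 2)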