Message-ID: <520AFBE8.1090109@freebsd.org>
Date: Wed, 14 Aug 2013 13:39:20 +1000
From: Lawrence Stewart
To: Julian Elischer
Cc: FreeBSD Net
Subject: Re: TSO and FreeBSD vs Linux
In-Reply-To: <520A6D07.5080106@freebsd.org>

On 08/14/13 03:29, Julian Elischer wrote:
> I have been tracking down a performance embarrassment on AMAZON EC2 and
> have found it I think.

Let us please avoid conflating performance with throughput.
The behaviour you go on to describe as a performance embarrassment is
actually a throughput difference, and the FreeBSD behaviour you're
describing is essentially sacrificing throughput and CPU cycles for
lower latency. That may not be a trade-off you like, but it is an
important factor in this discussion. Don't fall into the trap of
labelling Linux's propensity for maximising throughput as superior to
an alternative approach which strikes a different balance. It all
depends on the use case.

> Our OS cousins over at Linux land have implemented some interesting
> behaviour when TSO is in use.
>
> They seem to aggregate ACKS when there is a lot of traffic so that they
> can create the
> largest possible TSO packet. We on the other hand respond to each and
> every returning ACK, as it arrives and thus generally fall into the
> behaviour of sending a bunch of small packets, the size of each ack.

There's a feature called GRO (generic receive offload), controlled via
ethtool, which appears to be enabled by default on at least Ubuntu and,
I guess, other Linux distributions too. It's responsible for aggregating
ACKs and data so they can be passed up the stack in batches if the
driver doesn't provide a hardware offload implementation. Try rerunning
your experiments with the ACK batching disabled on the Linux host to get
an additional comparison point.

> for two examples look at:
>
>
> http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
> and
> http://www.freebsd.org/~julian/LvsF-tcp.tiff
>
> in each case, we can see FreeBSD on the left and Linux on the right.
>
> The first case shows the case as the sessions start, and the second case
> shows
> some distance later (when the sequence numbers wrap around.. no particular
> reason to use that, it was just fun to see).
> In both cases you can see that each Linux packet (white)(once they have got
> going) is responding to multiple bumps in the send window sequence
> number (green and yellow lines) (representing the arrival of several ACKs)
> while FreeBSD produces a whole bunch of smaller packets, slavishly
> following
> exactly the size of each incoming ack.. This gives us quite a
> performance debt.

Again, please s/performance/what-you-really-mean/ here.

> Notice that this behaviour in Linux seems to be modal.. it seems to
> 'switch on' a little bit
> into the 'starting' trace.
>
> In addition, you can see also that Linux gets going faster even in the
> beginning where
> TSO isn't in play, by sending a lot more packets up-front. (of course
> the wisdom of this
> can be argued).

They switched to using an initial window of 10 segments some time ago.
FreeBSD starts with 3 or, more recently, 10 if you're running a recent
9-STABLE or 10-CURRENT.

> Has anyone done any work on aggregating ACKs, or delaying responding to
> them?

As noted by Navdeep, we already have the code to aggregate ACKs in our
software LRO implementation. The bigger problem is that appropriate byte
counting (ABC) places a default 2*MSS limit on the amount of ACKed data
the window can grow by, i.e. if an ACK for 64k of data comes up the
stack, we'll grow the window by only 2 segments' worth of data in
response. That needs to be addressed - we could send the ACK count up
with the aggregated single ACK, or just ignore abc_l_var when LRO is in
use for a connection.

Cheers,
Lawrence
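
To put rough numbers on the 2*MSS limit mentioned above, here is a minimal sketch of the slow-start increment rule from RFC 3465, where the window grows by min(bytes ACKed, L*MSS) per ACK. It is an illustration only, not the FreeBSD code: the function names, the MSS of 1448 bytes and the assumption that un-aggregated ACKs each cover 2*MSS are all hypothetical choices, and only the slow-start case is modelled.

```python
# Sketch of RFC 3465 (ABC) slow-start growth; not the FreeBSD implementation.
# MSS and the per-ACK coverage below are assumed values for illustration.

def abc_increment(bytes_acked, mss, l_var=2):
    """Window growth for one ACK: capped at l_var * mss bytes."""
    return min(bytes_acked, l_var * mss)

def total_growth(ack_sizes, mss, l_var=2):
    """Sum the per-ACK growth over a sequence of ACKs."""
    return sum(abc_increment(n, mss, l_var) for n in ack_sizes)

MSS = 1448          # assumed segment size in bytes
TOTAL = 64 * 1024   # 64k of newly ACKed data

# Case 1: ACKs arrive individually, each covering 2*MSS (delayed ACKs).
per_ack = 2 * MSS
individual_acks = [per_ack] * (TOTAL // per_ack)
if TOTAL % per_ack:
    individual_acks.append(TOTAL % per_ack)

# Case 2: LRO has merged everything into a single ACK for the full 64k.
aggregated_acks = [TOTAL]

growth_individual = total_growth(individual_acks, MSS)
growth_aggregated = total_growth(aggregated_acks, MSS)

print("individual ACKs:", growth_individual, "bytes of window growth")
print("aggregated ACK: ", growth_aggregated, "bytes of window growth")
```

Under these assumptions the individually delivered ACKs grow the window by the full 64k, while the single aggregated ACK grows it by only 2*MSS. That gap is what passing the ACK count up with the aggregated ACK, or bypassing abc_l_var under LRO, would be meant to close.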