Date: Wed, 14 Aug 2013 14:33:04 +0800
From: Julian Elischer <julian@freebsd.org>
To: Lawrence Stewart <lstewart@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Subject: Re: TSO and FreeBSD vs Linux
Message-ID: <520B24A0.4000706@freebsd.org>
In-Reply-To: <520AFBE8.1090109@freebsd.org>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org>
On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> On 08/14/13 03:29, Julian Elischer wrote:
>> I have been tracking down a performance embarrassment on AMAZON EC2 and
>> have found it, I think.
> Let us please avoid conflating performance with throughput. The
> behaviour you go on to describe as a performance embarrassment is
> actually a throughput difference, and the FreeBSD behaviour you're
> describing is essentially sacrificing throughput and CPU cycles for
> lower latency. That may not be a trade-off you like, but it is an
> important factor in this discussion.

It was an embarrassment in that in one class of test we performed very
poorly. It was not a disaster or a show-stopper, but for our product it
is a critical number. It is a throughput difference, as you say, but
that is a very important part of performance. The latency of Linux
didn't seem to be any worse than FreeBSD's; the throughput was just a
lot higher in the same scenario.

> Don't fall into the trap of labelling Linux's propensity for maximising
> throughput as superior to an alternative approach which strikes a
> different balance. It all depends on the use case.

Well, the Linux balance seems to be "be better all around" at the
moment, so that is embarrassing. :-) I could see no latency regression.

>> Our OS cousins over at Linux land have implemented some interesting
>> behaviour when TSO is in use.
>>
>> They seem to aggregate ACKs when there is a lot of traffic so that they
>> can create the largest possible TSO packet. We, on the other hand,
>> respond to each and every returning ACK as it arrives, and thus
>> generally fall into the behaviour of sending a bunch of small packets,
>> the size of each ack.
> There's a thing controlled by ethtool called GRO (generic receive
> offload) which appears to be enabled by default on at least Ubuntu and I
> guess other Linux's too.
> It's responsible for aggregating ACKs and data to batch them up the
> stack if the driver doesn't provide a hardware offload implementation.
> Try rerunning your experiments with the ACK batching disabled on the
> Linux host to get an additional comparison point.

I will try that as soon as I get back to the machines in question.

>> for two examples look at:
>>
>> http://www.freebsd.org/~julian/LvsF-tcp-start.tiff
>> and
>> http://www.freebsd.org/~julian/LvsF-tcp.tiff
>>
>> in each case, we can see FreeBSD on the left and Linux on the right.
>>
>> The first case shows the sessions as they start, and the second shows
>> some distance later (when the sequence numbers wrap around.. no
>> particular reason to use that, it was just fun to see).
>> In both cases you can see that each Linux packet (white) (once they
>> have got going) is responding to multiple bumps in the send window
>> sequence number (green and yellow lines, representing the arrival of
>> several ACKs), while FreeBSD produces a whole bunch of smaller packets,
>> slavishly following exactly the size of each incoming ack. This gives
>> us quite a performance debt.
> Again, please s/performance/what-you-really-mean/ here.

OK: in my tests this makes FreeBSD data transfers much slower, by as
much as 60%.

>> Notice that this behaviour in Linux seems to be modal.. it seems to
>> 'switch on' a little bit into the 'starting' trace.
>>
>> In addition, you can also see that Linux gets going faster even in the
>> beginning, where TSO isn't in play, by sending a lot more packets
>> up-front. (Of course the wisdom of this can be argued.)
> They switched to using an initial window of 10 segments some time ago.
> FreeBSD starts with 3 or, more recently, 10 if you're running recent
> 9-STABLE or 10-CURRENT.
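For the extra comparison point Lawrence suggests, GRO can be toggled per-interface with ethtool on the Linux host. A minimal sketch, assuming root privileges and with `eth0` standing in for whichever interface is actually in use:

```shell
# Show the current state of generic receive offload
# (eth0 is a placeholder for the real interface name).
ethtool -k eth0 | grep generic-receive-offload

# Disable GRO so ACKs reach the stack unaggregated, then rerun
# the throughput test for the additional data point.
ethtool -K eth0 gro off

# Re-enable it afterwards.
ethtool -K eth0 gro on
```

If the NIC driver provides a hardware LRO implementation, that may need to be disabled separately (`ethtool -K eth0 lro off`) for the comparison to isolate the software ACK-batching behaviour.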
I tried setting initial values as shown:

net.inet.tcp.local_slowstart_flightsize: 10
net.inet.tcp.slowstart_flightsize: 10

It didn't seem to make much difference, but I will redo the test.

>> Has anyone done any work on aggregating ACKs, or delaying responding to
>> them?
> As noted by Navdeep, we already have the code to aggregate ACKs in our
> software LRO implementation. The bigger problem is that appropriate byte
> counting places a default 2*MSS limit on the amount of ACKed data the
> window can grow by, i.e. if an ACK for 64k of data comes up the stack,
> we'll grow the window by 2 segments' worth of data in response. That
> needs to be addressed - we could send the ACK count up with the
> aggregated single ACK or just ignore abc_l_var when LRO is in use for a
> connection.

So, does "software LRO" mean that LRO on the NIC should be ON or OFF to
see this?

> Cheers,
> Lawrence
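To make the appropriate-byte-counting limit Lawrence describes concrete, here is a small arithmetic sketch. The values are illustrative assumptions, not taken from the traces: a 1448-byte MSS and FreeBSD's default abc_l_var of 2, with a single LRO-aggregated ACK covering 64 KiB:

```shell
#!/bin/sh
# Sketch of the cwnd growth cap described above (appropriate byte
# counting, RFC 3465 style). All values are illustrative.
MSS=1448          # assumed segment size
ABC_L_VAR=2       # FreeBSD's default net.inet.tcp.abc_l_var
ACKED=65536       # one aggregated ACK covering 64 KiB of data

# Growth per ACK is capped at abc_l_var * MSS, no matter how many
# bytes that single aggregated ACK actually covers.
CAP=$((ABC_L_VAR * MSS))
if [ "$ACKED" -lt "$CAP" ]; then
    GROWTH=$ACKED
else
    GROWTH=$CAP
fi
echo "aggregated ACK of $ACKED bytes grows cwnd by only $GROWTH bytes"
# -> aggregated ACK of 65536 bytes grows cwnd by only 2896 bytes
```

Had the same 64 KiB arrived as ~45 individual ACKs, each could have grown the window by up to the cap, which is why Lawrence suggests either passing the ACK count up with the aggregated ACK or ignoring abc_l_var when LRO is in use.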