Date: Wed, 14 Aug 2013 17:23:02 +1000
From: Lawrence Stewart <lstewart@freebsd.org>
To: Julian Elischer <julian@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Subject: Re: TSO and FreeBSD vs Linux
Message-ID: <520B3056.1000804@freebsd.org>
In-Reply-To: <520B24A0.4000706@freebsd.org>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org>
On 08/14/13 16:33, Julian Elischer wrote:
> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>> On 08/14/13 03:29, Julian Elischer wrote:
>>> I have been tracking down a performance embarrassment on Amazon EC2
>>> and have found it, I think.
>>
>> Let us please avoid conflating performance with throughput. The
>> behaviour you go on to describe as a performance embarrassment is
>> actually a throughput difference, and the FreeBSD behaviour you're
>> describing is essentially sacrificing throughput and CPU cycles for
>> lower latency. That may not be a trade-off you like, but it is an
>> important factor in this discussion.
>
> It was an embarrassment in that in one class of test we performed very
> poorly. It was not a disaster or a show-stopper, but for our product it
> is a critical number.

Sure, there's nothing wrong with holding throughput up as a key
performance metric for your use case. I'm just trying to pre-empt a
discussion that focuses on one metric and fails to consider the bigger
picture.

> It is a throughput difference, as you say, but that is a very important
> part of performance... The latency of Linux didn't seem to be any worse
> than FreeBSD's; just the throughput was a lot higher in the same
> scenario.

The latency must increase when you delay packets in order to coalesce
them. Whether you were able to perceive that or not with the bulk-send
type testing and measurement that you're doing is a separate issue.

>> Don't fall into the trap of labelling Linux's propensity for
>> maximising throughput as superior to an alternative approach which
>> strikes a different balance. It all depends on the use case.
>
> Well, the Linux balance seems to be "be better all around" at this
> moment, so that is embarrassing. :-)

Better all round for you? Seems to be the case. Better all round for
everyone? No. Linux's choice of CUBIC as the default congestion control
algorithm is also a choice that maximises throughput but has side
effects which IMO are quite unfortunate.
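[For context, a sketch of how the congestion control algorithm can be
inspected and switched on a FreeBSD box with the modular cc(4)
framework; the sysctl names below are from that framework, but the
available algorithms vary by release and loaded modules.]

```shell
# List the congestion control algorithms currently available.
sysctl net.inet.tcp.cc.available

# Load the CUBIC module if it is not compiled into the kernel.
kldload cc_cubic

# Switch new connections to CUBIC (the Linux default discussed above).
sysctl net.inet.tcp.cc.algorithm=cubic
```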
> I could see no latency reversion.

You wouldn't, because it would be practically invisible in the sorts of
tests/measurements you're doing. Our good friends over at HRT, on the
other hand, would be far more likely to care about latency on the order
of microseconds. Again, the use case matters a lot.

[snip]

>>> Notice that this behaviour in Linux seems to be modal.. it seems to
>>> 'switch on' a little bit into the 'starting' trace.
>>>
>>> In addition, you can see also that Linux gets going faster even in
>>> the beginning where TSO isn't in play, by sending a lot more packets
>>> up-front. (Of course the wisdom of this can be argued.)
>>
>> They switched to using an initial window of 10 segments some time
>> ago. FreeBSD starts with 3 or, more recently, 10 if you're running
>> recent 9-STABLE or 10-CURRENT.
>
> I tried setting initial values as shown:
> net.inet.tcp.local_slowstart_flightsize: 10
> net.inet.tcp.slowstart_flightsize: 10
> It didn't seem to make too much difference, but I will redo the test.

Assuming this is still FreeBSD 8.0 as you mentioned out-of-band,
changing those variables without disabling rfc3390 will have no effect.

>>> Has anyone done any work on aggregating ACKs, or delaying responding
>>> to them?
>>
>> As noted by Navdeep, we already have the code to aggregate ACKs in
>> our software LRO implementation. The bigger problem is that
>> appropriate byte counting places a default 2*MSS limit on the amount
>> of ACKed data the window can grow by, i.e. if an ACK for 64k of data
>> comes up the stack, we'll grow the window by 2 segments' worth of
>> data in response. That needs to be addressed - we could send the ACK
>> count up with the aggregated single ACK, or just ignore abc_l_var
>> when LRO is in use for a connection.

BTW, as a workaround for the appropriate byte counting issue, you can
crank net.inet.tcp.abc_l_var up to the number of MSS segments you wish
to allow it to increase the window by (per ACK).
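[Pulling the tuning above together as one sketch: the flightsize
sysctls are from the 8.x-era stack being discussed and no longer exist
in modern FreeBSD, and per the note above they only take effect once
rfc3390 is disabled.]

```shell
# Disable RFC 3390 initial-window calculation first, otherwise the
# flightsize values below are ignored (FreeBSD 8.x-era sysctls).
sysctl net.inet.tcp.rfc3390=0
sysctl net.inet.tcp.slowstart_flightsize=10
sysctl net.inet.tcp.local_slowstart_flightsize=10

# Workaround for appropriate byte counting capping window growth at
# 2*MSS per ACK: allow growth of up to 64 MSS per ACK, matching LRO
# aggregating ~64 on-wire ACKs into one.
sysctl net.inet.tcp.abc_l_var=64
```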
If LRO is aggregating 64 on-wire ACKs into 1 mega-ACK, you should set
abc_l_var=64.

> So, does "software LRO" mean that LRO on the NIC should be ON or OFF
> to see this?

I think (check the driver code in question as I'm not sure) that if you
"ifconfig <if> lro" and the driver has hardware support or has been
made aware of our software implementation, it should DTRT.

Cheers,
Lawrence