Date: Fri, 16 Aug 2013 16:54:29 +0800
From: Julian Elischer
To: Luigi Rizzo
Cc: Lawrence Stewart, FreeBSD Net
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <520DE8C5.8070508@freebsd.org>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>

On 8/14/13 6:21 PM, Luigi Rizzo wrote:
> On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart
wrote:
>> On 08/14/13 16:33, Julian Elischer wrote:
>>> On 8/14/13 11:39 AM, Lawrence Stewart wrote:
>>>> On 08/14/13 03:29, Julian Elischer wrote:
>>>>> I have been tracking down a performance embarrassment on AMAZON EC2 and
>>>>> have found it I think.
>>>> Let us please avoid conflating performance with throughput. The
>>>> behaviour you go on to describe as a performance embarrassment is
>>>> actually a throughput difference, and the FreeBSD behaviour you're
>>>> describing is essentially sacrificing throughput and CPU cycles for
>>>> lower latency. That may not be a trade-off you like, but it is an
>>>> important factor in this discussion.
> ...
>> Sure, there's nothing wrong with holding throughput up as a key
>> performance metric for your use case.
>>
>> I'm just trying to pre-empt a discussion that focuses on one metric and
>> fails to consider the bigger picture.
> ...
>>> I could see no latency reversion.
>> You wouldn't because it would be practically invisible in the sorts of
>> tests/measurements you're doing. Our good friends over at HRT on the
>> other hand would be far more likely to care about latency on the order
>> of microseconds. Again, the use case matters a lot.
> ...
>>> so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
>>> see this?
>> I think (check the driver code in question as I'm not sure) that if you
>> "ifconfig lro" and the driver has hardware support or has been made
>> aware of our software implementation, it should DTRT.
> The "lower throughput than linux" that julian was seeing is either
> because of a slow (CPU-bound) sender or slow receiver. Given that
> the FreeBSD tx path is quite expensive (redoing route and arp lookups
> on every packet, etc.) I highly suspect the sender side is at fault.

if we send bigger packets then we do fewer lookups, do we not?
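To put rough numbers on that question, here is a back-of-the-envelope sketch. It assumes exactly one route + ARP lookup per chunk handed down the tx path, which is a simplification of what Luigi describes. The sizes (a 1448-byte MSS, a 64 KiB TSO chunk, a 1 GiB transfer) and the function name are made up for illustration and are not taken from the stack.

```python
# Illustrative arithmetic only: assumes one route+ARP lookup per chunk
# passed down the transmit path. All constants below are hypothetical.

MSS_BYTES = 1448            # assumed payload of one MTU-sized segment
TSO_BYTES = 64 * 1024       # assumed size of one large TSO chunk
TRANSFER_BYTES = 1 << 30    # 1 GiB of payload to move


def lookups_needed(total_bytes: int, bytes_per_send: int) -> int:
    """Number of sends (and so lookups) needed to move total_bytes."""
    if bytes_per_send <= 0:
        raise ValueError("bytes_per_send must be positive")
    # Integer ceiling division: a final partial chunk still costs a lookup.
    return -(-total_bytes // bytes_per_send)


if __name__ == "__main__":
    per_mss = lookups_needed(TRANSFER_BYTES, MSS_BYTES)
    per_tso = lookups_needed(TRANSFER_BYTES, TSO_BYTES)
    print("lookups with MSS-sized sends:", per_mss)
    print("lookups with TSO-sized sends:", per_tso)
    print("ratio: %.1f" % (per_mss / per_tso))
```

Under those assumptions the lookup count drops in proportion to the chunk size, so the saving only appears if tcp_output actually hands down large chunks instead of MSS-sized ones.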
>
> Ack coalescing, LRO, GRO are limited to the set of packets that you
> receive in the same batch, which in turn is upper bounded by the
> interrupt moderation delay. Apart from simple benchmarks with only
> a few flows, it is very hard that ack/lro/gro can coalesce more
> than a few segments for the same flow.
>
> But the real fix is in tcp_output.
>
> In fact, it has never been the case that an ack (single or coalesced)
> triggers an immediate transmission in the output path. We had this
> in the past (Silly Window Syndrome) and there is code that avoids
> sending less than 1-mtu under appropriate conditions (there is more
> data to push out anyways, no NODELAY, there are outstanding acks,
> the window can open further). In all these cases there is no
> reasonable way to experience the difference in terms of latency.
>
> If one really cares, e.g. the High Speed Trading example, this is
> a non issue because any reasonable person would run with TCP_NODELAY
> (and possibly disable interrupt moderation), and optimize for latency
> even on a per flow basis.
>
> In terms of coding effort, i suspect that by replacing the 1-mtu
> limit (t_maxseg i believe is the variable that we use in the SWS
> avoidance code) with 1-max-tso-segment we can probably achieve good
> results with little programming effort.
>
> Then the problem remains that we should keep a copy of route and
> arp information in the socket instead of redoing the lookups on
> every single transmission, as they consume some 25% of the time of
> a sendto(), and probably even more when it comes to large tcp
> segments, sendfile() and the like.
>
> cheers
> luigi
>
>
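To make the "replace 1-mtu with 1-max-tso-segment" idea concrete, here is a toy model of the send decision. It is not the tcp_output code: it keeps only the conditions Luigi lists (enough data to fill the threshold, NODELAY, outstanding acks, whether the send would drain the buffer), and the function name, parameters and byte sizes are all hypothetical.

```python
# Toy model of a silly-window-avoidance send decision. NOT the kernel
# logic; names and constants are hypothetical and chosen for illustration.

MSS_BYTES = 1448        # assumed stand-in for the 1-mtu limit (t_maxseg)
MAX_TSO_BYTES = 65535   # assumed stand-in for one maximum TSO segment


def should_send_now(sendable: int, threshold: int, *,
                    nodelay: bool = False,
                    unacked: int = 0,
                    drains_buffer: bool = False) -> bool:
    """Return True if the model would transmit right away.

    sendable      -- bytes the window currently allows us to send
    threshold     -- minimum burst worth sending on its own
    nodelay       -- models TCP_NODELAY being set on the socket
    unacked       -- bytes in flight (acks still outstanding)
    drains_buffer -- True if this send would empty the send buffer
    """
    if sendable <= 0:
        return False
    if sendable >= threshold:
        return True                  # a full-sized burst is always worth it
    idle = unacked == 0
    if (idle or nodelay) and drains_buffer:
        return True                  # nothing to gain by waiting
    return False                     # wait for more acks to open the window


if __name__ == "__main__":
    three_segments = 3 * MSS_BYTES
    # More data is queued and acks are outstanding in both cases.
    print(should_send_now(three_segments, MSS_BYTES, unacked=10 * MSS_BYTES))
    print(should_send_now(three_segments, MAX_TSO_BYTES, unacked=10 * MSS_BYTES))
```

In this model, raising the threshold makes a bulk sender hold back until a larger burst is available, while a NODELAY flow whose send drains the buffer still goes out immediately, which matches the point that latency-sensitive flows would not be affected.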