Date: Sat, 17 Aug 2013 06:14:04 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Luigi Rizzo <rizzo@iet.unipi.it>, Lawrence Stewart <lstewart@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it>
Horsehockey. What are you guys running with, P4s? Modern CPUs are magnificently fast. The triviality of lookups is a non-issue in almost all cases. The ability of modern CPUs to fill a transmit queue faster than the data can be transmitted is incontrovertible. With TCP you have windows and things; trying to drill down to hardware inefficiencies as if you're running on a 200MHz P4 is just silly.

I abandoned hardware offloads back when someone tried to sell me on data compression boards; the truth is that the IO overhead of copying to and from the board was higher than the CPU cycles needed to compress the data.

The failure to understand how IO and locks interfere with traffic flow on multicore systems is the biggest problem with driver development; all of this chatter about moderation is simply a waste of time; such things are completely tunable; a task that gets far too little attention IMO. Tuning can make a world of difference if you understand what you're doing.

The idea that having 400K ints/second to gain a tock of throughput is an acceptable trade-off is patently absurd.

EFFICIENCY is tantamount. Throughput is almost always a tuning issue.

BC

________________________________
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: Lawrence Stewart <lstewart@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Sent: Wednesday, August 14, 2013 6:21 AM
Subject: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)

On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
> On 08/14/13 16:33, Julian Elischer wrote:
> > On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> >> On 08/14/13 03:29, Julian Elischer wrote:
> >>> I have been tracking down a performance embarrassment on AMAZON EC2 and
> >>> have found it I think.
> >> Let us please avoid conflating performance with throughput. The
> >> behaviour you go on to describe as a performance embarrassment is
> >> actually a throughput difference, and the FreeBSD behaviour you're
> >> describing is essentially sacrificing throughput and CPU cycles for
> >> lower latency. That may not be a trade-off you like, but it is an
> >> important factor in this discussion.
...
> Sure, there's nothing wrong with holding throughput up as a key
> performance metric for your use case.
>
> I'm just trying to pre-empt a discussion that focuses on one metric and
> fails to consider the bigger picture.
...
> > I could see no latency reversion.
>
> You wouldn't because it would be practically invisible in the sorts of
> tests/measurements you're doing. Our good friends over at HRT on the
> other hand would be far more likely to care about latency on the order
> of microseconds. Again, the use case matters a lot.
...
> > so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
> > see this?
>
> I think (check the driver code in question as I'm not sure) that if you
> "ifconfig <if> lro" and the driver has hardware support or has been made
> aware of our software implementation, it should DTRT.

The "lower throughput than Linux" that Julian was seeing is either because of a slow (CPU-bound) sender or slow receiver. Given that the FreeBSD tx path is quite expensive (redoing route and ARP lookups on every packet, etc.) I highly suspect the sender side is at fault.

Ack coalescing, LRO, GRO are limited to the set of packets that you receive in the same batch, which in turn is upper bounded by the interrupt moderation delay. Apart from simple benchmarks with only a few flows, it is very unlikely that ack/lro/gro can coalesce more than a few segments for the same flow.

But the real fix is in tcp_output.

In fact, it has never been the case that an ack (single or coalesced) triggers an immediate transmission in the output path. We had this in the past (Silly Window Syndrome) and there is code that avoids sending less than 1-mtu under appropriate conditions (there is more data to push out anyways, no NODELAY, there are outstanding acks, the window can open further). In all these cases there is no reasonable way to experience the difference in terms of latency.

If one really cares, e.g. the High Speed Trading example, this is a non issue because any reasonable person would run with TCP_NODELAY (and possibly disable interrupt moderation), and optimize for latency even on a per flow basis.

In terms of coding effort, I suspect that by replacing the 1-mtu limit (t_maxseg I believe is the variable that we use in the SWS avoidance code) with 1-max-tso-segment we can probably achieve good results with little programming effort.

Then the problem remains that we should keep a copy of route and ARP information in the socket instead of redoing the lookups on every single transmission, as they consume some 25% of the time of a sendto(), and probably even more when it comes to large tcp segments, sendfile() and the like.
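
For the per-flow latency tuning mentioned above, a minimal userland sketch of turning on TCP_NODELAY might look like the following. It uses only Python's standard socket module; the helper names make_low_latency_socket and nodelay_enabled are made up for illustration and are not part of any FreeBSD API. Interrupt moderation is a driver-level setting and is not touched here.

```python
import socket


def make_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Per-socket option: small writes are sent without waiting to be coalesced.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


if __name__ == "__main__":
    s = make_low_latency_socket()
    print("TCP_NODELAY set:", nodelay_enabled(s))
    s.close()
```

Because the option is per socket, a latency-sensitive flow can opt in while bulk-transfer flows on the same host keep the default coalescing behaviour.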
cheers
luigi
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Sat Aug 17 14:02:58 2013
Date: Sat, 17 Aug 2013 07:02:50 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: Luigi Rizzo <rizzo@iet.unipi.it>, Lawrence Stewart <lstewart@freebsd.org>
Cc: FreeBSD Net <net@freebsd.org>
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>
In-Reply-To: <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it> <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>

>>EFFICIENCY is tantamount. Throughput is almost always a tuning issue.

Of course I meant paramount. Coffee matters :-|

From owner-freebsd-net@FreeBSD.ORG Sat Aug 17 15:59:13 2013
Date: Sat, 17 Aug 2013 08:59:11 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: Barney Cordoba <barney_cordoba@yahoo.com>
Cc: Lawrence Stewart <lstewart@freebsd.org>, Luigi Rizzo <rizzo@iet.unipi.it>, FreeBSD Net <net@freebsd.org>
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
Message-ID: <CAJ-VmonGeqn5qqbfvF9xWaFPYNMNSVb6VwMx+oEVSGXVid98ag@mail.gmail.com>
In-Reply-To: <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>
References: <520A6D07.5080106@freebsd.org> <520AFBE8.1090109@freebsd.org> <520B24A0.4000706@freebsd.org> <520B3056.1000804@freebsd.org> <20130814102109.GA63246@onelab2.iet.unipi.it> <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com> <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>

... we get perfectly good throughput without 400k ints a second on the ixgbe driver. As in, I can easily saturate 2 x 10GE on ixgbe hardware with a handful of flows. That's not terribly difficult.

However, there's a few interesting problems that need addressing:

* There's lock contention between the transmit side from userland and the TCP timers, and the receive side with ACK processing. Under very high traffic load a lot of lock contention stalls things. We (the royal "we", I'm mostly just doing tooling at the moment) are working on that.
* There's lock contention on the ARP, routing table and PCB lookups. The latter will go away when we've finally implemented RSS for transmit and receive and then moved things over to using PCB groups on CPUs which have NIC driver threads bound to them.
* There's increasing cache thrashing from a larger workload, causing the expensive lookups to be even more expensive.
* All the list walks suck. We need to be batching things so we use CPU caches much more efficiently.

The idea of using TSO on the transmit side and generic LRO on the receive side is to make the per-packet overhead less. I think we can be much more efficient in general in packet processing, but that's a big task. :-) So, using at least TSO is a big benefit if purely to avoid decomposing things into smaller mbufs and contending on those locks in a very big way.

I'm working on PMC to make it easier to use to find these bottlenecks and make the code and data more efficient. Then, likely, I'll end up hacking on generic TSO/LRO, TX/RX RSS queue management and make the PCB group thing default on for SMP machines. I may even take a knife to some of the packet processing overhead.

-adrian
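
To put a rough number on the per-packet overhead argument above, the sketch below counts how many passes through the stack a payload needs when it is handed down one MSS-sized segment at a time versus one TSO-sized chunk at a time. The 1448-byte MSS and the 65535-byte TSO limit are assumed illustrative values, not figures from this thread, and the function name stack_traversals is hypothetical.

```python
ASSUMED_MSS = 1448        # assumption: Ethernet MSS with TCP timestamps enabled
ASSUMED_TSO_MAX = 65535   # assumption: upper bound for a single TSO chunk


def stack_traversals(payload_bytes: int, unit_bytes: int) -> int:
    """Number of units needed to carry the payload, one stack pass per unit."""
    if payload_bytes < 0 or unit_bytes <= 0:
        raise ValueError("payload must be >= 0 and unit size must be > 0")
    # Integer ceiling division avoids floating-point rounding.
    return -(-payload_bytes // unit_bytes)


if __name__ == "__main__":
    payload = 1_000_000
    print("passes per MSS-sized segment:", stack_traversals(payload, ASSUMED_MSS))
    print("passes per TSO-sized chunk:  ", stack_traversals(payload, ASSUMED_TSO_MAX))
```

With these assumed sizes a 1,000,000-byte send drops from 691 passes to 16. The real saving depends on the NIC's TSO limit and on how much data the output path hands down at once, and each pass avoided is also one fewer round of lookups and lock acquisitions.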
