From owner-freebsd-net@FreeBSD.ORG Sat Aug 17 13:14:12 2013
Message-ID: <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>
Date: Sat, 17 Aug 2013 06:14:04 -0700 (PDT)
From: Barney Cordoba
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
To: Luigi Rizzo, Lawrence Stewart
Cc: FreeBSD Net
In-Reply-To: <20130814102109.GA63246@onelab2.iet.unipi.it>

Horsehockey. What are you guys running with, P4s?

Modern cpus are magnificently fast. The triviality of lookups is a non-issue
in almost all cases. The ability of modern cpus to fill a transmit queue faster
than the data can be transmitted is incontrovertible.

With TCP you have windows and things; trying to drill down to hardware
inefficiencies as if you're running on a 200Mhz P4 is just silly.

I abandoned hardware offloads back when someone tried to sell me on data
compression boards; the truth is that the IO overhead of copying to and from
the board was higher than the cpu cycles needed to compress the data.


The failure to understand how IO and locks interfere with traffic flow on
multicore systems is the biggest problem with driver development; all of this
chatter about moderation is simply a waste of time; such things are completely
tunable; a task that gets far too little attention IMO. Tuning can make a world
of difference if you understand what you're doing.

The idea that having 400K ints/second to gain a tock of throughput is an acceptable
trade-off is patently absurd.

EFFICIENCY is tantamount.
Throughput is almost always a tuning issue.


BC

________________________________
From: Luigi Rizzo
To: Lawrence Stewart
Cc: FreeBSD Net
Sent: Wednesday, August 14, 2013 6:21 AM
Subject: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)


On Wed, Aug 14, 2013 at 05:23:02PM +1000, Lawrence Stewart wrote:
> On 08/14/13 16:33, Julian Elischer wrote:
> > On 8/14/13 11:39 AM, Lawrence Stewart wrote:
> >> On 08/14/13 03:29, Julian Elischer wrote:
> >>> I have been tracking down a performance embarrassment on AMAZON EC2 and
> >>> have found it I think.
> >> Let us please avoid conflating performance with throughput. The
> >> behaviour you go on to describe as a performance embarrassment is
> >> actually a throughput difference, and the FreeBSD behaviour you're
> >> describing is essentially sacrificing throughput and CPU cycles for
> >> lower latency. That may not be a trade-off you like, but it is an
> >> important factor in this discussion.
...
> Sure, there's nothing wrong with holding throughput up as a key
> performance metric for your use case.
>
> I'm just trying to pre-empt a discussion that focuses on one metric and
> fails to consider the bigger picture.
...
> > I could see no latency reversion.
>
> You wouldn't because it would be practically invisible in the sorts of
> tests/measurements you're doing. Our good friends over at HRT on the
> other hand would be far more likely to care about latency on the order
> of microseconds. Again, the use case matters a lot.
...
> > so, does "Software LRO" mean that LRO on the NIC should be ON or OFF to
> > see this?
>
> I think (check the driver code in question as I'm not sure) that if you
> "ifconfig lro" and the driver has hardware support or has been made
> aware of our software implementation, it should DTRT.

The "lower throughput than linux" that julian was seeing is either
because of a slow (CPU-bound) sender or slow receiver. Given that
the FreeBSD tx path is quite expensive (redoing route and arp lookups
on every packet, etc.) I highly suspect the sender side is at fault.

Ack coalescing, LRO, GRO are limited to the set of packets that you
receive in the same batch, which in turn is upper bounded by the
interrupt moderation delay. Apart from simple benchmarks with only
a few flows, it is very hard for ack/lro/gro to coalesce more
than a few segments for the same flow.

    But the real fix is in tcp_output.

In fact, it has never been the case that an ack (single or coalesced)
triggers an immediate transmission in the output path.  We had this
in the past (Silly Window Syndrome) and there is code that avoids
sending less than 1-mtu under appropriate conditions (there is more
data to push out anyways, no NODELAY, there are outstanding acks,
the window can open further).  In all these cases there is no
reasonable way to experience the difference in terms of latency.

If one really cares, e.g. the High Speed Trading example, this is
a non-issue because any reasonable person would run with TCP_NODELAY
(and possibly disable interrupt moderation), and optimize for latency
even on a per flow basis.

In terms of coding effort, i suspect that by replacing the 1-mtu
limit (t_maxseg i believe is the variable that we use in the SWS
avoidance code) with 1-max-tso-segment we can probably achieve good
results with little programming effort.

Then the problem remains that we should keep a copy of route and
arp information in the socket instead of redoing the lookups on
every single transmission, as they consume some 25% of the time of
a sendto(), and probably even more when it comes to large tcp
segments, sendfile() and the like.

    cheers
    luigi
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

From owner-freebsd-net@FreeBSD.ORG Sat Aug 17 14:02:58 2013
Message-ID: <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>
Date: Sat, 17 Aug 2013 07:02:50 -0700 (PDT)
From: Barney Cordoba
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
To: Luigi Rizzo, Lawrence Stewart
Cc: FreeBSD Net
In-Reply-To: <1376745244.6575.YahooMailNeo@web121606.mail.ne1.yahoo.com>

>>EFFICIENCY is tantamount. Throughput is almost always a tuning issue.


Of course I meant paramount.
Coffee matters :-|

From owner-freebsd-net@FreeBSD.ORG Sat Aug 17 15:59:13 2013
Date: Sat, 17 Aug 2013 08:59:11 -0700
From: Adrian Chadd
Subject: Re: it's the output, not ack coalescing (Re: TSO and FreeBSD vs Linux)
To: Barney Cordoba
Cc: Lawrence Stewart, Luigi Rizzo, FreeBSD Net
In-Reply-To: <1376748170.66110.YahooMailNeo@web121601.mail.ne1.yahoo.com>

... we get perfectly good throughput without 400k ints a second on the
ixgbe driver. As in, I can easily saturate 2 x 10GE on ixgbe hardware
with a handful of flows. That's not terribly difficult.

However, there's a few interesting problems that need addressing:

* There's lock contention between the transmit side from userland and
  the TCP timers, and the receive side with ACK processing. Under very
  high traffic load a lot of lock contention stalls things. We (the
  royal "we", I'm mostly just doing tooling at the moment) are working
  on that.
* There's lock contention on the ARP, routing table and PCB lookups.
  The latter will go away when we've finally implemented RSS for
  transmit and receive and then moved things over to using PCB groups
  on CPUs which have NIC driver threads bound to them.
* There's increasing cache thrashing from a larger workload, causing
  the expensive lookups to be even more expensive.
* All the list walks suck. We need to be batching things so we use CPU
  caches much more efficiently.

The idea of using TSO on the transmit side and generic LRO on the
receive side is to make the per-packet overhead less. I think we can be
much more efficient in general in packet processing, but that's a big
task. :-) So, using at least TSO is a big benefit if purely to avoid
decomposing things into smaller mbufs and contending on those locks in
a very big way.

I'm working on PMC to make it easier to use to find these bottlenecks
and make the code and data more efficient. Then, likely, I'll end up
hacking on generic TSO/LRO, TX/RX RSS queue management and make the PCB
group thing default on for SMP machines. I may even take a knife to
some of the packet processing overhead.

-adrian
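
Adrian's remark about batching can be illustrated with a toy model. The
sketch below is not FreeBSD code: CountingLock, enqueue_one_by_one and
enqueue_batched are hypothetical names, and a Python list stands in for
a transmit queue. It only shows how handing packets over in batches
reduces the number of times a shared lock has to be taken.

```python
import threading


class CountingLock:
    """A lock that counts how often it is acquired (illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()
        return False


def enqueue_one_by_one(packets, queue, lock):
    """Take the lock once per packet."""
    for pkt in packets:
        with lock:
            queue.append(pkt)


def enqueue_batched(packets, queue, lock, batch=32):
    """Take the lock once per batch of up to `batch` packets."""
    for start in range(0, len(packets), batch):
        with lock:
            queue.extend(packets[start:start + batch])


packets = list(range(1000))
per_packet_lock, batched_lock = CountingLock(), CountingLock()
enqueue_one_by_one(packets, [], per_packet_lock)
enqueue_batched(packets, [], batched_lock, batch=32)
print(per_packet_lock.acquisitions, batched_lock.acquisitions)
```

With 1000 packets and a batch size of 32, the per-packet version takes
the lock 1000 times and the batched version 32 times, while the queue
ends up with the same contents in the same order.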
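
Luigi's bound on coalescing (LRO/GRO can only merge what arrives in one
batch, and the batch is bounded by the interrupt moderation delay) can
be put in rough numbers. The sketch below assumes a saturated 10 Gb/s
link carrying full-size 1500-byte packets and flows that interleave
evenly. The function names and the 8000 interrupts/s figure are
illustrative assumptions; only the 400K ints/second figure comes from
the thread.

```python
# Back-of-the-envelope only; names and the even-interleaving assumption
# are hypothetical.

# Bytes one full-size frame occupies on the wire: 1500 (IP packet)
# + 14 (Ethernet header) + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap).
WIRE_BYTES = 1500 + 14 + 4 + 8 + 12


def frames_per_interrupt(link_bps, ints_per_sec):
    """Upper bound on full-size frames arriving between two interrupts."""
    return link_bps // (8 * WIRE_BYTES * ints_per_sec)


def segments_per_flow(link_bps, ints_per_sec, flows):
    """Segments of one flow per batch, if flows interleave evenly."""
    return frames_per_interrupt(link_bps, ints_per_sec) // flows


TEN_GBE = 10_000_000_000
for flows in (1, 4, 100):
    print(flows, segments_per_flow(TEN_GBE, 8_000, flows))
print(frames_per_interrupt(TEN_GBE, 400_000))
```

Under these assumptions a batch holds about 101 frames at 8000
interrupts/s, so a single flow could coalesce many segments, but 100
flows get roughly one segment each. At 400K interrupts/s a batch holds
only about 2 frames, which leaves almost nothing to coalesce.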
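
Luigi's suggestion of raising the SWS-avoidance limit from one segment
to one maximum TSO segment can be sketched as a toy model. This is not
FreeBSD's tcp_output(): the function name, the MSS value, the 44-segment
TSO burst and the "every ACK opens two segments of window" pattern are
all assumptions made for illustration. The model only counts how many
times the sender goes down the transmit path for a bulk transfer.

```python
# Toy model only: this is NOT FreeBSD's tcp_output().
# Names and constants are hypothetical.

MSS = 1448           # assumed payload bytes per segment (stand-in for t_maxseg)
MAX_TSO = 44 * MSS   # assumed largest burst handed to the NIC in one TSO request


def count_output_calls(total_bytes, ack_opens, threshold):
    """Count passes down the transmit path for one bulk transfer.

    Every ACK opens `ack_opens` bytes of window. The sender holds back
    until `threshold` bytes are sendable, except when flushing the
    final bytes of the transfer.
    """
    if ack_opens <= 0:
        raise ValueError("ack_opens must be positive")
    threshold = min(threshold, MAX_TSO)   # a burst can never exceed MAX_TSO
    remaining, sendable, calls = total_bytes, 0, 0
    while remaining > 0:
        sendable += ack_opens             # an ACK arrives and opens the window
        burst = min(sendable, remaining, MAX_TSO)
        if burst >= threshold or burst == remaining:
            calls += 1                    # one pass down the tx path
            remaining -= burst
            sendable -= burst
    return calls


transfer = 880 * MSS
print(count_output_calls(transfer, 2 * MSS, MSS))      # limit of one segment
print(count_output_calls(transfer, 2 * MSS, MAX_TSO))  # limit of one TSO burst
```

In this model, raising the limit from one segment to one TSO burst cuts
the passes for the same transfer from 440 to 20. The price is that each
send waits for more ACKs, which is the latency side of the trade-off
Lawrence raises in the quoted text.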