From owner-freebsd-net@freebsd.org Wed Aug 19 13:20:26 2015 Return-Path: Delivered-To: freebsd-net@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5C75C9BE12B; Wed, 19 Aug 2015 13:20:26 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from kabab.cs.huji.ac.il (kabab.cs.huji.ac.il [132.65.116.210]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id A1BB3127F; Wed, 19 Aug 2015 13:20:25 +0000 (UTC) (envelope-from danny@cs.huji.ac.il) Received: from chamsa.cs.huji.ac.il ([132.65.80.19]) by kabab.cs.huji.ac.il with esmtp id 1ZS3Hs-000F95-7t; Wed, 19 Aug 2015 16:20:16 +0300 Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2104\)) Subject: Re: ix(intel) vs mlxen(mellanox) 10Gb performance From: Daniel Braniss In-Reply-To: <2013503980.25726607.1439989235806.JavaMail.zimbra@uoguelph.ca> Date: Wed, 19 Aug 2015 16:20:15 +0300 Cc: Hans Petter Selasky , pyunyh@gmail.com, FreeBSD stable , FreeBSD Net , Slawa Olhovchenkov , Christopher Forgeron Message-Id: <2BF7FA92-2DDD-452C-822C-534C0DC0B49F@cs.huji.ac.il> References: <1D52028A-B39F-4F9B-BD38-CB1D73BF5D56@cs.huji.ac.il> <9D8B0503-E8FA-43CA-88F0-01F184F84D9B@cs.huji.ac.il> <1721122651.24481798.1439902381663.JavaMail.zimbra@uoguelph.ca> <55D333D6.5040102@selasky.org> <1325951625.25292515.1439934848268.JavaMail.zimbra@uoguelph.ca> <55D429A4.3010407@selasky.org> <20150819074212.GB964@michelle.fasterthan.com> <55D43615.1030401@selasky.org> <2013503980.25726607.1439989235806.JavaMail.zimbra@uoguelph.ca> To: Rick Macklem X-Mailer: Apple Mail (2.2104) Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.20 X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.20 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 19 Aug 2015 13:20:26 -0000 > On 19 Aug 2015, at 16:00, Rick Macklem wrote: >=20 > Hans Petter Selasky wrote: >> On 08/19/15 09:42, Yonghyeon PYUN wrote: >>> On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: >>>> On 08/18/15 23:54, Rick Macklem wrote: >>>>> Ouch! Yes, I now see that the code that counts the # of mbufs is = before >>>>> the >>>>> code that adds the tcp/ip header mbuf. >>>>>=20 >>>>> In my opinion, this should be fixed by setting = if_hw_tsomaxsegcount to >>>>> whatever >>>>> the driver provides - 1. It is not the driver's responsibility to = know if >>>>> a tcp/ip >>>>> header mbuf will be added and is a lot less confusing that = expecting the >>>>> driver >>>>> author to know to subtract one. (I had mistakenly thought that >>>>> tcp_output() had >>>>> added the tc/ip header mbuf before the loop that counts mbufs in = the >>>>> list. >>>>> Btw, >>>>> this tcp/ip header mbuf also has leading space for the MAC layer = header.) >>>>>=20 >>>>=20 >>>> Hi Rick, >>>>=20 >>>> Your question is good. With the Mellanox hardware we have separate >>>> so-called inline data space for the TCP/IP headers, so if the TCP = stack >>>> subtracts something, then we would need to add something to the = limit, >>>> because then the scatter gather list is only used for the data = part. >>>>=20 >>>=20 >>> I think all drivers in tree don't subtract 1 for >>> if_hw_tsomaxsegcount. Probably touching Mellanox driver would be >>> simpler than fixing all other drivers in tree. >>>=20 >>>> Maybe it can be controlled by some kind of flag, if all the three = TSO >>>> limits should include the TCP/IP/ethernet headers too. I'm pretty = sure >>>> we want both versions. >>>>=20 >>>=20 >>> Hmm, I'm afraid it's already complex. Drivers have to tell almost >>> the same information to both bus_dma(9) and network stack. >>=20 >> Don't forget that not all drivers in the tree set the TSO limits = before >> if_attach(), so possibly the subtraction of one TSO fragment needs to = go >> into ip_output() .... >>=20 > Ok, I realized that some drivers may not know the answers before = ether_ifattach(), > due to the way they are configured/written (I saw the use of = if_hw_tsomax_update() > in the patch). >=20 > If it is subtracted as a part of the assignment to = if_hw_tsomaxsegcount in tcp_output() > at line#791 in tcp_output() like the following, I don't think it = should matter if the > values are set before ether_ifattach()? > /* > * Subtract 1 for the tcp/ip header mbuf that > * will be prepended to the mbuf chain in this > * function in the code below this block. > */ > if_hw_tsomaxsegcount =3D tp->t_tsomaxsegcount - = 1; >=20 > I don't have a good solution for the case where a driver doesn't plan = on using the > tcp/ip header provided by tcp_output() except to say the driver can = add one to the > setting to compensate for that (and if they fail to do so, it still = works, although > somewhat suboptimally). When I now read the comment in = sys/net/if_var.h it is clear > what it means, but for some reason I didn't read it that way before? = (I think it was > the part that said the driver didn't have to subtract for the headers = that confused me?) > In any case, we need to try and come up with a clear definition of = what they need to > be set to. >=20 > I can now think of two ways to deal with this: > 1 - Leave tcp_output() as is, but provide a macro for the device = driver authors to use > that sets if_hw_tsomaxsegcount with a flag for "driver uses tcp/ip = header mbuf", > documenting that this flag should normally be true. > OR > 2 - Change tcp_output() as above, noting that this is a workaround for = confusion w.r.t. > whether or not if_hw_tsomaxsegcount should include the tcp/ip = header mbuf and > update the comment in if_var.h to reflect this. Then drivers that = don't use the > tcp/ip header mbuf can increase their value for = if_hw_tsomaxsegcount by 1. > (The comment should also mention that a value of 35 or greater is = much preferred to > 32 if the hardware will support that.) >=20 > Also, I'd like to apologize for some of my emails getting a little = "blunt". I just find > it flustrating that this problem is still showing up and is even in = 10.2. This is partly > my fault for not making it clearer to driver authors what = if_hw_tsomaxsegcount should be > set to, because I had it incorrect. >=20 > Hopefully we can come up with a solution that everyone is comfortable = with, rick ok guys, when you have some code for me to try just let me know. danny