From owner-freebsd-net@FreeBSD.ORG  Sun Feb  2 11:06:20 2014
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: Daniel Braniss
To: Rick Macklem
Cc: Pyun YongHyeon, FreeBSD Net, Adam McDougall, Jack Vogel
Date: Sun, 2 Feb 2014 13:04:31 +0200
In-Reply-To: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
References: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
List-Id: Networking and TCP/IP with FreeBSD

hi Rick, et al.
	tried your patch but it didn't help; the server is stuck.

just for fun, I tried a different client/host. This one has a Broadcom
NetXtreme II that was MFC'ed recently, and the results are worse than
with the Intel (5 hrs instead of 4 hrs), but it is still faster without
TSO.

with TSO enabled and bs=32k: 5.09 hrs
    18325.62 real      1109.23 user      4591.60 sys
without TSO: 4.75 hrs
    17120.40 real      1114.08 user      3537.61 sys

So what is the advantage of using TSO? (no complaint here, just curious)

I'll try to see whether it has the same TSO-related issues when acting
as a server.

cheers,
	danny

On Jan 28, 2014, at 3:51 AM, Rick Macklem wrote:

> Jack Vogel wrote:
>> That header file is for the VF driver :) which I don't believe is
>> being used in this case.
>> The driver is capable of handling 256K but it's limited by the stack
>> to 64K (look in ixgbe.h), so it's not a few bytes off due to the vlan
>> header.
>>
>> The scatter size is not an arbitrary one, it's due to hardware
>> limitations in Niantic (82599). Turning off TSO in the 10G
>> environment is not practical; you will have trouble getting good
>> performance.
>>
>> Jack
>>
> Well, if you look at this thread, Daniel got much better performance
> by turning off TSO. However, I agree that this is not an ideal solution.
> http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
>
> rick
>
>> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN wrote:
>>
>>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
>>>> pyunyh@gmail.com wrote:
>>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
>>>>>> Adam McDougall wrote:
>>>>>>> Also try rsize=32768,wsize=32768 in your mount options; it
>>>>>>> made a huge difference for me.
>>>>>>> I've noticed slow file transfers on NFS in 9 and finally did
>>>>>>> some searching a couple of months ago; someone suggested it and
>>>>>>> they were on to something.
>>>>>>>
>>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
>>>>>> perform poorly in some network environments.
>>>>>> A 64K NFS read reply/write request consists of a list of 34 mbufs
>>>>>> when passed to TCP via sosend(), with a total data length of
>>>>>> around 65680 bytes. Looking at a couple of drivers (virtio and
>>>>>> ixgbe), they seem to expect no more than 32-33 mbufs in a list
>>>>>> for a 65535 byte TSO xmit. I think (I don't have anything that
>>>>>> does TSO to confirm this) that NFS will pass a list that is
>>>>>> longer (34 plus a TCP/IP header).
>>>>>> At a glance, it appears that the drivers call m_defrag() or
>>>>>> m_collapse() when the mbuf list won't fit in their scatter table
>>>>>> (32 or 33 elements) and, if this fails, just silently drop the
>>>>>> data without sending it.
>>>>>> If I'm right, there would be considerable overhead from
>>>>>> m_defrag()/m_collapse(), and near disaster if they fail to fix
>>>>>> the problem and the data is silently dropped instead of xmitted.
>>>>>>
>>>>> I think the actual number of DMA segments allocated for the mbuf
>>>>> chain is determined by bus_dma(9). bus_dma(9) will coalesce the
>>>>> current segment with the previous segment if possible.
>>>>>
>>>> Ok, I'll have to take a look, but I thought that an array sized by
>>>> "num_segs" is passed in as an argument. (And num_segs is set to
>>>> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
>>>> It looked to me that the ixgbe driver called itself ix, so it isn't
>>>> obvious to me which one we are talking about. (I know that Daniel
>>>> Braniss had an ix0 and ix1, which were fixed for NFS by disabling
>>>> TSO.)
>>>>
>>> It's ix(4). ixgb(4) is a different driver.
>>>
>>>> I'll admit I mostly looked at virtio's network driver, since that
>>>> was the one being used by J David.
>>>>
>>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
>>>> cropping up for quite a while, and I am just trying to find out
>>>> why. (I have no hardware/software that exhibits the problem, so I
>>>> can only look at the sources and ask others to try testing stuff.)
>>>>
>>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4), but I
>>>>> see that the total length of all segments in ix(4) is limited to
>>>>> 65535, so it has no room for the ethernet/VLAN header of the mbuf
>>>>> chain. The driver should be fixed to transmit a 64KB datagram.
>>>> Well, if_hw_tsomax is set to 65535 by the generic code (the driver
>>>> doesn't set it) and the code in tcp_output() seems to subtract the
>>>> size of a tcp/ip header from that before passing data to the
>>>> driver, so I think the mbuf chain passed to the driver will fit in
>>>> one ip datagram. (I'd assume all sorts of stuff would break for TSO
>>>> enabled drivers if that wasn't the case?)
>>>
>>> I believe the generic code is doing the right thing. I'm under the
>>> impression that the non-working TSO indicates a bug in the driver.
>>> Some drivers didn't account for the additional ethernet/VLAN header,
>>> so the total size of the DMA segments exceeded 65535. I've attached
>>> a diff for ix(4). It wasn't tested at all, as I don't have hardware
>>> to test.
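
Pyun's diff was sent as an attachment and is not reproduced in this
message. Purely as an illustration of the idea being discussed, the
sketch below (hypothetical code, not the actual ix(4) change) caps the
advertised TSO size so that the Ethernet/VLAN header still fits under
the adapter's 65535-byte limit; the field name if_hw_tsomax comes from
Rick's description above, while example_set_tso_limit() is an invented
name.

/*
 * Hypothetical sketch only -- not the attached diff.  Reserve room for
 * an Ethernet + VLAN header inside the 65535-byte TSO limit, so the
 * total length of the DMA segments handed to the NIC (link header plus
 * TSO burst) cannot exceed what the hardware accepts.
 */
#include <sys/param.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/ethernet.h>	/* ETHER_HDR_LEN, ETHER_VLAN_ENCAP_LEN */
#include <netinet/in.h>
#include <netinet/ip.h>		/* IP_MAXPACKET (65535) */

/* invented helper name; something like this would run at attach time */
static void
example_set_tso_limit(struct ifnet *ifp)
{
	/*
	 * The generic code defaults if_hw_tsomax to 65535 (IP_MAXPACKET).
	 * Subtract the link-level header so header + TSO payload still
	 * fits in the adapter's total-length limit.
	 */
	ifp->if_hw_tsomax = IP_MAXPACKET -
	    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
}

Note that this only addresses the total-length limit; the 32-33 entry
scatter limit Rick describes above is a separate issue.
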
>>>
>>>>
>>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
>>>>> capable controllers are able to handle multiple TX buffers, so it
>>>>> should have used m_collapse(9) rather than copying the entire
>>>>> chain with m_defrag(9).
>>>>>
>>>> I haven't looked at these closely yet (plan on doing so to-day),
>>>> but even m_collapse() looked like it copied data between mbufs and
>>>> that is certainly suboptimal, imho. I don't see why a driver can't
>>>> split the mbuf list if there are too many entries for the
>>>> scatter/gather table and do it in two iterations (much like
>>>> tcp_output() does already, since the data length exceeds 65535 -
>>>> tcp/ip header size).
>>>>
>>> It can split the mbuf list if the controller supports an increased
>>> number of TX buffers. Because the controller will consume the same
>>> number of DMA descriptors for the mbuf list, drivers tend to impose
>>> a limit on the number of TX buffers to save resources.
>>>
>>>> However, at this point, I just want to find out if the long chain
>>>> of mbufs is why TSO is problematic for these drivers, since I'll
>>>> admit I'm getting tired of telling people to disable TSO (and I
>>>> suspect some don't believe me and never try it).
>>>>
>>> TSO capable controllers tend to have various limitations (the first
>>> TX buffer should contain the complete ethernet/IP/TCP headers, the
>>> ip_len of the IP header should be reset to 0, the TCP pseudo
>>> checksum should be recomputed, etc.), and cheap controllers need
>>> more assistance from the driver to let their firmware know the
>>> various IP/TCP header offset locations in the mbuf. Because this
>>> requires IP/TCP header parsing, it's error prone and very complex.
>>>
>>>>>> Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
>>>>>> clusters, so the mbuf count drops from 34 to 18.
>>>>>>
>>>>> Could we make it conditional on size?
>>>>>
>>>> Not sure what you mean? If you mean "the size of the read/write",
>>>> that would be possible for NFSv3, but less so for NFSv4. (The
>>>> read/write is just one Op. in the compound for NFSv4 and there is
>>>> no way to predict how much more data is going to be generated by
>>>> subsequent Ops.)
>>>>
>>> Sorry, I should have been clearer. You already answered my question.
>>> Thanks.
>>>
>>>> If by "size" you mean the amount of memory in the machine then,
>>>> yes, it certainly could be conditional on that. (I plan to try and
>>>> look at the allocator to-day as well, but if others know of
>>>> disadvantages with using MJUMPAGESIZE instead of MCLBYTES, please
>>>> speak up.)
>>>>
>>>> Garrett Wollman already alluded to the MCLBYTES case being
>>>> pre-allocated, but I'll admit I have no idea what the implications
>>>> of that are at this time.
>>>>
>>>>>> If anyone has a TSO scatter/gather enabled net interface and can
>>>>>> test this patch on it with NFS I/O (default of 64K rsize/wsize)
>>>>>> when TSO is enabled and see what effect it has, that would be
>>>>>> appreciated.
>>>>>>
>>>>>> Btw, thanks go to Garrett Wollman for suggesting the change to
>>>>>> MJUMPAGESIZE clusters.
>>>>>>
>>>>>> rick
>>>>>> ps: If the attachment doesn't make it through and you want the
>>>>>> patch, just email me and I'll send you a copy.
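
Rick's patch was likewise an attachment and is not included here. The
sketch below only illustrates the arithmetic behind the 34-to-18 drop
he mentions: backing a roughly 65680-byte reply with 4K page-size
(MJUMPAGESIZE) clusters instead of 2K (MCLBYTES) clusters roughly
halves the number of mbufs in the chain. example_alloc_reply_chain()
is a hypothetical helper, not a function from the NFS code or from the
patch itself.

/*
 * Illustrative sketch, not the attached patch.  A ~65680-byte reply
 * needs about 33 MCLBYTES (2K) clusters, i.e. a 34-mbuf chain once the
 * header mbuf is counted, but only about 17 MJUMPAGESIZE (4K) clusters,
 * i.e. an 18-mbuf chain -- consistent with the 34 -> 18 drop above.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/* hypothetical helper for illustration only */
static struct mbuf *
example_alloc_reply_chain(int len, int use_pagesize_clusters)
{
	struct mbuf *top, *tail, *m;
	int clsize, left;

	clsize = use_pagesize_clusters ? MJUMPAGESIZE : MCLBYTES;
	top = tail = NULL;
	for (left = len; left > 0; left -= m->m_len) {
		/* first mbuf in the chain carries the packet header */
		if (clsize == MCLBYTES)
			m = m_getcl(M_WAITOK, MT_DATA,
			    top == NULL ? M_PKTHDR : 0);
		else
			m = m_getjcl(M_WAITOK, MT_DATA,
			    top == NULL ? M_PKTHDR : 0, clsize);
		m->m_len = MIN(left, clsize);
		if (top == NULL)
			top = m;
		else
			tail->m_next = m;
		tail = m;
	}
	if (top != NULL)
		top->m_pkthdr.len = len;
	return (top);
}

Whether page-size clusters have drawbacks of their own (for example
versus the pre-allocated MCLBYTES pool Garrett Wollman alluded to) is
exactly the open question in the exchange above.
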