From owner-freebsd-net@FreeBSD.ORG Sun Feb  2 16:15:39 2014
Date: Sun, 2 Feb 2014 11:15:30 -0500 (EST)
From: Rick Macklem
To: Daniel Braniss
Cc: Pyun YongHyeon, FreeBSD Net, Adam McDougall, Jack Vogel
Message-ID: <906704123.1485103.1391357730899.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
List-Id: Networking and TCP/IP with FreeBSD

Daniel Braniss wrote:
> hi Rick, et al.
>
> tried your patch but it didn't help, the server is stuck.
>
Oh well. I was hoping that was going to make TSO work reliably.
Just to confirm it, this server works reliably when TSO is disabled?

Thanks for doing the testing, rick

> just for fun, I tried a different client/host, this one has a
> Broadcom NetXtreme II that was MFC'ed lately, and the results are
> worse than the Intel (5 hrs instead of 4 hrs), but faster without
> TSO.
>
> with TSO enabled and bs=32k:
> 5.09 hrs	18325.62 real  1109.23 user  4591.60 sys
>
> without TSO:
> 4.75 hrs	17120.40 real  1114.08 user  3537.61 sys
>
> So what is the advantage of using TSO? (no complaint here, just
> curious)
>
> I'll try to see if, as a server, it has the same TSO-related issues.
>
> cheers,
> 	danny
>
> On Jan 28, 2014, at 3:51 AM, Rick Macklem wrote:
>
> > Jack Vogel wrote:
> >> That header file is for the VF driver :) which I don't believe is
> >> being used in this case.
> >> The driver is capable of handling 256K but it's limited by the
> >> stack to 64K (look in ixgbe.h), so it's not a few bytes off due
> >> to the vlan header.
> >>
> >> The scatter size is not an arbitrary one, it's due to hardware
> >> limitations in Niantic (82599). Turning off TSO in the 10G
> >> environment is not practical, you will have trouble getting good
> >> performance.
> >>
> >> Jack
> >>
> > Well, if you look at this thread, Daniel got much better
> > performance by turning off TSO. However, I agree that this is not
> > an ideal solution.
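
For anyone who wants to reproduce the comparison: TSO can be toggled
per interface at run time, so no rebuild is needed. The interface name
below is just an example; use whichever one your NIC attaches as.

    # disable TSO on one interface for testing
    ifconfig ix0 -tso
    # re-enable it afterwards
    ifconfig ix0 tso
    # or disable TSO globally in the TCP stack
    sysctl net.inet.tcp.tso=0

To keep the per-interface setting across reboots, "-tso" can be
appended to the corresponding ifconfig_<ifname> line in /etc/rc.conf.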
> > http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
> >
> > rick
> >
> >>
> >> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN wrote:
> >>
> >>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> >>>> pyunyh@gmail.com wrote:
> >>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> >>>>>> Adam McDougall wrote:
> >>>>>>> Also try rsize=32768,wsize=32768 in your mount options; it
> >>>>>>> made a huge difference for me. I've noticed slow file
> >>>>>>> transfers on NFS in 9 and finally did some searching a
> >>>>>>> couple months ago; someone suggested it and they were on to
> >>>>>>> something.
> >>>>>>>
> >>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
> >>>>>> perform poorly in some network environments.
> >>>>>> A 64K NFS read reply/write request consists of a list of 34
> >>>>>> mbufs when passed to TCP via sosend() and a total data length
> >>>>>> of around 65680 bytes.
> >>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem
> >>>>>> to expect no more than 32-33 mbufs in a list for a 65535 byte
> >>>>>> TSO xmit. I think (I don't have anything that does TSO to
> >>>>>> confirm this) that NFS will pass a list that is longer (34
> >>>>>> plus a TCP/IP header).
> >>>>>> At a glance, it appears that the drivers call m_defrag() or
> >>>>>> m_collapse() when the mbuf list won't fit in their scatter
> >>>>>> table (32 or 33 elements) and, if this fails, just silently
> >>>>>> drop the data without sending it.
> >>>>>> If I'm right, there would be considerable overhead from
> >>>>>> m_defrag()/m_collapse() and near disaster if they fail to fix
> >>>>>> the problem and the data is silently dropped instead of
> >>>>>> xmited.
> >>>>>>
> >>>>>
> >>>>> I think the actual number of DMA segments allocated for the
> >>>>> mbuf chain is determined by bus_dma(9). bus_dma(9) will
> >>>>> coalesce the current segment with the previous segment if
> >>>>> possible.
> >>>>>
> >>>> Ok, I'll have to take a look, but I thought that an array sized
> >>>> by "num_segs" is passed in as an argument. (And num_segs is set
> >>>> to either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER
> >>>> (32).)
> >>>> It looked to me that the ixgbe driver called itself ix, so it
> >>>> isn't obvious to me which we are talking about. (I know that
> >>>> Daniel Braniss had an ix0 and ix1, which were fixed for NFS by
> >>>> disabling TSO.)
> >>>>
> >>>
> >>> It's ix(4). ixgbe(4) is a different driver.
> >>>
> >>>> I'll admit I mostly looked at virtio's network driver, since
> >>>> that was the one being used by J David.
> >>>>
> >>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
> >>>> been cropping up for quite a while, and I am just trying to
> >>>> find out why. (I have no hardware/software that exhibits the
> >>>> problem, so I can only look at the sources and ask others to
> >>>> try testing stuff.)
> >>>>
> >>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4),
> >>>>> but I see the total length of all segment sizes in ix(4) is
> >>>>> 65535, so there is no room for the ethernet/VLAN header of the
> >>>>> mbuf chain. The driver should be fixed to transmit a 64KB
> >>>>> datagram.
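
To make the failure mode concrete, here is a rough sketch of the
transmit-path pattern being described. It is not taken from ix(4),
virtio or any other real driver; the xx_ names, the softc layout and
XX_MAX_SCATTER are made up, only the mbuf(9) and bus_dma(9) calls are
real:

    /*
     * Sketch of the common "map, defrag on EFBIG, drop on failure"
     * pattern.  Assumes a per-queue DMA tag/map in the (hypothetical)
     * softc.
     */
    static int
    xx_encap(struct xx_softc *sc, struct mbuf **m_head)
    {
        bus_dma_segment_t segs[XX_MAX_SCATTER]; /* e.g. 32 on 82599 */
        struct mbuf *m;
        int error, nsegs;

        error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag, sc->xx_tx_map,
            *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG) {
            /*
             * The chain has more fragments than the scatter table can
             * hold.  m_defrag() copies the whole chain into fresh
             * clusters; some drivers use
             * m_collapse(*m_head, M_NOWAIT, XX_MAX_SCATTER) instead,
             * which only merges adjacent mbufs.
             */
            m = m_defrag(*m_head, M_NOWAIT);
            if (m == NULL) {
                m_freem(*m_head);
                *m_head = NULL;
                return (ENOBUFS);   /* packet silently dropped */
            }
            *m_head = m;
            error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag,
                sc->xx_tx_map, *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        }
        if (error != 0) {
            m_freem(*m_head);
            *m_head = NULL;
            return (error);
        }
        /* ... hand segs[0 .. nsegs-1] to the hardware ... */
        return (0);
    }

If the retry still fails, or m_defrag() cannot allocate a replacement
chain, the data never reaches the wire and TCP only recovers via
retransmission, which would explain both the extra overhead and the
stalls when it goes wrong.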
> >>>> Well, if_hw_tsomax is set to 65535 by the generic code (the
> >>>> driver doesn't set it) and the code in tcp_output() seems to
> >>>> subtract the size of a tcp/ip header from that before passing
> >>>> data to the driver, so I think the mbuf chain passed to the
> >>>> driver will fit in one ip datagram. (I'd assume all sorts of
> >>>> stuff would break for TSO enabled drivers if that wasn't the
> >>>> case?)
> >>>
> >>> I believe the generic code is doing the right thing. I'm under
> >>> the impression the non-working TSO indicates a bug in the
> >>> driver. Some drivers didn't account for the additional
> >>> ethernet/VLAN header, so the total size of the DMA segments
> >>> exceeded 65535. I've attached a diff for ix(4). It wasn't tested
> >>> at all, as I don't have hardware to test.
> >>>
> >>>>
> >>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> >>>>> capable controllers are able to handle multiple TX buffers, so
> >>>>> it should have used m_collapse(9) rather than copying the
> >>>>> entire chain with m_defrag(9).
> >>>>>
> >>>> I haven't looked at these closely yet (plan on doing so
> >>>> to-day), but even m_collapse() looked like it copied data
> >>>> between mbufs and that is certainly suboptimal, imho. I don't
> >>>> see why a driver can't split the mbuf list, if there are too
> >>>> many entries for the scatter/gather, and do it in two
> >>>> iterations (much like tcp_output() does already, since the data
> >>>> length exceeds 65535 - tcp/ip header size).
> >>>>
> >>>
> >>> It can split the mbuf list if the controller supports an
> >>> increased number of TX buffers. Because the controller will
> >>> consume the same number of DMA descriptors for the mbuf list,
> >>> drivers tend to impose a limit on the number of TX buffers to
> >>> save resources.
> >>>
> >>>> However, at this point, I just want to find out if the long
> >>>> chain of mbufs is why TSO is problematic for these drivers,
> >>>> since I'll admit I'm getting tired of telling people to disable
> >>>> TSO (and I suspect some don't believe me and never try it).
> >>>>
> >>>
> >>> TSO capable controllers tend to have various limitations (the
> >>> first TX buffer should have the complete ethernet/IP/TCP header,
> >>> ip_len of the IP header should be reset to 0, the TCP pseudo
> >>> checksum should be recomputed, etc.) and cheap controllers need
> >>> more assistance from the driver to let their firmware know the
> >>> various IP/TCP header offset locations in the mbuf. Because this
> >>> requires IP/TCP header parsing, it's error prone and very
> >>> complex.
> >>>
> >>>>>> Anyhow, I have attached a patch that makes NFS use
> >>>>>> MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18.
> >>>>>>
> >>>>>
> >>>>> Could we make it conditional on size?
> >>>>>
> >>>> Not sure what you mean? If you mean "the size of the
> >>>> read/write", that would be possible for NFSv3, but less so for
> >>>> NFSv4. (The read/write is just one Op. in the compound for
> >>>> NFSv4 and there is no way to predict how much more data is
> >>>> going to be generated by subsequent Ops.)
> >>>>
> >>>
> >>> Sorry, I should have been clearer. You already answered my
> >>> question. Thanks.
> >>>
> >>>> If by "size" you mean the amount of memory in the machine then,
> >>>> yes, it certainly could be conditional on that.
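
Regarding the 34 -> 18 mbuf count mentioned above, the numbers fall
out of simple arithmetic. A throwaway userland check (assuming the 64K
of file data rides in full clusters plus roughly two small mbufs for
the RPC header; purely illustrative):

    #include <stdio.h>

    int
    main(void)
    {
        const int data = 64 * 1024;     /* 64K NFS read/write */
        const int hdr_mbufs = 2;        /* RPC/NFS header mbufs, roughly */

        /* MCLBYTES (2K) vs. MJUMPAGESIZE (4K on 4K-page machines) */
        printf("2K clusters: %d + %d = %d mbufs\n",
            data / 2048, hdr_mbufs, data / 2048 + hdr_mbufs);
        printf("4K clusters: %d + %d = %d mbufs\n",
            data / 4096, hdr_mbufs, data / 4096 + hdr_mbufs);
        return (0);
    }

That prints 32 + 2 = 34 and 16 + 2 = 18. With 2K clusters the chain,
plus the TCP/IP header mbuf prepended by tcp_output(), won't fit a
32-entry scatter table (modulo the bus_dma(9) coalescing mentioned
earlier), while with 4K page-size clusters it fits comfortably.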
> >>>> (I plan to try and look at the allocator to-day as well, but
> >>>> if others know of disadvantages with using MJUMPAGESIZE instead
> >>>> of MCLBYTES, please speak up.)
> >>>>
> >>>> Garrett Wollman already alluded to the MCLBYTES case being
> >>>> pre-allocated, but I'll admit I have no idea what the
> >>>> implications of that are at this time.
> >>>>
> >>>>>> If anyone has a TSO scatter/gather enabled net interface and
> >>>>>> can test this patch on it with NFS I/O (default of 64K
> >>>>>> rsize/wsize) when TSO is enabled and see what effect it has,
> >>>>>> that would be appreciated.
> >>>>>>
> >>>>>> Btw, thanks go to Garrett Wollman for suggesting the change
> >>>>>> to MJUMPAGESIZE clusters.
> >>>>>>
> >>>>>> rick
> >>>>>> ps: If the attachment doesn't make it through and you want
> >>>>>> the patch, just email me and I'll send you a copy.
> >>>>>>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe@freebsd.org"
>