Date: Mon, 27 Jan 2014 17:15:29 -0800
From: Jack Vogel <jfvogel@gmail.com>
To: Pyun YongHyeon <pyunyh@gmail.com>
Cc: Daniel Braniss <danny@cs.huji.ac.il>, FreeBSD Net <freebsd-net@freebsd.org>, Adam McDougall <mcdouga9@egr.msu.edu>, Rick Macklem <rmacklem@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <CAFOYbcndfNwTbNdcep4fgeYGiFqXd0Wp6cCFVJgS_mAkv3TwWw@mail.gmail.com>
In-Reply-To: <20140128005818.GB2722@michelle.cdnetworks.com>
References: <20140127055047.GA1368@michelle.cdnetworks.com> <1168237133.17228249.1390865239175.JavaMail.root@uoguelph.ca> <20140128005818.GB2722@michelle.cdnetworks.com>
That header file is for the VF driver :) which I don't believe is being
used in this case.

The driver is capable of handling 256K, but it's limited by the stack to
64K (look in ixgbe.h), so it's not a few bytes off due to the VLAN header.

The scatter size is not an arbitrary one; it's due to hardware limitations
in Niantic (82599).

Turning off TSO in a 10G environment is not practical; you will have
trouble getting good performance.

Jack

On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh@gmail.com> wrote:

> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > pyunyh@gmail.com wrote:
> > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > Adam McDougall wrote:
> > > > > Also try rsize=32768,wsize=32768 in your mount options, made a
> > > > > huge difference for me. I've noticed slow file transfers on NFS
> > > > > in 9 and finally did some searching a couple months ago, someone
> > > > > suggested it and they were on to something.
> > > > >
> > > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > > perform poorly for some network environments.
> > > > A 64K NFS read reply/write request consists of a list of 34 mbufs
> > > > when passed to TCP via sosend() and a total data length of around
> > > > 65680 bytes.
> > > > Looking at a couple of drivers (virtio and ixgbe), they seem to
> > > > expect no more than 32-33 mbufs in a list for a 65535 byte TSO
> > > > xmit. I think (I don't have anything that does TSO to confirm
> > > > this) that NFS will pass a list that is longer (34 plus a TCP/IP
> > > > header).
> > > > At a glance, it appears that the drivers call m_defrag() or
> > > > m_collapse() when the mbuf list won't fit in their scatter table
> > > > (32 or 33 elements) and if this fails, just silently drop the
> > > > data without sending it.
> > > > If I'm right, there would be considerable overhead from
> > > > m_defrag()/m_collapse() and near disaster if they fail to fix
> > > > the problem and the data is silently dropped instead of xmited.
> > > >
> > >
> > > I think the actual number of DMA segments allocated for the mbuf
> > > chain is determined by bus_dma(9). bus_dma(9) will coalesce
> > > current segment with previous segment if possible.
> > >
> > Ok, I'll have to take a look, but I thought that an array sized
> > by "num_segs" is passed in as an argument. (And num_segs is set to
> > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> > It looked to me that the ixgbe driver called itself ix, so it isn't
> > obvious to me which we are talking about. (I know that Daniel Braniss
> > had an ix0 and ix1, which were fixed for NFS by disabling TSO.)
> >
>
> It's ix(4). ixbge(4) is a different driver.
>
> > I'll admit I mostly looked at virtio's network driver, since that
> > was the one being used by J David.
> >
> > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
> > cropping up for quite a while, and I am just trying to find out why.
> > (I have no hardware/software that exhibits the problem, so I can
> > only look at the sources and ask others to try testing stuff.)
> >
> > > I'm not sure whether you're referring to ixgbe(4) or ix(4) but I
> > > see the total length of all segment size of ix(4) is 65535 so
> > > it has no room for ethernet/VLAN header of the mbuf chain. The
> > > driver should be fixed to transmit a 64KB datagram.
> > Well, if_hw_tsomax is set to 65535 by the generic code (the driver
> > doesn't set it) and the code in tcp_output() seems to subtract the
> > size of a tcp/ip header from that before passing data to the driver,
> > so I think the mbuf chain passed to the driver will fit in one
> > ip datagram. (I'd assume all sorts of stuff would break for TSO
> > enabled drivers if that wasn't the case?)
>
> I believe the generic code is doing right. I'm under the
> impression the non-working TSO indicates a bug in the driver. Some
> drivers didn't account for the additional ethernet/VLAN header so the
> total size of DMA segments exceeded 65535. I've attached a diff
> for ix(4). It wasn't tested at all as I don't have hardware to
> test.
>
> > >
> > > I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> > > capable controllers are able to handle multiple TX buffers so it
> > > should have used m_collapse(9) rather than copying entire chain
> > > with m_defrag(9).
> > >
> > I haven't looked at these closely yet (plan on doing so to-day), but
> > even m_collapse() looked like it copied data between mbufs and that
> > is certainly suboptimal, imho. I don't see why a driver can't split
> > the mbuf list, if there are too many entries for the scatter/gather
> > and do it in two iterations (much like tcp_output() does already,
> > since the data length exceeds 65535 - tcp/ip header size).
> >
>
> It can split the mbuf list if the controller supports an increased
> number of TX buffers. Because the controller shall consume the same
> number of DMA descriptors for the mbuf list, drivers tend to impose a
> limit on the number of TX buffers to save resources.
>
> > However, at this point, I just want to find out if the long chain
> > of mbufs is why TSO is problematic for these drivers, since I'll
> > admit I'm getting tired of telling people to disable TSO (and I
> > suspect some don't believe me and never try it).
> >
>
> TSO capable controllers tend to have various limitations (the first
> TX buffer should have complete ethernet/IP/TCP header, ip_len of IP
> header should be reset to 0, TCP pseudo checksum should be
> recomputed etc) and cheap controllers need more assistance from the
> driver to let their firmware know the various IP/TCP header offset
> locations in the mbuf. Because this requires IP/TCP header
> parsing, it's error prone and very complex.
>
> > > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
> > > > clusters, so the mbuf count drops from 34 to 18.
> > > >
> > >
> > > Could we make it conditional on size?
> > >
> > Not sure what you mean? If you mean "the size of the read/write",
> > that would be possible for NFSv3, but less so for NFSv4. (The read/write
> > is just one Op. in the compound for NFSv4 and there is no way to
> > predict how much more data is going to be generated by subsequent Ops.)
> >
>
> Sorry, I should have been clearer. You already answered my
> question. Thanks.
>
> > If by "size" you mean amount of memory in the machine then, yes, it
> > certainly could be conditional on that. (I plan to try and look at
> > the allocator to-day as well, but if others know of disadvantages with
> > using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
> >
> > Garrett Wollman already alluded to the MCLBYTES case being pre-allocated,
> > but I'll admit I have no idea what the implications of that are at this
> > time.
> >
> > > > If anyone has a TSO scatter/gather enabled net interface and can
> > > > test this patch on it with NFS I/O (default of 64K rsize/wsize)
> > > > when TSO is enabled and see what effect it has, that would be
> > > > appreciated.
> > > >
> > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > MJUMPAGESIZE clusters.
> > > >
> > > > rick
> > > > ps: If the attachment doesn't make it through and you want the
> > > > patch, just email me and I'll send you a copy.
> > > >
> > >
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"
>
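
The mbuf counts quoted in the thread (34 with the default clusters, 18 with
MJUMPAGESIZE) and the scatter limits (32 for 82599, 100 for 82598) can be
checked with some quick arithmetic. The Python sketch below is illustrative
only and is not taken from the NFS or ixgbe sources. It assumes MCLBYTES is
2048, that MJUMPAGESIZE equals a 4K page, and that each 64K NFS read/write
carries two extra mbufs for RPC/NFS header and trailer data; that last value
is a guess chosen so the totals match the numbers given in the thread. The
function names are hypothetical.

```python
import math

# Cluster sizes: MCLBYTES is the standard 2K mbuf cluster; MJUMPAGESIZE is
# assumed here to be one 4K page.
MCLBYTES = 2048
MJUMPAGESIZE = 4096

# Scatter limits as quoted in the thread.
IXGBE_82598_SCATTER = 100
IXGBE_82599_SCATTER = 32

NFS_IO_SIZE = 64 * 1024   # default 64K rsize/wsize
EXTRA_MBUFS = 2           # assumption: mbufs for RPC/NFS header and trailer


def chain_length(data_len, cluster_size, extra=EXTRA_MBUFS):
    """Estimate how many mbufs a request needs: data clusters plus extras."""
    return math.ceil(data_len / cluster_size) + extra


def fits(chain_len, scatter_limit):
    """True if the chain fits the scatter table without m_defrag/m_collapse."""
    return chain_len <= scatter_limit


if __name__ == "__main__":
    for name, size in (("MCLBYTES", MCLBYTES), ("MJUMPAGESIZE", MJUMPAGESIZE)):
        n = chain_length(NFS_IO_SIZE, size)
        print(name, n, "mbufs, fits 82599 limit:", fits(n, IXGBE_82599_SCATTER))
```

Under these assumptions the 2K-cluster chain comes to 34 mbufs, which is over
the 32-entry limit before any TCP/IP header mbuf is added, while the 4K-cluster
chain comes to 18 and fits comfortably. Any coalescing done by bus_dma(9) is
not modelled.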
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAFOYbcndfNwTbNdcep4fgeYGiFqXd0Wp6cCFVJgS_mAkv3TwWw>