Date: Mon, 27 Jan 2014 20:51:12 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Jack Vogel <jfvogel@gmail.com>
Cc: Daniel Braniss <danny@cs.huji.ac.il>, FreeBSD Net <freebsd-net@freebsd.org>, Adam McDougall <mcdouga9@egr.msu.edu>, Pyun YongHyeon <pyunyh@gmail.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAFOYbcndfNwTbNdcep4fgeYGiFqXd0Wp6cCFVJgS_mAkv3TwWw@mail.gmail.com>
Jack Vogel wrote:
> That header file is for the VF driver :) which I don't believe is being
> used in this case.
> The driver is capable of handling 256K but it is limited by the stack to
> 64K (look in ixgbe.h), so it's not a few bytes off due to the vlan header.
>
> The scatter size is not an arbitrary one, it's due to hardware
> limitations in Niantic (82599). Turning off TSO in the 10G environment
> is not practical; you will have trouble getting good performance.
>
> Jack
>
Well, if you look at this thread, Daniel got much better performance by
turning off TSO. However, I agree that this is not an ideal solution.
http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B

rick

> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh@gmail.com> wrote:
>
> > On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > > pyunyh@gmail.com wrote:
> > > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > > Adam McDougall wrote:
> > > > > > Also try rsize=32768,wsize=32768 in your mount options; it
> > > > > > made a huge difference for me. I've noticed slow file
> > > > > > transfers on NFS in 9 and finally did some searching a couple
> > > > > > of months ago; someone suggested it and they were on to
> > > > > > something.
> > > > > >
> > > > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > > > perform poorly for some network environments.
> > > > > A 64K NFS read reply/write request consists of a list of 34
> > > > > mbufs when passed to TCP via sosend() and a total data length of
> > > > > around 65680 bytes. Looking at a couple of drivers (virtio and
> > > > > ixgbe), they seem to expect no more than 32-33 mbufs in a list
> > > > > for a 65535 byte TSO xmit. I think (I don't have anything that
> > > > > does TSO to confirm this) that NFS will pass a list that is
> > > > > longer (34 plus a TCP/IP header).
> > > > > At a glance, it appears that the drivers call m_defrag() or
> > > > > m_collapse() when the mbuf list won't fit in their scatter table
> > > > > (32 or 33 elements) and, if this fails, just silently drop the
> > > > > data without sending it.
> > > > > If I'm right, there would be considerable overhead from
> > > > > m_defrag()/m_collapse() and near disaster if they fail to fix
> > > > > the problem and the data is silently dropped instead of xmited.
> > > > >
> > > > I think the actual number of DMA segments allocated for the mbuf
> > > > chain is determined by bus_dma(9). bus_dma(9) will coalesce the
> > > > current segment with the previous segment if possible.
> > > >
> > > Ok, I'll have to take a look, but I thought that an array sized by
> > > "num_segs" is passed in as an argument. (And num_segs is set to
> > > either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
> > > It looked to me like the ixgbe driver called itself ix, so it isn't
> > > obvious to me which we are talking about. (I know that Daniel
> > > Braniss had an ix0 and ix1, which were fixed for NFS by disabling
> > > TSO.)
> > >
> > It's ix(4). ixgbe(4) is a different driver.
> >
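For reference, the transmit-path pattern being described here looks
roughly like the sketch below. It is not the actual ix(4)/ixgbe or
virtio code; the names (example_xmit, EXAMPLE_MAX_SCATTER,
example_softc) are invented for illustration, and the DMA tag is
assumed to have been created with nsegments set to EXAMPLE_MAX_SCATTER.
The chain is DMA-mapped against a fixed-size segment array, EFBIG
triggers an m_defrag()/m_collapse() retry, and a second failure frees
the chain, so from TCP's point of view the data simply vanishes:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>
    #include <sys/bus.h>
    #include <machine/bus.h>

    #define EXAMPLE_MAX_SCATTER  32    /* cf. IXGBE_82599_SCATTER */

    struct example_softc {
            bus_dma_tag_t   tx_tag;    /* created with nsegments = EXAMPLE_MAX_SCATTER */
            bus_dmamap_t    tx_map;
    };

    static int
    example_xmit(struct example_softc *sc, struct mbuf **m_headp)
    {
            bus_dma_segment_t segs[EXAMPLE_MAX_SCATTER];
            struct mbuf *m;
            int error, nsegs;

            /* Map the chain; fails with EFBIG if it needs more segments than the tag allows. */
            error = bus_dmamap_load_mbuf_sg(sc->tx_tag, sc->tx_map, *m_headp,
                segs, &nsegs, BUS_DMA_NOWAIT);
            if (error == EFBIG) {
                    /* Too many mbufs: copy/compact the chain and retry once. */
                    m = m_defrag(*m_headp, M_NOWAIT);   /* some drivers use m_collapse() */
                    if (m == NULL) {
                            m_freem(*m_headp);
                            *m_headp = NULL;
                            return (ENOBUFS);   /* chain dropped without being sent */
                    }
                    *m_headp = m;
                    error = bus_dmamap_load_mbuf_sg(sc->tx_tag, sc->tx_map,
                        *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
            }
            if (error != 0) {
                    m_freem(*m_headp);
                    *m_headp = NULL;
                    return (error);
            }
            /* ... program nsegs TX descriptors from segs[] and kick the hardware ... */
            return (0);
    }

m_defrag() copies the whole chain into a fresh chain of clusters, while
m_collapse() only merges adjacent mbufs until the chain fits; that
trade-off comes up again further down the thread.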
> > > I'll admit I mostly looked at virtio's network driver, since that
> > > was the one being used by J David.
> > >
> > > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
> > > cropping up for quite a while, and I am just trying to find out why.
> > > (I have no hardware/software that exhibits the problem, so I can
> > > only look at the sources and ask others to try testing stuff.)
> > >
> > > > I'm not sure whether you're referring to ixgbe(4) or ix(4), but I
> > > > see the total length of all segment sizes of ix(4) is 65535, so it
> > > > has no room for the ethernet/VLAN header of the mbuf chain. The
> > > > driver should be fixed to transmit a 64KB datagram.
> > > Well, if_hw_tsomax is set to 65535 by the generic code (the driver
> > > doesn't set it) and the code in tcp_output() seems to subtract the
> > > size of a tcp/ip header from that before passing data to the driver,
> > > so I think the mbuf chain passed to the driver will fit in one ip
> > > datagram. (I'd assume all sorts of stuff would break for TSO enabled
> > > drivers if that wasn't the case?)
> >
> > I believe the generic code is doing the right thing. I'm under the
> > impression that the non-working TSO indicates a bug in the driver.
> > Some drivers didn't account for the additional ethernet/VLAN header,
> > so the total size of the DMA segments exceeded 65535. I've attached a
> > diff for ix(4). It wasn't tested at all as I don't have hardware to
> > test.
> >
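Pyun's diff was attached to his message and is not reproduced in this
archive, so the fragment below is only a rough illustration of the kind
of accounting he describes; the attached fix may well take a different
approach (for example, lowering if_hw_tsomax rather than enlarging the
tag). All names here (EXAMPLE_*, example_softc, example_create_tx_tag)
are invented. The point is simply that the TX DMA tag has to cover the
TSO datagram plus the Ethernet/VLAN header that gets prepended to it:

    #include <sys/param.h>
    #include <sys/bus.h>
    #include <machine/bus.h>
    #include <net/ethernet.h>

    #define EXAMPLE_MAX_SCATTER  32

    /* 65535-byte TSO datagram plus the prepended L2 header (14 + 4 bytes). */
    #define EXAMPLE_TSO_SIZE     (65535 + ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN)

    struct example_softc {
            bus_dma_tag_t   tx_tag;
    };

    static int
    example_create_tx_tag(device_t dev, struct example_softc *sc)
    {
            /*
             * If maxsize were only 65535, mapping a full TSO chain
             * (datagram plus ethernet/VLAN header) could fail, which
             * matches the symptom described above.
             */
            return (bus_dma_tag_create(bus_get_dma_tag(dev),
                1, 0,                   /* alignment, boundary */
                BUS_SPACE_MAXADDR,      /* lowaddr */
                BUS_SPACE_MAXADDR,      /* highaddr */
                NULL, NULL,             /* filter, filterarg */
                EXAMPLE_TSO_SIZE,       /* maxsize */
                EXAMPLE_MAX_SCATTER,    /* nsegments */
                PAGE_SIZE,              /* maxsegsize */
                0,                      /* flags */
                NULL, NULL,             /* lockfunc, lockfuncarg */
                &sc->tx_tag));
    }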
> > > > I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> > > > capable controllers are able to handle multiple TX buffers, so it
> > > > should have used m_collapse(9) rather than copying the entire
> > > > chain with m_defrag(9).
> > > >
> > > I haven't looked at these closely yet (plan on doing so to-day), but
> > > even m_collapse() looked like it copied data between mbufs and that
> > > is certainly suboptimal, imho. I don't see why a driver can't split
> > > the mbuf list, if there are too many entries for the scatter/gather,
> > > and do it in two iterations (much like tcp_output() does already,
> > > since the data length exceeds 65535 - tcp/ip header size).
> > >
> > It can split the mbuf list if the controller supports an increased
> > number of TX buffers. Because the controller will consume the same
> > number of DMA descriptors for the mbuf list, drivers tend to impose a
> > limit on the number of TX buffers to save resources.
> >
> > > However, at this point, I just want to find out if the long chain of
> > > mbufs is why TSO is problematic for these drivers, since I'll admit
> > > I'm getting tired of telling people to disable TSO (and I suspect
> > > some don't believe me and never try it).
> > >
> > TSO capable controllers tend to have various limitations (the first TX
> > buffer should have the complete ethernet/IP/TCP header, ip_len of the
> > IP header should be reset to 0, the TCP pseudo checksum should be
> > recomputed, etc.) and cheap controllers need more assistance from the
> > driver to let their firmware know the various IP/TCP header offset
> > locations in the mbuf. Because this requires IP/TCP header parsing,
> > it's error prone and very complex.
> >
> > > > > Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
> > > > > clusters, so the mbuf count drops from 34 to 18.
> > > > >
> > > > Could we make it conditional on size?
> > > >
> > > Not sure what you mean? If you mean "the size of the read/write",
> > > that would be possible for NFSv3, but less so for NFSv4. (The
> > > read/write is just one Op. in the compound for NFSv4 and there is no
> > > way to predict how much more data is going to be generated by
> > > subsequent Ops.)
> > >
> > Sorry, I should have been clearer. You already answered my question.
> > Thanks.
> >
> > > If by "size" you mean the amount of memory in the machine then, yes,
> > > it certainly could be conditional on that. (I plan to try and look
> > > at the allocator to-day as well, but if others know of disadvantages
> > > with using MJUMPAGESIZE instead of MCLBYTES, please speak up.)
> > >
> > > Garrett Wollman already alluded to the MCLBYTES case being
> > > pre-allocated, but I'll admit I have no idea what the implications
> > > of that are at this time.
> > >
> > > > > If anyone has a TSO scatter/gather enabled net interface and can
> > > > > test this patch on it with NFS I/O (default of 64K rsize/wsize)
> > > > > when TSO is enabled and see what effect it has, that would be
> > > > > appreciated.
> > > > >
> > > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > > MJUMPAGESIZE clusters.
> > > > >
> > > > > rick
> > > > > ps: If the attachment doesn't make it through and you want the
> > > > > patch, just email me and I'll send you a copy.
> > > > >
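Rick's patch was likewise attached and isn't reproduced here. The
arithmetic behind the 34-to-18 drop, and the allocator difference it
relies on, can be sketched as follows (m_getcl()/m_getjcl() are the
stock mbuf cluster allocators; the surrounding function name is
invented and a 4K PAGE_SIZE is assumed):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>

    /*
     * 64K of NFS data in 2K (MCLBYTES) clusters: 65536 / 2048 = 32 data
     * mbufs, plus a couple more for the RPC/NFS headers, i.e. roughly
     * the 34-mbuf chain mentioned above.  With 4K (MJUMPAGESIZE ==
     * PAGE_SIZE) clusters that becomes 65536 / 4096 = 16 data mbufs,
     * roughly 18 in total, which fits inside a 32-33 entry scatter
     * table without defragmenting.
     */
    static struct mbuf *
    example_get_cluster(int use_jumbo)
    {
            if (use_jumbo)
                    return (m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE));
            return (m_getcl(M_WAITOK, MT_DATA, 0));
    }

Whether MJUMPAGESIZE clusters are as cheap to allocate as the
MCLBYTES zone is the question Garrett Wollman's comment about
pre-allocation raises above.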