Date:      Mon, 27 Jan 2014 20:51:12 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Jack Vogel <jfvogel@gmail.com>
Cc:        Daniel Braniss <danny@cs.huji.ac.il>, FreeBSD Net <freebsd-net@freebsd.org>, Adam McDougall <mcdouga9@egr.msu.edu>, Pyun YongHyeon <pyunyh@gmail.com>
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAFOYbcndfNwTbNdcep4fgeYGiFqXd0Wp6cCFVJgS_mAkv3TwWw@mail.gmail.com>

Jack Vogel wrote:
> That header file is for the VF driver :) which I don't believe is
> being used in this case.
> The driver is capable of handling 256K, but it's limited by the
> stack to 64K (look in ixgbe.h), so it's not a few bytes off due to
> the vlan header.
> 
> The scatter size is not an arbitrary one; it's due to hardware
> limitations in Niantic (82599).  Turning off TSO in the 10G
> environment is not practical; you will have trouble getting good
> performance.
> 
> Jack
> 
Well, if you look at this thread, Daniel got much better performance
by turning off TSO. However, I agree that this is not an ideal solution.
http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B

rick

> 
> 
> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN <pyunyh@gmail.com>
> wrote:
> 
> > On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> > > pyunyh@gmail.com wrote:
> > > > On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> > > > > Adam McDougall wrote:
> > > > > > Also try rsize=32768,wsize=32768 in your mount options; it
> > > > > > made a huge difference for me.  I've noticed slow file
> > > > > > transfers on NFS in 9 and finally did some searching a
> > > > > > couple of months ago; someone suggested it and they were
> > > > > > on to something.
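> > > > > > For example, on the client (server:/export and /mnt are
> > > > > > placeholders):
> > > > > >
> > > > > >   mount -t nfs -o rsize=32768,wsize=32768 server:/export /mnt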
> > > > > >
> > > > > I have a "hunch" that might explain why 64K NFS reads/writes
> > > > > perform poorly in some network environments.
> > > > > A 64K NFS read reply/write request consists of a list of 34
> > > > > mbufs when passed to TCP via sosend(), with a total data
> > > > > length of around 65680 bytes.
> > > > > Looking at a couple of drivers (virtio and ixgbe), they seem
> > > > > to expect no more than 32-33 mbufs in a list for a 65535 byte
> > > > > TSO xmit. I think (I don't have anything that does TSO to
> > > > > confirm this) that NFS will pass a list that is longer (34
> > > > > plus a TCP/IP header).
> > > > > At a glance, it appears that the drivers call m_defrag() or
> > > > > m_collapse() when the mbuf list won't fit in their scatter
> > > > > table (32 or 33 elements) and, if this fails, just silently
> > > > > drop the data without sending it.
> > > > > If I'm right, there would be considerable overhead from
> > > > > m_defrag()/m_collapse(), and near disaster if they fail to
> > > > > fix the problem and the data is silently dropped instead of
> > > > > xmitted.
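> > > > > Roughly, the pattern I'm seeing looks like this (a sketch of
> > > > > the common driver code, not any one driver verbatim; the
> > > > > XXX_MAX_SCATTER name is made up, and sc/map/m_headp are as
> > > > > in a typical if_start path):
> > > > >
> > > > >     bus_dma_segment_t segs[XXX_MAX_SCATTER];  /* 32-33 */
> > > > >     int error, nsegs;
> > > > >
> > > > >     error = bus_dmamap_load_mbuf_sg(sc->txtag, map,
> > > > >         *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
> > > > >     if (error == EFBIG) {
> > > > >         /* too many mbufs: copy the chain into fewer,
> > > > >          * larger mbufs and retry the mapping */
> > > > >         struct mbuf *m = m_defrag(*m_headp, M_NOWAIT);
> > > > >         if (m == NULL) {
> > > > >             /* the "near disaster" case: the data is
> > > > >              * freed, never xmitted */
> > > > >             m_freem(*m_headp);
> > > > >             *m_headp = NULL;
> > > > >             return (ENOBUFS);
> > > > >         }
> > > > >         *m_headp = m;
> > > > >         error = bus_dmamap_load_mbuf_sg(sc->txtag, map,
> > > > >             *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
> > > > >     }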
> > > > >
> > > >
> > > > I think the actual number of DMA segments allocated for the
> > > > mbuf chain is determined by bus_dma(9).  bus_dma(9) will
> > > > coalesce the current segment with the previous segment if
> > > > possible.
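> > > > As a sketch (not the literal MD code), the check is roughly:
> > > >
> > > >     if (curaddr == lastaddr &&
> > > >         segs[seg].ds_len + sgsize <= dmat->maxsegsz)
> > > >         segs[seg].ds_len += sgsize;     /* coalesce */
> > > >     else {
> > > >         seg++;                          /* new segment */
> > > >         segs[seg].ds_addr = curaddr;
> > > >         segs[seg].ds_len = sgsize;
> > > >     }
> > > >
> > > > so mbufs that happen to be physically contiguous may share one
> > > > segment, making the segment count less than the mbuf count.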
> > > >
> > > Ok, I'll have to take a look, but I thought that an array sized
> > > by "num_segs" is passed in as an argument. (And num_segs is set
> > > to either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER
> > > (32).)
> > > It looked to me like the ixgbe driver calls itself ix, so it
> > > isn't obvious to me which we are talking about. (I know that
> > > Daniel Braniss had an ix0 and ix1, which were fixed for NFS by
> > > disabling TSO.)
> > >
> >
> > It's ix(4).  ixgbe(4) is a different driver.
> >
> > > I'll admit I mostly looked at virtio's network driver, since that
> > > was the one being used by J David.
> > >
> > > Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
> > > been cropping up for quite a while, and I am just trying to find
> > > out why. (I have no hardware/software that exhibits the problem,
> > > so I can only look at the sources and ask others to try testing
> > > stuff.)
> > >
> > > > I'm not sure whether you're referring to ixgbe(4) or ix(4),
> > > > but I see that the total length of all the segments in ix(4)
> > > > is 65535, so it has no room for the ethernet/VLAN header of
> > > > the mbuf chain.  The driver should be fixed to transmit a
> > > > 64KB datagram.
> > > Well, if_hw_tsomax is set to 65535 by the generic code (the
> > > driver doesn't set it) and the code in tcp_output() seems to
> > > subtract the size of a tcp/ip header from that before passing
> > > data to the driver, so I think the mbuf chain passed to the
> > > driver will fit in one ip datagram. (I'd assume all sorts of
> > > stuff would break for TSO enabled drivers if that wasn't the
> > > case?)
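> > > Schematically (a sketch of the clamp, not the literal
> > > tcp_output() code):
> > >
> > >     /* limit the TSO burst so that payload + TCP/IP headers
> > >      * fits in if_hw_tsomax (65535); note that the
> > >      * ethernet/VLAN header is not included in this budget */
> > >     if (len + hdrlen > ifp->if_hw_tsomax)
> > >         len = ifp->if_hw_tsomax - hdrlen;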
> >
> > I believe the generic code is doing the right thing.  I'm under
> > the impression that non-working TSO indicates a bug in the
> > driver.  Some drivers didn't account for the additional
> > ethernet/VLAN header, so the total size of the DMA segments
> > exceeded 65535.  I've attached a diff for ix(4).  It wasn't
> > tested at all, as I don't have hardware to test.
> >
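> > (For anyone following along: the usual shape of such a fix, as a
> > hypothetical sketch and not the attached diff itself, is to size
> > the TSO DMA tag for the full datagram plus the link-level header:
> >
> >     /* 65535 + 18 bytes of ethernet + VLAN encapsulation */
> >     #define XXX_TSO_SIZE \
> >         (65535 + sizeof(struct ether_vlan_header))
> >
> > so a maximal TSO chain still fits under the tag's maxsize once
> > the ethernet/VLAN header is prepended.)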
> > >
> > > > I think the use of m_defrag(9) in TSO is suboptimal.  All TSO
> > > > capable controllers are able to handle multiple TX buffers, so
> > > > the driver should have used m_collapse(9) rather than copying
> > > > the entire chain with m_defrag(9).
> > > >
> > > I haven't looked at these closely yet (I plan on doing so
> > > to-day), but even m_collapse() looked like it copied data
> > > between mbufs, and that is certainly suboptimal, imho. I don't
> > > see why a driver can't split the mbuf list, if there are too
> > > many entries for the scatter/gather, and do it in two iterations
> > > (much like tcp_output() does already, since the data length
> > > exceeds 65535 - tcp/ip header size).
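> > > Something like this sketch, say (hypothetical; xxx_encap() is
> > > made up, and a real version would have to split on an MSS
> > > boundary and give the second chain its own headers):
> > >
> > >     /* too many segments: send the chain as two pieces
> > >      * instead of defragging the whole thing */
> > >     struct mbuf *n = m_split(m, m->m_pkthdr.len / 2, M_NOWAIT);
> > >     if (n == NULL)
> > >         return (ENOBUFS);
> > >     error = xxx_encap(sc, &m);
> > >     if (error == 0)
> > >         error = xxx_encap(sc, &n);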
> > >
> >
> > It can split the mbuf list if the controller supports an
> > increased number of TX buffers.  But because the controller would
> > consume the same number of DMA descriptors for the mbuf list,
> > drivers tend to impose a limit on the number of TX buffers to
> > save resources.
> >
> > > However, at this point, I just want to find out if the long chain
> > > of mbufs is why TSO is problematic for these drivers, since I'll
> > > admit I'm getting tired of telling people to disable TSO (and I
> > > suspect some don't believe me and never try it).
> > >
> >
> > TSO capable controllers tend to have various limitations (the
> > first TX buffer should contain the complete ethernet/IP/TCP
> > header, the ip_len of the IP header should be reset to 0, the TCP
> > pseudo checksum should be recomputed, etc.), and cheap controllers
> > need more assistance from the driver to let their firmware know
> > the various IP/TCP header offset locations in the mbuf.  Because
> > this requires IP/TCP header parsing, it's error prone and very
> > complex.
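> > The usual driver-side fixups look roughly like this (sketched
> > from the common pattern rather than any specific driver; ehdrlen
> > is the ethernet/VLAN header length):
> >
> >     struct ip *ip = (struct ip *)(mtod(m, char *) + ehdrlen);
> >     struct tcphdr *th =
> >         (struct tcphdr *)((caddr_t)ip + (ip->ip_hl << 2));
> >
> >     ip->ip_len = 0;         /* hw rewrites per-segment length */
> >     th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr,
> >         htons(IPPROTO_TCP));    /* pseudo cksum, no length */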
> >
> > > > > Anyhow, I have attached a patch that makes NFS use
> > > > > MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18.
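> > > > > (The arithmetic behind 34 vs. 18: 64K of data in 2K
> > > > > MCLBYTES clusters takes 32 clusters plus a couple of small
> > > > > header mbufs, i.e. 34; 4K MJUMPAGESIZE clusters take 16
> > > > > plus 2, i.e. 18.  As a sketch of the idea, not the patch
> > > > > itself, the allocation becomes
> > > > >
> > > > >     m = m_getjcl(M_WAITOK, MT_DATA, 0, MJUMPAGESIZE);
> > > > >
> > > > > instead of m_getcl(M_WAITOK, MT_DATA, 0), which returns a
> > > > > 2K MCLBYTES cluster.)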
> > > > >
> > > >
> > > > Could we make it conditional on size?
> > > >
> > > Not sure what you mean? If you mean "the size of the
> > > read/write", that would be possible for NFSv3, but less so for
> > > NFSv4. (The read/write is just one Op. in the compound for
> > > NFSv4, and there is no way to predict how much more data is
> > > going to be generated by subsequent Ops.)
> > >
> >
> > Sorry, I should have been clearer.  You already answered my
> > question.  Thanks.
> >
> > > If by "size" you mean the amount of memory in the machine then,
> > > yes, it certainly could be conditional on that. (I plan to try
> > > and look at the allocator to-day as well, but if others know of
> > > disadvantages with using MJUMPAGESIZE instead of MCLBYTES,
> > > please speak up.)
> > >
> > > Garrett Wollman already alluded to the MCLBYTES case being
> > > pre-allocated, but I'll admit I have no idea what the
> > > implications of that are at this time.
> > >
> > > > > If anyone has a TSO scatter/gather enabled net interface and
> > > > > can test this patch on it with NFS I/O (default of 64K
> > > > > rsize/wsize) when TSO is enabled and see what effect it has,
> > > > > that would be appreciated.
> > > > >
> > > > > Btw, thanks go to Garrett Wollman for suggesting the change
> > > > > to MJUMPAGESIZE clusters.
> > > > >
> > > > > rick
> > > > > ps: If the attachment doesn't make it through and you want
> > > > >     the patch, just email me and I'll send you a copy.
> > > > >
> >
> 


