From owner-freebsd-net@FreeBSD.ORG Sun Feb  2 16:15:39 2014
Date: Sun, 2 Feb 2014 11:15:30 -0500 (EST)
From: Rick Macklem
To: Daniel Braniss
Cc: Pyun YongHyeon, FreeBSD Net, Adam McDougall, Jack Vogel
Message-ID: <906704123.1485103.1391357730899.JavaMail.root@uoguelph.ca>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
List-Id: Networking and TCP/IP with FreeBSD

Daniel Braniss wrote:
> hi Rick, et al.
>
> tried your patch but it didn't help, the server is stuck.
>
Oh well. I was hoping that was going to make TSO work reliably.
Just to confirm it, this server works reliably when TSO is disabled?

Thanks for doing the testing, rick

> just for fun, I tried a different client/host, this one has a
> Broadcom NetXtreme II that was MFC'ed lately, and the results are
> worse than the Intel (5 hrs instead of 4 hrs), but faster without
> TSO.
>
> with TSO enabled and bs=32k:
> 5.09 hrs	18325.62 real  1109.23 user  4591.60 sys
>
> without TSO:
> 4.75 hrs	17120.40 real  1114.08 user  3537.61 sys
>
> So what is the advantage of using TSO? (no complaint here, just
> curious)
>
> I'll try to see if, as a server, it has the same TSO-related issues.
>
> cheers,
> 	danny
>
> On Jan 28, 2014, at 3:51 AM, Rick Macklem wrote:
>
> > Jack Vogel wrote:
> >> That header file is for the VF driver :) which I don't believe is
> >> being used in this case.
> >> The driver is capable of handling 256K but it's limited by the
> >> stack to 64K (look in ixgbe.h), so it's not a few bytes off due
> >> to the vlan header.
> >>
> >> The scatter size is not an arbitrary one, it's due to hardware
> >> limitations in Niantic (82599). Turning off TSO in the 10G
> >> environment is not practical, you will have trouble getting good
> >> performance.
> >>
> >> Jack
> >>
> > Well, if you look at this thread, Daniel got much better
> > performance by turning off TSO. However, I agree that this is not
> > an ideal solution.
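
For anyone who wants to reproduce the comparison: TSO can be toggled
per interface at run time, so no rebuild is needed. The interface name
below is just an example; use whichever one your NIC attaches as.

    # disable TSO on one interface for testing
    ifconfig ix0 -tso
    # re-enable it afterwards
    ifconfig ix0 tso
    # or disable TSO globally in the TCP stack
    sysctl net.inet.tcp.tso=0

To keep the per-interface setting across reboots, "-tso" can be
appended to the corresponding ifconfig_<ifname> line in /etc/rc.conf.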
> > http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
> >
> > rick
> >
> >>
> >> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN wrote:
> >>
> >>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
> >>>> pyunyh@gmail.com wrote:
> >>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
> >>>>>> Adam McDougall wrote:
> >>>>>>> Also try rsize=32768,wsize=32768 in your mount options; it
> >>>>>>> made a huge difference for me. I've noticed slow file
> >>>>>>> transfers on NFS in 9 and finally did some searching a
> >>>>>>> couple months ago; someone suggested it and they were on to
> >>>>>>> something.
> >>>>>>>
> >>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
> >>>>>> perform poorly in some network environments.
> >>>>>> A 64K NFS read reply/write request consists of a list of 34
> >>>>>> mbufs when passed to TCP via sosend() and a total data length
> >>>>>> of around 65680 bytes.
> >>>>>> Looking at a couple of drivers (virtio and ixgbe), they seem
> >>>>>> to expect no more than 32-33 mbufs in a list for a 65535 byte
> >>>>>> TSO xmit. I think (I don't have anything that does TSO to
> >>>>>> confirm this) that NFS will pass a list that is longer (34
> >>>>>> plus a TCP/IP header).
> >>>>>> At a glance, it appears that the drivers call m_defrag() or
> >>>>>> m_collapse() when the mbuf list won't fit in their scatter
> >>>>>> table (32 or 33 elements) and, if this fails, just silently
> >>>>>> drop the data without sending it.
> >>>>>> If I'm right, there would be considerable overhead from
> >>>>>> m_defrag()/m_collapse() and near disaster if they fail to fix
> >>>>>> the problem and the data is silently dropped instead of
> >>>>>> xmited.
> >>>>>>
> >>>>>
> >>>>> I think the actual number of DMA segments allocated for the
> >>>>> mbuf chain is determined by bus_dma(9). bus_dma(9) will
> >>>>> coalesce the current segment with the previous segment if
> >>>>> possible.
> >>>>>
> >>>> Ok, I'll have to take a look, but I thought that an array sized
> >>>> by "num_segs" is passed in as an argument. (And num_segs is set
> >>>> to either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER
> >>>> (32).)
> >>>> It looked to me that the ixgbe driver called itself ix, so it
> >>>> isn't obvious to me which we are talking about. (I know that
> >>>> Daniel Braniss had an ix0 and ix1, which were fixed for NFS by
> >>>> disabling TSO.)
> >>>>
> >>>
> >>> It's ix(4). ixgbe(4) is a different driver.
> >>>
> >>>> I'll admit I mostly looked at virtio's network driver, since
> >>>> that was the one being used by J David.
> >>>>
> >>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have
> >>>> been cropping up for quite a while, and I am just trying to
> >>>> find out why. (I have no hardware/software that exhibits the
> >>>> problem, so I can only look at the sources and ask others to
> >>>> try testing stuff.)
> >>>>
> >>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4),
> >>>>> but I see the total length of all segment sizes in ix(4) is
> >>>>> 65535, so there is no room for the ethernet/VLAN header of the
> >>>>> mbuf chain. The driver should be fixed to transmit a 64KB
> >>>>> datagram.
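
To make the failure mode concrete, here is a rough sketch of the
transmit-path pattern being described. It is not taken from ix(4),
virtio or any other real driver; the xx_ names, the softc layout and
XX_MAX_SCATTER are made up, only the mbuf(9) and bus_dma(9) calls are
real:

    /*
     * Sketch of the common "map, defrag on EFBIG, drop on failure"
     * pattern.  Assumes a per-queue DMA tag/map in the (hypothetical)
     * softc.
     */
    static int
    xx_encap(struct xx_softc *sc, struct mbuf **m_head)
    {
        bus_dma_segment_t segs[XX_MAX_SCATTER]; /* e.g. 32 on 82599 */
        struct mbuf *m;
        int error, nsegs;

        error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag, sc->xx_tx_map,
            *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG) {
            /*
             * The chain has more fragments than the scatter table can
             * hold.  m_defrag() copies the whole chain into fresh
             * clusters; some drivers use
             * m_collapse(*m_head, M_NOWAIT, XX_MAX_SCATTER) instead,
             * which only merges adjacent mbufs.
             */
            m = m_defrag(*m_head, M_NOWAIT);
            if (m == NULL) {
                m_freem(*m_head);
                *m_head = NULL;
                return (ENOBUFS);   /* packet silently dropped */
            }
            *m_head = m;
            error = bus_dmamap_load_mbuf_sg(sc->xx_tx_tag,
                sc->xx_tx_map, *m_head, segs, &nsegs, BUS_DMA_NOWAIT);
        }
        if (error != 0) {
            m_freem(*m_head);
            *m_head = NULL;
            return (error);
        }
        /* ... hand segs[0 .. nsegs-1] to the hardware ... */
        return (0);
    }

If the retry still fails, or m_defrag() cannot allocate a replacement
chain, the data never reaches the wire and TCP only recovers via
retransmission, which would explain both the extra overhead and the
stalls when it goes wrong.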
> >>>> Well, if_hw_tsomax is set to 65535 by the generic code (the
> >>>> driver doesn't set it) and the code in tcp_output() seems to
> >>>> subtract the size of a tcp/ip header from that before passing
> >>>> data to the driver, so I think the mbuf chain passed to the
> >>>> driver will fit in one ip datagram. (I'd assume all sorts of
> >>>> stuff would break for TSO enabled drivers if that wasn't the
> >>>> case?)
> >>>
> >>> I believe the generic code is doing the right thing. I'm under
> >>> the impression the non-working TSO indicates a bug in the
> >>> driver. Some drivers didn't account for the additional
> >>> ethernet/VLAN header, so the total size of the DMA segments
> >>> exceeded 65535. I've attached a diff for ix(4). It wasn't tested
> >>> at all, as I don't have hardware to test.
> >>>
> >>>>
> >>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
> >>>>> capable controllers are able to handle multiple TX buffers, so
> >>>>> it should have used m_collapse(9) rather than copying the
> >>>>> entire chain with m_defrag(9).
> >>>>>
> >>>> I haven't looked at these closely yet (plan on doing so
> >>>> to-day), but even m_collapse() looked like it copied data
> >>>> between mbufs and that is certainly suboptimal, imho. I don't
> >>>> see why a driver can't split the mbuf list, if there are too
> >>>> many entries for the scatter/gather, and do it in two
> >>>> iterations (much like tcp_output() does already, since the data
> >>>> length exceeds 65535 - tcp/ip header size).
> >>>>
> >>>
> >>> It can split the mbuf list if the controller supports an
> >>> increased number of TX buffers. Because the controller will
> >>> consume the same number of DMA descriptors for the mbuf list,
> >>> drivers tend to impose a limit on the number of TX buffers to
> >>> save resources.
> >>>
> >>>> However, at this point, I just want to find out if the long
> >>>> chain of mbufs is why TSO is problematic for these drivers,
> >>>> since I'll admit I'm getting tired of telling people to disable
> >>>> TSO (and I suspect some don't believe me and never try it).
> >>>>
> >>>
> >>> TSO capable controllers tend to have various limitations (the
> >>> first TX buffer should have the complete ethernet/IP/TCP header,
> >>> ip_len of the IP header should be reset to 0, the TCP pseudo
> >>> checksum should be recomputed, etc.) and cheap controllers need
> >>> more assistance from the driver to let their firmware know the
> >>> various IP/TCP header offset locations in the mbuf. Because this
> >>> requires IP/TCP header parsing, it's error prone and very
> >>> complex.
> >>>
> >>>>>> Anyhow, I have attached a patch that makes NFS use
> >>>>>> MJUMPAGESIZE clusters, so the mbuf count drops from 34 to 18.
> >>>>>>
> >>>>>
> >>>>> Could we make it conditional on size?
> >>>>>
> >>>> Not sure what you mean? If you mean "the size of the
> >>>> read/write", that would be possible for NFSv3, but less so for
> >>>> NFSv4. (The read/write is just one Op. in the compound for
> >>>> NFSv4 and there is no way to predict how much more data is
> >>>> going to be generated by subsequent Ops.)
> >>>>
> >>>
> >>> Sorry, I should have been clearer. You already answered my
> >>> question. Thanks.
> >>>
> >>>> If by "size" you mean the amount of memory in the machine then,
> >>>> yes, it certainly could be conditional on that.
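
Regarding the 34 -> 18 mbuf count mentioned above, the numbers fall
out of simple arithmetic. A throwaway userland check (assuming the 64K
of file data rides in full clusters plus roughly two small mbufs for
the RPC header; purely illustrative):

    #include <stdio.h>

    int
    main(void)
    {
        const int data = 64 * 1024;     /* 64K NFS read/write */
        const int hdr_mbufs = 2;        /* RPC/NFS header mbufs, roughly */

        /* MCLBYTES (2K) vs. MJUMPAGESIZE (4K on 4K-page machines) */
        printf("2K clusters: %d + %d = %d mbufs\n",
            data / 2048, hdr_mbufs, data / 2048 + hdr_mbufs);
        printf("4K clusters: %d + %d = %d mbufs\n",
            data / 4096, hdr_mbufs, data / 4096 + hdr_mbufs);
        return (0);
    }

That prints 32 + 2 = 34 and 16 + 2 = 18. With 2K clusters the chain,
plus the TCP/IP header mbuf prepended by tcp_output(), won't fit a
32-entry scatter table (modulo the bus_dma(9) coalescing mentioned
earlier), while with 4K page-size clusters it fits comfortably.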
> >>>> (I plan to try and look at the allocator to-day as well, but
> >>>> if others know of disadvantages with using MJUMPAGESIZE instead
> >>>> of MCLBYTES, please speak up.)
> >>>>
> >>>> Garrett Wollman already alluded to the MCLBYTES case being
> >>>> pre-allocated, but I'll admit I have no idea what the
> >>>> implications of that are at this time.
> >>>>
> >>>>>> If anyone has a TSO scatter/gather enabled net interface and
> >>>>>> can test this patch on it with NFS I/O (default of 64K
> >>>>>> rsize/wsize) when TSO is enabled and see what effect it has,
> >>>>>> that would be appreciated.
> >>>>>>
> >>>>>> Btw, thanks go to Garrett Wollman for suggesting the change
> >>>>>> to MJUMPAGESIZE clusters.
> >>>>>>
> >>>>>> rick
> >>>>>> ps: If the attachment doesn't make it through and you want
> >>>>>> the patch, just email me and I'll send you a copy.
> >>>>>>
>
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to
> "freebsd-net-unsubscribe@freebsd.org"
>