From owner-freebsd-net@FreeBSD.ORG  Sun Feb  2 11:06:20 2014
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
From: Daniel Braniss
To: Rick Macklem
Cc: Pyun YongHyeon, FreeBSD Net, Adam McDougall, Jack Vogel
Date: Sun, 2 Feb 2014 13:04:31 +0200
In-Reply-To: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
References: <482557096.17290094.1390873872231.JavaMail.root@uoguelph.ca>
List-Id: Networking and TCP/IP with FreeBSD

hi Rick, et al.
	tried your patch but it didn't help; the server is stuck.

just for fun, I tried a different client/host. This one has a Broadcom
NetXtreme II that was MFC'ed recently, and the results are worse than
with the Intel (5 hrs instead of 4 hrs), but it is still faster without
TSO.

with TSO enabled and bs=32k: 5.09 hrs
    18325.62 real      1109.23 user      4591.60 sys
without TSO: 4.75 hrs
    17120.40 real      1114.08 user      3537.61 sys

So what is the advantage of using TSO? (no complaint here, just curious)

I'll try to see whether it has the same TSO-related issues when acting
as a server.

cheers,
	danny

On Jan 28, 2014, at 3:51 AM, Rick Macklem wrote:

> Jack Vogel wrote:
>> That header file is for the VF driver :) which I don't believe is
>> being used in this case.
>> The driver is capable of handling 256K but it's limited by the stack
>> to 64K (look in ixgbe.h), so it's not a few bytes off due to the vlan
>> header.
>>
>> The scatter size is not an arbitrary one, it's due to hardware
>> limitations in Niantic (82599). Turning off TSO in the 10G
>> environment is not practical; you will have trouble getting good
>> performance.
>>
>> Jack
>>
> Well, if you look at this thread, Daniel got much better performance
> by turning off TSO. However, I agree that this is not an ideal solution.
> http://docs.FreeBSD.org/cgi/mid.cgi?2C287272-7B57-4AAD-B22F-6A65D9F8677B
>
> rick
>
>> On Mon, Jan 27, 2014 at 4:58 PM, Yonghyeon PYUN wrote:
>>
>>> On Mon, Jan 27, 2014 at 06:27:19PM -0500, Rick Macklem wrote:
>>>> pyunyh@gmail.com wrote:
>>>>> On Sun, Jan 26, 2014 at 09:16:54PM -0500, Rick Macklem wrote:
>>>>>> Adam McDougall wrote:
>>>>>>> Also try rsize=32768,wsize=32768 in your mount options; it
>>>>>>> made a huge difference for me.
>>>>>>> I've noticed slow file transfers on NFS in 9 and finally did
>>>>>>> some searching a couple of months ago; someone suggested it and
>>>>>>> they were on to something.
>>>>>>>
>>>>>> I have a "hunch" that might explain why 64K NFS reads/writes
>>>>>> perform poorly in some network environments.
>>>>>> A 64K NFS read reply/write request consists of a list of 34 mbufs
>>>>>> when passed to TCP via sosend(), with a total data length of
>>>>>> around 65680 bytes. Looking at a couple of drivers (virtio and
>>>>>> ixgbe), they seem to expect no more than 32-33 mbufs in a list
>>>>>> for a 65535 byte TSO xmit. I think (I don't have anything that
>>>>>> does TSO to confirm this) that NFS will pass a list that is
>>>>>> longer (34 plus a TCP/IP header).
>>>>>> At a glance, it appears that the drivers call m_defrag() or
>>>>>> m_collapse() when the mbuf list won't fit in their scatter table
>>>>>> (32 or 33 elements) and, if this fails, just silently drop the
>>>>>> data without sending it.
>>>>>> If I'm right, there would be considerable overhead from
>>>>>> m_defrag()/m_collapse(), and near disaster if they fail to fix
>>>>>> the problem and the data is silently dropped instead of xmitted.
>>>>>>
>>>>> I think the actual number of DMA segments allocated for the mbuf
>>>>> chain is determined by bus_dma(9). bus_dma(9) will coalesce the
>>>>> current segment with the previous segment if possible.
>>>>>
>>>> Ok, I'll have to take a look, but I thought that an array sized by
>>>> "num_segs" is passed in as an argument. (And num_segs is set to
>>>> either IXGBE_82598_SCATTER (100) or IXGBE_82599_SCATTER (32).)
>>>> It looked to me that the ixgbe driver called itself ix, so it isn't
>>>> obvious to me which one we are talking about. (I know that Daniel
>>>> Braniss had an ix0 and ix1, which were fixed for NFS by disabling
>>>> TSO.)
>>>>
>>> It's ix(4). ixgb(4) is a different driver.
>>>
>>>> I'll admit I mostly looked at virtio's network driver, since that
>>>> was the one being used by J David.
>>>>
>>>> Problems w.r.t. TSO enabled for NFS using 64K rsize/wsize have been
>>>> cropping up for quite a while, and I am just trying to find out
>>>> why. (I have no hardware/software that exhibits the problem, so I
>>>> can only look at the sources and ask others to try testing stuff.)
>>>>
>>>>> I'm not sure whether you're referring to ixgbe(4) or ix(4), but I
>>>>> see that the total length of all segments in ix(4) is limited to
>>>>> 65535, so it has no room for the ethernet/VLAN header of the mbuf
>>>>> chain. The driver should be fixed to transmit a 64KB datagram.
>>>> Well, if_hw_tsomax is set to 65535 by the generic code (the driver
>>>> doesn't set it) and the code in tcp_output() seems to subtract the
>>>> size of a tcp/ip header from that before passing data to the
>>>> driver, so I think the mbuf chain passed to the driver will fit in
>>>> one ip datagram. (I'd assume all sorts of stuff would break for TSO
>>>> enabled drivers if that wasn't the case?)
>>>
>>> I believe the generic code is doing the right thing. I'm under the
>>> impression that the non-working TSO indicates a bug in the driver.
>>> Some drivers didn't account for the additional ethernet/VLAN header,
>>> so the total size of the DMA segments exceeded 65535. I've attached
>>> a diff for ix(4). It wasn't tested at all, as I don't have hardware
>>> to test.
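
Pyun's diff was sent as an attachment and is not reproduced in this
message. Purely as an illustration of the idea being discussed, the
sketch below (hypothetical code, not the actual ix(4) change) caps the
advertised TSO size so that the Ethernet/VLAN header still fits under
the adapter's 65535-byte limit; the field name if_hw_tsomax comes from
Rick's description above, while example_set_tso_limit() is an invented
name.

/*
 * Hypothetical sketch only -- not the attached diff.  Reserve room for
 * an Ethernet + VLAN header inside the 65535-byte TSO limit, so the
 * total length of the DMA segments handed to the NIC (link header plus
 * TSO burst) cannot exceed what the hardware accepts.
 */
#include <sys/param.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <net/ethernet.h>	/* ETHER_HDR_LEN, ETHER_VLAN_ENCAP_LEN */
#include <netinet/in.h>
#include <netinet/ip.h>		/* IP_MAXPACKET (65535) */

/* invented helper name; something like this would run at attach time */
static void
example_set_tso_limit(struct ifnet *ifp)
{
	/*
	 * The generic code defaults if_hw_tsomax to 65535 (IP_MAXPACKET).
	 * Subtract the link-level header so header + TSO payload still
	 * fits in the adapter's total-length limit.
	 */
	ifp->if_hw_tsomax = IP_MAXPACKET -
	    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
}

Note that this only addresses the total-length limit; the 32-33 entry
scatter limit Rick describes above is a separate issue.
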
>>>
>>>>
>>>>> I think the use of m_defrag(9) in TSO is suboptimal. All TSO
>>>>> capable controllers are able to handle multiple TX buffers, so it
>>>>> should have used m_collapse(9) rather than copying the entire
>>>>> chain with m_defrag(9).
>>>>>
>>>> I haven't looked at these closely yet (plan on doing so to-day),
>>>> but even m_collapse() looked like it copied data between mbufs and
>>>> that is certainly suboptimal, imho. I don't see why a driver can't
>>>> split the mbuf list if there are too many entries for the
>>>> scatter/gather table and do it in two iterations (much like
>>>> tcp_output() does already, since the data length exceeds 65535 -
>>>> tcp/ip header size).
>>>>
>>> It can split the mbuf list if the controller supports an increased
>>> number of TX buffers. Because the controller will consume the same
>>> number of DMA descriptors for the mbuf list, drivers tend to impose
>>> a limit on the number of TX buffers to save resources.
>>>
>>>> However, at this point, I just want to find out if the long chain
>>>> of mbufs is why TSO is problematic for these drivers, since I'll
>>>> admit I'm getting tired of telling people to disable TSO (and I
>>>> suspect some don't believe me and never try it).
>>>>
>>> TSO capable controllers tend to have various limitations (the first
>>> TX buffer should contain the complete ethernet/IP/TCP headers, the
>>> ip_len of the IP header should be reset to 0, the TCP pseudo
>>> checksum should be recomputed, etc.), and cheap controllers need
>>> more assistance from the driver to let their firmware know the
>>> various IP/TCP header offset locations in the mbuf. Because this
>>> requires IP/TCP header parsing, it's error prone and very complex.
>>>
>>>>>> Anyhow, I have attached a patch that makes NFS use MJUMPAGESIZE
>>>>>> clusters, so the mbuf count drops from 34 to 18.
>>>>>>
>>>>> Could we make it conditional on size?
>>>>>
>>>> Not sure what you mean? If you mean "the size of the read/write",
>>>> that would be possible for NFSv3, but less so for NFSv4. (The
>>>> read/write is just one Op. in the compound for NFSv4 and there is
>>>> no way to predict how much more data is going to be generated by
>>>> subsequent Ops.)
>>>>
>>> Sorry, I should have been clearer. You already answered my question.
>>> Thanks.
>>>
>>>> If by "size" you mean the amount of memory in the machine then,
>>>> yes, it certainly could be conditional on that. (I plan to try and
>>>> look at the allocator to-day as well, but if others know of
>>>> disadvantages with using MJUMPAGESIZE instead of MCLBYTES, please
>>>> speak up.)
>>>>
>>>> Garrett Wollman already alluded to the MCLBYTES case being
>>>> pre-allocated, but I'll admit I have no idea what the implications
>>>> of that are at this time.
>>>>
>>>>>> If anyone has a TSO scatter/gather enabled net interface and can
>>>>>> test this patch on it with NFS I/O (default of 64K rsize/wsize)
>>>>>> when TSO is enabled and see what effect it has, that would be
>>>>>> appreciated.
>>>>>>
>>>>>> Btw, thanks go to Garrett Wollman for suggesting the change to
>>>>>> MJUMPAGESIZE clusters.
>>>>>>
>>>>>> rick
>>>>>> ps: If the attachment doesn't make it through and you want the
>>>>>> patch, just email me and I'll send you a copy.
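
Rick's patch was likewise an attachment and is not included here. The
sketch below only illustrates the arithmetic behind the 34-to-18 drop
he mentions: backing a roughly 65680-byte reply with 4K page-size
(MJUMPAGESIZE) clusters instead of 2K (MCLBYTES) clusters roughly
halves the number of mbufs in the chain. example_alloc_reply_chain()
is a hypothetical helper, not a function from the NFS code or from the
patch itself.

/*
 * Illustrative sketch, not the attached patch.  A ~65680-byte reply
 * needs about 33 MCLBYTES (2K) clusters, i.e. a 34-mbuf chain once the
 * header mbuf is counted, but only about 17 MJUMPAGESIZE (4K) clusters,
 * i.e. an 18-mbuf chain -- consistent with the 34 -> 18 drop above.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mbuf.h>

/* hypothetical helper for illustration only */
static struct mbuf *
example_alloc_reply_chain(int len, int use_pagesize_clusters)
{
	struct mbuf *top, *tail, *m;
	int clsize, left;

	clsize = use_pagesize_clusters ? MJUMPAGESIZE : MCLBYTES;
	top = tail = NULL;
	for (left = len; left > 0; left -= m->m_len) {
		/* first mbuf in the chain carries the packet header */
		if (clsize == MCLBYTES)
			m = m_getcl(M_WAITOK, MT_DATA,
			    top == NULL ? M_PKTHDR : 0);
		else
			m = m_getjcl(M_WAITOK, MT_DATA,
			    top == NULL ? M_PKTHDR : 0, clsize);
		m->m_len = MIN(left, clsize);
		if (top == NULL)
			top = m;
		else
			tail->m_next = m;
		tail = m;
	}
	if (top != NULL)
		top->m_pkthdr.len = len;
	return (top);
}

Whether page-size clusters have drawbacks of their own (for example
versus the pre-allocated MCLBYTES pool Garrett Wollman alluded to) is
exactly the open question in the exchange above.
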