From owner-freebsd-net@FreeBSD.ORG Thu Jul 17 11:46:16 2014
From: Julian Elischer <julian@freebsd.org>
Date: Thu, 17 Jul 2014 19:45:56 +0800
To: John Baldwin, Rick Macklem
Cc: pyunyh@gmail.com, "Russell L. Carter", freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Message-ID: <53C7B774.60304@freebsd.org>
In-Reply-To: <201407151034.54681.jhb@freebsd.org>
References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
 <201407151034.54681.jhb@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On 7/15/14, 10:34 PM, John Baldwin wrote:
> On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
>> Yonghyeon Pyun wrote:
>>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
>>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
>>>>> John Baldwin wrote:
>>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
>>>>>>> Russell L. Carter wrote:
>>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
>>>>>>>>> Could you please post the dmesg stuff for the network
>>>>>>>>> interface, so I can tell what driver is being used? I'll
>>>>>>>>> take a look at it, in case it needs to be changed to use
>>>>>>>>> m_defrag().
>>>>>>>> em0: port 0xd020-0xd03f mem
>>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
>>>>>>>> device 0.0 on pci2
>>>>>>>> em0: Using an MSI interrupt
>>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
>>>>>>>> 001.000007 [2323] netmap_attach success for em0 tx 1/1024
>>>>>>>> rx 1/1024 queues/slots
>>>>>>>>
>>>>>>>> This is one of those dual-NIC cards, so there is em1 as
>>>>>>>> well...
>>>>>>>>
>>>>>>> Well, I took a quick look at the driver and it does use
>>>>>>> m_defrag(), but I think that the "retry:" label it does a
>>>>>>> goto to after doing so might be in the wrong place.
>>>>>>>
>>>>>>> The attached untested patch might fix this.
>>>>>>>
>>>>>>> Is it convenient to build a kernel with this patch applied
>>>>>>> and then try it with TSO enabled?
>>>>>>>
>>>>>>> rick
>>>>>>> ps: It does have the transmit segment limit set to 32. I have
>>>>>>> no idea if this is a hardware limitation.
>>>>>> I think the retry is not in the wrong place, but the overhead
>>>>>> of all those pullups is apparently quite severe.
>>>>> The m_defrag() call after the first failure will just barely
>>>>> squeeze the just-under-64K TSO segment into 32 mbuf clusters
>>>>> (32 x 2K MCLBYTES = 64K). Then I think any m_pullup() done
>>>>> during the retry will allocate an mbuf (at a glance it seems
>>>>> to always do this when the old mbuf is a cluster) and prepend
>>>>> that to the list.
>>>>> --> Now the list is > 32 mbufs again and the
>>>>>     bus_dmamap_load_mbuf_sg() will fail again on the retry,
>>>>>     this time fatally, I think?
>>>>>
>>>>> I can't see any reason to re-do all the stuff using
>>>>> m_pullup(), and Russell reported that moving the "retry:"
>>>>> fixed his problem, from what I understood.
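To make the shape of the fix concrete, here is roughly the pattern
being discussed, sketched from memory rather than taken from the
actual if_em.c, so the function name and the plumbing around it are
illustrative only. The point is that the "retry:" target sits after
the m_pullup() header work, so a chain compacted by m_defrag() goes
straight back to the DMA load instead of growing by another
prepended mbuf:

/*
 * Hypothetical sketch of the em(4) transmit path under discussion;
 * checksum/TSO setup and descriptor programming are elided, and the
 * real driver differs in detail.
 */
static int
em_xmit_sketch(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m_headp)
{
        bus_dma_segment_t segs[32];     /* EM_MAX_SCATTER == 32 */
        struct mbuf *m, *m_head = *m_headp;
        int error, nsegs, remap = 1;

        /*
         * All of the m_pullup() calls that gather the IP/TCP
         * headers into the first mbuf happen here, exactly once.
         */

retry:  /* the proposed patch moves this target below the pullups */
        error = bus_dmamap_load_mbuf_sg(tag, map, m_head, segs,
            &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG && remap) {
                /* Too many segments: copy the chain into clusters. */
                m = m_defrag(m_head, M_NOWAIT);
                if (m == NULL) {
                        m_freem(m_head);
                        *m_headp = NULL;
                        return (ENOBUFS);
                }
                *m_headp = m_head = m;
                remap = 0;
                goto retry;     /* reload only; do not redo pullups */
        }
        if (error != 0) {
                m_freem(m_head);
                *m_headp = NULL;
                return (error);
        }
        /* ...program transmit descriptors from segs[0..nsegs-1]... */
        return (0);
}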
>>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all
>>>> be nops in this case. It seems the NIC would really like to
>>>> have all those things in a single segment, but it is not
>>>> required, so I agree that your patch is fine.
>>>>
>>> I recall em(4) controllers have various limitations in TSO. The
>>> driver has to update the IP header to make TSO work, so it has
>>> to get writable mbufs. bpf(4) consumers will see an IP packet
>>> length of 0 after this change. I think tcpdump has a
>>> compile-time option to guess the correct IP packet length. The
>>> controller's firmware must also be able to access the complete
>>> IP/TCP header in a single buffer. I don't remember more details
>>> of the TSO limitations, but I guess you may be able to get more
>>> from the publicly available Intel data sheets.
>> I think that the patch should handle this ok. All of the
>> m_pullup() stuff gets done the first time. Then, if the result is
>> more than 32 mbufs in the list, m_defrag() is called to copy the
>> chain. This should result in all the header stuff being in the
>> first mbuf cluster, and the map call is done again with this list
>> of clusters. (Without the patch, m_pullup() would allocate
>> another prepended mbuf and make the chain more than 32 mbufs
>> again.)
> Hmm, I am surprised by the m_pullup() behavior: that it doesn't
> just notice that the first mbuf with a cluster has the desired
> data already and return without doing anything. That is, I'm
> surprised the first statement in m_pullup() isn't just:
>
>       if (n->m_len >= len)
>               return (n);

I seem to remember that the standard behaviour is for the caller to
do exactly that.
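To make that concrete, the usual idiom is to guard the call at the
call site, so the common case never enters m_pullup() at all. A
minimal example, where hlen stands for whatever contiguous header
length the code needs:

        /*
         * Typical caller-side guard: only call m_pullup() when the
         * first mbuf does not already hold hlen contiguous bytes.
         * On failure m_pullup() has freed the chain and returned
         * NULL.
         */
        if (m->m_len < hlen && (m = m_pullup(m, hlen)) == NULL)
                return (ENOBUFS);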