From owner-freebsd-net@FreeBSD.ORG Thu Jul 17 11:46:16 2014
From: Julian Elischer <julian@freebsd.org>
Date: Thu, 17 Jul 2014 19:45:56 +0800
To: John Baldwin, Rick Macklem
Cc: pyunyh@gmail.com, "Russell L. Carter", freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Message-ID: <53C7B774.60304@freebsd.org>
In-Reply-To: <201407151034.54681.jhb@freebsd.org>
References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca>
 <201407151034.54681.jhb@freebsd.org>
List-Id: Networking and TCP/IP with FreeBSD

On 7/15/14, 10:34 PM, John Baldwin wrote:
> On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote:
>> Yonghyeon Pyun wrote:
>>> On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote:
>>>> On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
>>>>> John Baldwin wrote:
>>>>>> On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
>>>>>>> Russell L. Carter wrote:
>>>>>>>> On 07/02/14 19:09, Rick Macklem wrote:
>>>>>>>>> Could you please post the dmesg stuff for the network
>>>>>>>>> interface, so I can tell what driver is being used? I'll
>>>>>>>>> take a look at it, in case it needs to be changed to use
>>>>>>>>> m_defrag().
>>>>>>>> em0: port 0xd020-0xd03f mem
>>>>>>>> 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at
>>>>>>>> device 0.0 on pci2
>>>>>>>> em0: Using an MSI interrupt
>>>>>>>> em0: Ethernet address: 00:15:17:bc:29:ba
>>>>>>>> 001.000007 [2323] netmap_attach success for em0 tx 1/1024
>>>>>>>> rx 1/1024 queues/slots
>>>>>>>>
>>>>>>>> This is one of those dual-NIC cards, so there is em1 as
>>>>>>>> well...
>>>>>>>>
>>>>>>> Well, I took a quick look at the driver and it does use
>>>>>>> m_defrag(), but I think that the "retry:" label it does a
>>>>>>> goto to after doing so might be in the wrong place.
>>>>>>>
>>>>>>> The attached untested patch might fix this.
>>>>>>>
>>>>>>> Is it convenient to build a kernel with this patch applied
>>>>>>> and then try it with TSO enabled?
>>>>>>>
>>>>>>> rick
>>>>>>> ps: It does have the transmit segment limit set to 32. I have
>>>>>>> no idea if this is a hardware limitation.
>>>>>> I think the retry is not in the wrong place, but the overhead
>>>>>> of all those pullups is apparently quite severe.
>>>>> The m_defrag() call after the first failure will just barely
>>>>> squeeze the just-under-64K TSO segment into 32 mbuf clusters
>>>>> (32 x 2K MCLBYTES = 64K). Then I think any m_pullup() done
>>>>> during the retry will allocate an mbuf (at a glance it seems
>>>>> to always do this when the old mbuf is a cluster) and prepend
>>>>> that to the list.
>>>>> --> Now the list is > 32 mbufs again and the
>>>>>     bus_dmamap_load_mbuf_sg() will fail again on the retry,
>>>>>     this time fatally, I think?
>>>>>
>>>>> I can't see any reason to re-do all the stuff using
>>>>> m_pullup(), and Russell reported that moving the "retry:"
>>>>> fixed his problem, from what I understood.
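To make the shape of the fix concrete, here is roughly the pattern
being discussed, sketched from memory rather than taken from the
actual if_em.c, so the function name and the plumbing around it are
illustrative only. The point is that the "retry:" target sits after
the m_pullup() header work, so a chain compacted by m_defrag() goes
straight back to the DMA load instead of growing by another
prepended mbuf:

/*
 * Hypothetical sketch of the em(4) transmit path under discussion;
 * checksum/TSO setup and descriptor programming are elided, and the
 * real driver differs in detail.
 */
static int
em_xmit_sketch(bus_dma_tag_t tag, bus_dmamap_t map, struct mbuf **m_headp)
{
        bus_dma_segment_t segs[32];     /* EM_MAX_SCATTER == 32 */
        struct mbuf *m, *m_head = *m_headp;
        int error, nsegs, remap = 1;

        /*
         * All of the m_pullup() calls that gather the IP/TCP
         * headers into the first mbuf happen here, exactly once.
         */

retry:  /* the proposed patch moves this target below the pullups */
        error = bus_dmamap_load_mbuf_sg(tag, map, m_head, segs,
            &nsegs, BUS_DMA_NOWAIT);
        if (error == EFBIG && remap) {
                /* Too many segments: copy the chain into clusters. */
                m = m_defrag(m_head, M_NOWAIT);
                if (m == NULL) {
                        m_freem(m_head);
                        *m_headp = NULL;
                        return (ENOBUFS);
                }
                *m_headp = m_head = m;
                remap = 0;
                goto retry;     /* reload only; do not redo pullups */
        }
        if (error != 0) {
                m_freem(m_head);
                *m_headp = NULL;
                return (error);
        }
        /* ...program transmit descriptors from segs[0..nsegs-1]... */
        return (0);
}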
>>>> Ah, I had assumed (incorrectly) that the m_pullup()s would all
>>>> be nops in this case. It seems the NIC would really like to
>>>> have all those things in a single segment, but it is not
>>>> required, so I agree that your patch is fine.
>>>>
>>> I recall em(4) controllers have various limitations in TSO. The
>>> driver has to update the IP header to make TSO work, so it has
>>> to get writable mbufs. bpf(4) consumers will see an IP packet
>>> length of 0 after this change. I think tcpdump has a
>>> compile-time option to guess the correct IP packet length. The
>>> controller's firmware must also be able to access the complete
>>> IP/TCP header in a single buffer. I don't remember more details
>>> of the TSO limitations, but I guess you may be able to get more
>>> from the publicly available Intel data sheets.
>> I think that the patch should handle this ok. All of the
>> m_pullup() stuff gets done the first time. Then, if the result is
>> more than 32 mbufs in the list, m_defrag() is called to copy the
>> chain. This should result in all the header stuff being in the
>> first mbuf cluster, and the map call is done again with this list
>> of clusters. (Without the patch, m_pullup() would allocate
>> another prepended mbuf and make the chain more than 32 mbufs
>> again.)
> Hmm, I am surprised by the m_pullup() behavior: that it doesn't
> just notice that the first mbuf with a cluster has the desired
> data already and return without doing anything. That is, I'm
> surprised the first statement in m_pullup() isn't just:
>
>       if (n->m_len >= len)
>               return (n);

I seem to remember that the standard behaviour is for the caller to
do exactly that.
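To make that concrete, the usual idiom is to guard the call at the
call site, so the common case never enters m_pullup() at all. A
minimal example, where hlen stands for whatever contiguous header
length the code needs:

        /*
         * Typical caller-side guard: only call m_pullup() when the
         * first mbuf does not already hold hlen contiguous bytes.
         * On failure m_pullup() has freed the chain and returned
         * NULL.
         */
        if (m->m_len < hlen && (m = m_pullup(m, hlen)) == NULL)
                return (ENOBUFS);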