From owner-freebsd-net@FreeBSD.ORG Tue Jul 15 15:38:12 2014 Return-Path: Delivered-To: freebsd-net@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 0D5AAD6C for ; Tue, 15 Jul 2014 15:38:12 +0000 (UTC) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 22B9222E7 for ; Tue, 15 Jul 2014 15:38:10 +0000 (UTC) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id CFE7AB97E; Tue, 15 Jul 2014 11:38:09 -0400 (EDT) From: John Baldwin To: Rick Macklem Subject: Re: NFS client READ performance on -current Date: Tue, 15 Jul 2014 10:34:54 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.4-CBSD-20140415; KDE/4.5.5; amd64; ; ) References: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca> In-Reply-To: <2136988575.13956627.1405199640153.JavaMail.root@uoguelph.ca> MIME-Version: 1.0 Content-Type: Text/Plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <201407151034.54681.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Tue, 15 Jul 2014 11:38:09 -0400 (EDT) Cc: pyunyh@gmail.com, "Russell L. Carter" , freebsd-net@freebsd.org X-BeenThere: freebsd-net@freebsd.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Networking and TCP/IP with FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 15 Jul 2014 15:38:12 -0000 On Saturday, July 12, 2014 5:14:00 pm Rick Macklem wrote: > Yonghyeon Pyun wrote: > > On Fri, Jul 11, 2014 at 09:54:23AM -0400, John Baldwin wrote: > > > On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote: > > > > John Baldwin wrote: > > > > > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote: > > > > > > Russell L. Carter wrote: > > > > > > > > > > > > > > > > > > > > > On 07/02/14 19:09, Rick Macklem wrote: > > > > > > > > > > > > > > > Could you please post the dmesg stuff for the network > > > > > > > > interface, > > > > > > > > so I can tell what driver is being used? I'll take a look > > > > > > > > at > > > > > > > > it, > > > > > > > > in case it needs to be changed to use m_defrag(). > > > > > > > > > > > > > > em0: port > > > > > > > 0xd020-0xd03f > > > > > > > mem > > > > > > > 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44 at > > > > > > > device 0.0 > > > > > > > on > > > > > > > pci2 > > > > > > > em0: Using an MSI interrupt > > > > > > > em0: Ethernet address: 00:15:17:bc:29:ba > > > > > > > 001.000007 [2323] netmap_attach success for em0 > > > > > > > tx > > > > > > > 1/1024 > > > > > > > rx > > > > > > > 1/1024 queues/slots > > > > > > > > > > > > > > This is one of those dual nic cards, so there is em1 as > > > > > > > well... > > > > > > > > > > > > > Well, I took a quick look at the driver and it does use > > > > > > m_defrag(), > > > > > > but > > > > > > I think that the "retry:" label it does a goto after doing so > > > > > > might > > > > > > be in > > > > > > the wrong place. > > > > > > > > > > > > The attached untested patch might fix this. > > > > > > > > > > > > Is it convenient to build a kernel with this patch applied > > > > > > and then > > > > > > try > > > > > > it with TSO enabled? > > > > > > > > > > > > rick > > > > > > ps: It does have the transmit segment limit set to 32. I have > > > > > > no > > > > > > idea if > > > > > > this is a hardware limitation. > > > > > > > > > > I think the retry is not in the wrong place, but the overhead > > > > > of all > > > > > those > > > > > pullups is apparently quite severe. > > > > The m_defrag() call after the first failure will just barely > > > > squeeze > > > > the just under 64K TSO segment into 32 mbuf clusters. Then I > > > > think any > > > > m_pullup() done during the retry will allocate an mbuf > > > > (at a glance it seems to always do this when the old mbuf is a > > > > cluster) > > > > and prepend that to the list. > > > > --> Now the list is > 32 mbufs again and the > > > > bus_dmammap_load_mbuf_sg() > > > > will fail again on the retry, this time fatally, I think? > > > > > > > > I can't see any reason to re-do all the stuff using m_pullup() > > > > and Russell > > > > reported that moving the "retry:" fixed his problem, from what I > > > > understood. > > > > > > Ah, I had assumed (incorrectly) that the m_pullup()s would all be > > > nops in this > > > case. It seems the NIC would really like to have all those things > > > in a single > > > segment, but it is not required, so I agree that your patch is > > > fine. > > > > > > > I recall em(4) controllers have various limitation in TSO. Driver > > has to update IP header to make TSO work so driver has to get a > > writable mbufs. bpf(4) consumers will see IP packet length is 0 > > after this change. I think tcpdump has a compile time option to > > guess correct IP packet length. The firmware of controller also > > should be able to access complete IP/TCP header in a single buffer. > > I don't remember more details in TSO limitation but I guess you may > > be able to get more details TSO limitation from publicly available > > Intel data sheet. > I think that the patch should handle this ok. All of the m_pullup() > stuff gets done the first time. Then, if the result is more than 32 > mbufs in the list, m_defrag() is called to copy the chain. This should > result in all the header stuff in the first mbuf cluster and the map > call is done again with this list of clusters. (Without the patch, > m_pullup() would allocate another prepended mbuf and make the chain > more than 32mbufs again.) Hmm, I am surprised by the m_pullup() behavior that it doesn't just notice that the first mbuf with a cluster has the desired data already and returns without doing anything. That is, I'm surprised the first statement in m_pullup() isn't just: if (n->m_len >= len) return (n); -- John Baldwin