From owner-freebsd-net@FreeBSD.ORG Fri Jul 11 16:41:37 2014
From: John Baldwin
To: Rick Macklem
Cc: "Russell L. Carter", freebsd-net@freebsd.org
Subject: Re: NFS client READ performance on -current
Date: Fri, 11 Jul 2014 09:54:23 -0400
Message-Id: <201407110954.23381.jhb@freebsd.org>
In-Reply-To: <1610703198.9975909.1405031503143.JavaMail.root@uoguelph.ca>

On Thursday, July 10, 2014 6:31:43 pm Rick Macklem wrote:
> John Baldwin wrote:
> > On Thursday, July 03, 2014 8:51:01 pm Rick Macklem wrote:
> > > Russell L. Carter wrote:
> > > > On 07/02/14 19:09, Rick Macklem wrote:
> > > > > Could you please post the dmesg stuff for the network
> > > > > interface, so I can tell what driver is being used? I'll
> > > > > take a look at it, in case it needs to be changed to use
> > > > > m_defrag().
> > > >
> > > > em0: port 0xd020-0xd03f
> > > > mem 0xfe4a0000-0xfe4bffff,0xfe480000-0xfe49ffff irq 44
> > > > at device 0.0 on pci2
> > > > em0: Using an MSI interrupt
> > > > em0: Ethernet address: 00:15:17:bc:29:ba
> > > > 001.000007 [2323] netmap_attach success for em0 tx 1/1024
> > > > rx 1/1024 queues/slots
> > > >
> > > > This is one of those dual nic cards, so there is em1 as well...
> > > >
> > > Well, I took a quick look at the driver and it does use m_defrag(),
> > > but I think the "retry:" label it does a goto to after calling it
> > > might be in the wrong place.
> > >
> > > The attached untested patch might fix this.
> > >
> > > Is it convenient to build a kernel with this patch applied and
> > > then try it with TSO enabled?
> > >
> > > rick
> > > ps: It does have the transmit segment limit set to 32. I have no
> > > idea if this is a hardware limitation.
> >
> > I think the retry is not in the wrong place, but the overhead of
> > all those pullups is apparently quite severe.
> The m_defrag() call after the first failure will just barely squeeze
> the just-under-64K TSO segment into 32 mbuf clusters.
> Then I think any m_pullup() done during the retry will allocate an
> mbuf (at a glance it seems to always do this when the old mbuf is a
> cluster) and prepend that to the list.
> --> Now the list is > 32 mbufs again and the
>     bus_dmamap_load_mbuf_sg() will fail again on the retry, this
>     time fatally, I think?
>
> I can't see any reason to re-do all the stuff using m_pullup(), and
> Russell reported that moving the "retry:" fixed his problem, from
> what I understood.

Ah, I had assumed (incorrectly) that the m_pullup()s would all be nops
in this case. It seems the NIC would really like to have all those
things in a single segment, but it is not required, so I agree that
your patch is fine.
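To make the failure mode concrete, here is a rough sketch of the
transmit path in question. This is not the literal if_em.c source:
em_xmit_sketch(), the tx_ring layout, and the tx_map name are
simplified placeholders; it only illustrates why re-running the
pullups after m_defrag() overflows the 32-segment limit.

/*
 * Sketch only -- not the verbatim driver code.
 */
static int
em_xmit_sketch(struct tx_ring *txr, struct mbuf **m_headp)
{
	bus_dma_segment_t segs[EM_MAX_SCATTER];	/* EM_MAX_SCATTER == 32 */
	int error, nsegs, remap = 1;

	/*
	 * Header fixups for TSO/checksum offload.  With the old label
	 * placement ("retry:" above this block), the goto below re-ran
	 * these m_pullup() calls.  m_pullup() on a cluster mbuf
	 * allocates a fresh mbuf and prepends it, so a chain that
	 * m_defrag() had just squeezed into exactly 32 clusters grows
	 * past EM_MAX_SCATTER again -- and with remap already 0, the
	 * second EFBIG is fatal.
	 */
	/* ... m_pullup()-based header fixups go here ... */

retry:	/* new placement: retry only the DMA load, not the pullups */
	error = bus_dmamap_load_mbuf_sg(txr->txtag, txr->tx_map,
	    *m_headp, segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG && remap) {
		struct mbuf *m = m_defrag(*m_headp, M_NOWAIT);
		if (m == NULL) {
			m_freem(*m_headp);
			*m_headp = NULL;
			return (ENOBUFS);
		}
		*m_headp = m;
		remap = 0;	/* allow a single remap attempt */
		goto retry;
	}
	if (error != 0)
		return (error);
	/* ... hand segs[0 .. nsegs - 1] to the hardware ... */
	return (0);
}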
> > It would be interesting to test the following in addition to your
> > change to see if it improves performance further:
> >
> > Index: if_em.c
> > ===================================================================
> > --- if_em.c	(revision 268495)
> > +++ if_em.c	(working copy)
> > @@ -1959,7 +1959,9 @@ retry:
> >  	if (error == EFBIG && remap) {
> >  		struct mbuf *m;
> >
> > -		m = m_defrag(*m_headp, M_NOWAIT);
> > +		m = m_collapse(*m_headp, M_NOWAIT, EM_MAX_SCATTER);
> > +		if (m == NULL)
> > +			m = m_defrag(*m_headp, M_NOWAIT);
> Since a just-under-64K TSO segment barely fits in 32 mbuf clusters,
> I'm at least 99% sure the m_collapse() will fail, but it can't hurt
> to try it. (If it supported 33 or 34 segments, I think m_collapse()
> would have a reasonable chance of success.)
>
> Right now the NFS and krpc code creates 2 small mbufs in front of the
> read/write data clusters, and I think the TCP layer adds another one.
> Even if this was modified to put it all in one cluster, I don't think
> m_collapse() would succeed, since it only copies the data up and
> deletes an mbuf from the chain if it will all fit in the preceding
> one. Since the read/write data clusters are full (except the last
> one), they can't fit in the M_TRAILINGSPACE() of the preceding one
> unless it is empty, from my reading of m_collapse().

Correct, ok.

-- 
John Baldwin
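For readers following the m_collapse() argument above, a paraphrased
sketch of the copy-up test it performs -- adapted from the logic Rick
describes, not the verbatim sys/kern/uipc_mbuf.c source, and
collapse_step_sketch() is a hypothetical helper name:

/*
 * Sketch only: merge the next mbuf's data into the current one when
 * it fits in the current mbuf's unused trailing space.
 */
static int
collapse_step_sketch(struct mbuf *m)
{
	struct mbuf *n = m->m_next;

	/*
	 * A full NFS read/write data cluster has M_TRAILINGSPACE(m)
	 * of 0, so nothing can be copied up and the chain cannot
	 * shrink below the ~33 mbufs Rick counts above.
	 */
	if (n == NULL || n->m_len > M_TRAILINGSPACE(m))
		return (0);		/* no room to merge here */

	bcopy(mtod(n, caddr_t), mtod(m, caddr_t) + m->m_len, n->m_len);
	m->m_len += n->m_len;
	m->m_next = m_free(n);		/* m_free() returns n->m_next */
	return (1);			/* one mbuf removed from chain */
}

From memory, m_collapse() also tries combining an adjacent pair of
mbufs into a freshly allocated cluster, but the same objection
applies: two full clusters cannot fit into one.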