From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 01:33:01 2014
Date: Mon, 27 Jan 2014 20:32:59 -0500 (EST)
From: Rick Macklem
To: John-Mark Gurney
Cc: freebsd-net@freebsd.org, Adam McDougall
Message-ID: <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca>
In-Reply-To: <20140128002826.GU13704@funkthat.com>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
List-Id: Networking and TCP/IP with FreeBSD

John-Mark Gurney wrote:
> Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500:
> > John-Mark Gurney wrote:
> > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16
> > > -0500:
> > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > MJUMPAGESIZE clusters.
> > > >
> > > > rick
> > > > ps: If the attachment doesn't make it through and you want the
> > > > patch, just email me and I'll send you a copy.
> > >
> > > The patch looks good, but we probably shouldn't change _readlink..
> > > The chances of a link being >2k are pretty slim, and the chances
> > > of the link being >32k are even smaller...
> > >
> > Yea, I already thought of that, actually. However, see below w.r.t.
> > NFSv4.
> >
> > However, at this point I mostly want to find out if it is the long
> > mbuf chain that causes problems for TSO enabled network interfaces.
>
> I agree, though a long mbuf chain is more of a driver issue than an
> NFS issue...
>
Yes, if my hunch is correct, it is. If my hunch gets verified, I will
be posting w.r.t. how best to deal with the problem. I suspect a patch
like this one might serve as a useful work-around while the drivers
get fixed, if the hunch is correct.
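To make the hunch concrete: a TSO capable NIC can only scatter/gather
a fixed number of segments for each transmit, and a 64K NFS reply built
out of 2K clusters is a chain of 30+ mbufs. I suspect the drivers end
up doing something like the following (drv_check_tso_chain and
DRV_MAX_TSO_SEGS are names I made up for illustration; this isn't
from any real driver):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>

    #define DRV_MAX_TSO_SEGS 32     /* typical hardware S/G limit */

    static int
    drv_check_tso_chain(struct mbuf **m_head)
    {
            struct mbuf *m, *n;
            int nsegs;

            nsegs = 0;
            for (n = *m_head; n != NULL; n = n->m_next)
                    nsegs++;
            if (nsegs > DRV_MAX_TSO_SEGS) {
                    /*
                     * Too many segments for the hardware to DMA in
                     * one TSO send; copy everything into a new,
                     * shorter chain. This is the slow path that a
                     * shorter chain from the NFS layer would avoid.
                     */
                    m = m_defrag(*m_head, M_NOWAIT);
                    if (m == NULL)
                            return (ENOBUFS);
                    *m_head = m;
            }
            return (0);
    }

Using MJUMPAGESIZE clusters roughly halves the chain length, which is
why the patch may serve as a work-around even though the drivers are
where the real fix belongs.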
> > > In fact, we might want to switch _readlink to MGET (could be
> > > conditional upon cnt) so that if it fits in an mbuf we don't
> > > allocate a cluster for it...
> > >
> > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in
> > a compound RPC. As such, there is no way to know how much additional
> > RPC message there will be. So, although the readlink reply won't use
> > much of the 4K allocation, replies for subsequent Ops. in the
> > compound certainly could. (Is it more efficient to allocate 4K now
> > and use part of it for subsequent message reply stuff, or to
> > allocate additional mbuf clusters later for subsequent stuff, as
> > required? On a small memory constrained machine, I suspect the
> > latter is correct, but for the kind of hardware that has TSO
> > scatter/gather enabled network interfaces, I'm not so sure. At this
> > point, I wouldn't even say that using 4K clusters is going to be a
> > win, and my hunch is that any win wouldn't apply to small memory
> > constrained machines.)
>
> Though the code that was patched wasn't using any partial buffers,
> it was always allocating a new buffer... If the code in
> _read/_readlinks starts using a previous mbuf chain, then obviously
> things are different and I'd agree, always allocating a 2k/4k
> cluster makes sense...
>
Yes, but nd_mb and nd_bpos are set, which means subsequent replies can
use the remainder of the cluster.
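To spell that out: nd_mb points at the current mbuf in the reply chain
and nd_bpos at the next free byte within it, so the builder for each
subsequent Op. just keeps appending. Boiled way down, the pattern is
something like this (nd_mb and nd_bpos are the real field names from
struct nfsrv_descript; the stripped-down struct and reply_build()
below are invented for illustration and aren't the actual
nfsm_build() code):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>

    struct nfsrv_descript_sketch {
            struct mbuf     *nd_mb;   /* current mbuf in reply chain */
            char            *nd_bpos; /* next free byte in nd_mb */
    };

    static void *
    reply_build(struct nfsrv_descript_sketch *nd, int siz)
    {
            struct mbuf *m, *n;
            void *p;

            m = nd->nd_mb;
            if (M_TRAILINGSPACE(m) < siz) {
                    /* Remainder exhausted; chain on a new cluster. */
                    n = m_getcl(M_WAITOK, MT_DATA, 0);
                    m->m_next = n;
                    nd->nd_mb = m = n;
                    nd->nd_bpos = mtod(m, char *);
            }
            /* Otherwise later Ops. consume the same 2K/4K cluster. */
            p = nd->nd_bpos;
            nd->nd_bpos += siz;
            m->m_len += siz;
            return (p);
    }

So the part of a 4K readlink allocation that readlink itself doesn't
use isn't necessarily wasted; the replies for the rest of the compound
land in it.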
Why does it always allocate a new cluster? Well, because the code is
OLD. It was written for OpenBSD 2.6 and, at that time, I tried to make
it portable across the BSDen. I'm not so concerned w.r.t. its
portability now, since no one else is porting it and I don't plan to,
but I still think it would be nice if it were portable to other BSDen.
Back when I wrote it, I believe that MCLBYTES was 1K and an entire
cluster was needed. (To be honest, I found out that FreeBSD's MCLBYTES
is 2K about 2 days ago, when I started looking at this stuff.)

Could it now look to see if enough bytes (a little over 1K) are
available in the current cluster and use that? Yes, but it would
reduce the portability of the code and I don't think it would make a
measurable difference performance-wise.

> > My test server has 256Mbytes of ram and it certainly doesn't show
> > any improvement (big surprise;-), but it also doesn't show any
> > degradation for the limited testing I've done.
>
> I'm not too surprised, unless you're on a heavy server
> pushing >200MB/sec, the allocation cost is probably cheap enough
> that it doesn't show up... going to 4k means immediately half as
> many mbufs are needed/allocated, and as they are page sized, don't
> have the problems of physical memory fragmentation, nor do they
> have to do an IPI/tlb shoot down in the case of multipage
> allocations... (I'm dealing w/ this for geli.)
>
Yes, Garrett Wollman proposed this and I suspect there might be a
performance gain for larger systems. He has a more involved patch.
To be honest, if Garrett is convinced that his patch is of benefit
performance-wise, I will do a separate posting w.r.t. it and whether
or not it is appropriate to be committed to head, etc.

> > Again, my main interest at this point is whether reducing the
> > number of mbufs in the chain fixes the TSO issues. I think the
> > question of whether or not 4K clusters are a performance
> > improvement in general is an interesting one that comes later.
>
> Another thing I noticed is that we are getting an mbuf and then
> allocating a cluster... Is there a reason we aren't using something
> like m_getm or m_getcl? We have a special uma zone that has
> mbuf and mbuf cluster already paired meaning we save some lock
> operations for each segment allocated...
>
See above w.r.t. OLD portable code. There was a time when MCLGET()
wasn't guaranteed to succeed even when M_WAITOK is specified. This is
also why there is that weird loop in the NFSMCLGET() macro. (I think
there was a time in FreeBSD's past when allocation was never
guaranteed and the rest of the code doesn't tolerate a NULL mbuf ptr.
Something like M_TRYWAIT in old versions of FreeBSD?)

Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list.

rick
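ps: For anyone following along who hasn't used these, the difference
John-Mark is pointing at looks roughly like this (a sketch for
illustration only, not code from either patch):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/malloc.h>
    #include <sys/mbuf.h>

    /*
     * Contrast of the three allocation paths (demo only; each
     * result is freed immediately).
     */
    static struct mbuf *
    get_reply_chain(int len)
    {
            struct mbuf *m;

            /*
             * The OLD portable way: two allocations, two sets of
             * lock operations.
             */
            MGET(m, M_WAITOK, MT_DATA);
            MCLGET(m, M_WAITOK);
            if ((m->m_flags & M_EXT) == 0) {
                    /*
                     * The cluster didn't attach; this is the case
                     * the weird loop in NFSMCLGET() retries on.
                     */
                    m_free(m);
                    return (NULL);
            }
            m_freem(m);

            /* FreeBSD's paired uma zone: mbuf + cluster in one go. */
            m = m_getcl(M_WAITOK, MT_DATA, 0);
            m_freem(m);

            /*
             * What Garrett's patch does: m_getm2() builds an entire
             * chain with enough space for len bytes.
             */
            return (m_getm2(NULL, len, M_WAITOK, MT_DATA, 0));
    }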