Date: Wed, 29 Jan 2014 22:31:08 -0500 (EST)
From: Rick Macklem
To: Bryan Venteicher
Cc: freebsd-net@freebsd.org, J David, Garrett Wollman
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

Bryan Venteicher wrote:
> On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem wrote:
>
> > J David wrote:
> > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem wrote:
> > > > Hopefully Garrett and/or you will be able to do some testing of
> > > > it and report back w.r.t. performance gains, etc.
> > >
> > > OK, it has seen light testing.
> > >
> > > As predicted, the vtnet drops are eliminated and CPU load is
> > > reduced.
> > >
> > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the
> > driver feasible?
> >
>
> I've been busy the last few days, and won't be able to get to any
> code until the weekend.
>
> The current MAX_TX_SEGS value is mostly arbitrary - the implicit
> limit is VIRTIO_MAX_INDIRECT. This value is used in virtqueue.c to
> allocate an array of 'struct vring_desc', which is 16 bytes, and
> there is some next-power-of-2 rounding going on, so we can make it
> bigger without any real additional memory usage.
>
> But also note I do put a MAX_TX_SEGS sized array of 'struct
> sglist_segs' on the stack, so it cannot be made too big. Even what is
> currently there is probably already pushing what's a Good Idea to put
> on the stack anyways (especially since it is near the bottom of a
> typically pretty deep call stack). I've been meaning to move that to
> hanging off the 'struct vtnet_txq' instead.
>
Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
increasing it from 34 to 35 would be all it takes. However, see below.
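(Just to make the "hang it off the queue" idea concrete, an untested
sketch follows. The names here - txq_sketch, txs_sg, TXQ_SKETCH_MAX_SEGS
- are made up for illustration and are not the actual vtnet code; the
sglist(9) calls are just one way it could be done.)

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/errno.h>
        #include <sys/malloc.h>
        #include <sys/mbuf.h>
        #include <sys/sglist.h>

        /* Example value; still implicitly bounded by VIRTIO_MAX_INDIRECT. */
        #define TXQ_SKETCH_MAX_SEGS     64

        struct txq_sketch {                     /* stand-in for 'struct vtnet_txq' */
                struct sglist   *txs_sg;        /* allocated once at attach time */
                /* ... the queue's existing fields ... */
        };

        static int
        txq_sketch_attach(struct txq_sketch *txq)
        {
                /* sglist_alloc() sizes the segment array; M_WAITOK is fine at attach. */
                txq->txs_sg = sglist_alloc(TXQ_SKETCH_MAX_SEGS, M_WAITOK);
                return (0);
        }

        static int
        txq_sketch_encap(struct txq_sketch *txq, struct mbuf *m)
        {
                sglist_reset(txq->txs_sg);      /* reuse the preallocated segments */
                /* Returns EFBIG if the chain needs more than TXQ_SKETCH_MAX_SEGS segments. */
                return (sglist_append_mbuf(txq->txs_sg, m));
        }

Allocating the sglist at attach time keeps both the large array and any
per-packet allocation out of the hot (and already deep) transmit path.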
> I think all TSO capable drivers that use m_collapse(..., 32) (and
> don't set if_hw_tsomax) are broken - there look to be several. I was
> slightly on top of my game by using 33, since it appears m_collapse()
> does not touch the pkthdr mbuf (I think that was my thinking 3 years
> ago, and it seems to be the case from a quick glance at the code). I
> think drivers using m_defrag(..., 32) are OK, but that function can
> be much, much more expensive.
>
Well, even m_defrag(..M_NOWAIT..) can fail, and then it means a TCP
layer timeout/retransmit. If the allocator is constipated, this could
be pretty much a trainwreck, I think.

I also agree that m_defrag() adds a lot of overhead, but calling
m_collapse() a lot will be quite a bit of overhead as well. (Also, I
don't think that m_collapse() is more likely to fail, since it only
copies data to the previous mbuf when the entire mbuf that follows
will fit and it's allowed. I'd assume that a ref-count-copied mbuf
cluster doesn't allow this copy, or things would be badly broken.)

Bottom line, I think calling either m_collapse() or m_defrag() should
be considered a "last resort". Maybe the driver could reduce the size
of if_hw_tsomax whenever it finds it needs to call one of these
functions, to try to avoid a recurrence?
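(A rough, untested sketch of that "last resort" flow with the
if_hw_tsomax back-off bolted on. The function name, the decrement, and
the floor are invented for illustration, and 'tsomaxp' merely stands in
for a pointer to the interface's if_hw_tsomax.)

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/errno.h>
        #include <sys/malloc.h>
        #include <sys/mbuf.h>
        #include <sys/sglist.h>

        static int
        tso_encap_sketch(struct sglist *sg, struct mbuf **m_head, int max_segs,
            u_int *tsomaxp)
        {
                struct mbuf *m = *m_head;
                int error;

                sglist_reset(sg);
                error = sglist_append_mbuf(sg, m);
                if (error != EFBIG)
                        return (error);

                /* Too many segments: try the cheaper m_collapse() first. */
                m = m_collapse(*m_head, M_NOWAIT, max_segs);
                if (m == NULL) {
                        /* Both leave the original chain intact on failure. */
                        m = m_defrag(*m_head, M_NOWAIT);
                }
                if (m == NULL)
                        return (ENOBUFS);  /* caller frees/requeues; TCP retransmits */
                *m_head = m;

                /*
                 * Shrink the advertised TSO limit a little so the stack hands
                 * us shorter chains next time (arbitrary decrement and floor).
                 */
                if (*tsomaxp > 32 * 1024)
                        *tsomaxp -= 4096;

                sglist_reset(sg);
                return (sglist_append_mbuf(sg, m));
        }

How aggressively to walk if_hw_tsomax down (and whether to ever walk it
back up) is a separate question; the point is just that the copy
routines stay off the common path.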
rick

> > However, I do suspect we'll be putting a refined version of the
> > patch in head someday (maybe April; sooner would have to be
> > committed by someone else). I suspect that Garrett's code for
> > server read will work well, and I'll cobble something together for
> > server readdir and client write.
> >
> > > The performance is also improved:
> > >
> > > Test    Before     After
> > > SeqWr     1506      7461
> > > SeqRd      566    192015
> > > RndRd      602    218730
> > > RndWr       44     13972
> > >
> > > All numbers in kiB/sec.
> > >
> > If you get the chance, you can try a few tunables on the server.
> >
> > vfs.nfsd.fha.enable=0
> > - ken@ found that FHA was necessary for ZFS exports, to avoid out
> >   of order reads from confusing ZFS's sequential reading heuristic.
> >   However, FHA also means that all readaheads for a file are
> >   serialized with the reads for the file (same fh -> same nfsd
> >   thread). Somehow, it seems to me that doing reads concurrently in
> >   the server (given shared vnode locks) could be a good thing.
> >   --> I wonder what the story is for UFS?
> > So, it would be interesting to see what disabling FHA does for the
> > sequential read test.
> >
> > I think I already mentioned the DRC cache ones:
> > vfs.nfsd.tcphighwater=100000
> > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300)
> >
> > Good to see some progress, rick
> >
> > ps: Daniel reports that he will be able to test the patch this
> >     weekend, to see if it fixes his problem that required TSO to be
> >     disabled, so we'll wait and see.
> >
> > > There were initially still some problems with lousy hostcache
> > > values on the client after the test, which is what causes the
> > > iperf performance to tank after the NFS test, but after a reboot
> > > of both sides and a fresh retest, I haven't reproduced that
> > > again. If it comes back, I'll try to figure out what's going on.
> > >
> > Hopefully a networking type might know what is going on, because
> > this is way out of my area of expertise.
> >
> > > But this definitely looks like a move in the right direction.
> > >
> > > Thanks!