Date:      Wed, 29 Jan 2014 22:31:08 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Bryan Venteicher <bryanv@freebsd.org>
Cc:        freebsd-net@freebsd.org, J David <j.david.lists@gmail.com>, Garrett Wollman <wollman@freebsd.org>
Subject:   Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID:  <1879662319.18746958.1391052668182.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAGaYwLcDVMA3=1x4hXXVvRojCBewWFZUyZfdiup=jo685+51+A@mail.gmail.com>

Bryan Venteicher wrote:
> On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> 
> > J David wrote:
> > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem
> > > <rmacklem@uoguelph.ca>
> > > wrote:
> > > > Hopefully Garrett and/or you will be able to do some testing of
> > > > it
> > > > and report back w.r.t. performance gains, etc.
> > >
> > > OK, it has seen light testing.
> > >
> > > As predicted the vtnet drops are eliminated and CPU load is
> > > reduced.
> > >
> > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the
> > driver feasible?
> >
> >
> 
> I've been busy the last few days, and won't be able to get to any
> code
> until the weekend.
> 
> The current MAX_TX_SEGS value is mostly arbitrary - the implicit
> limit is VIRTIO_MAX_INDIRECT. That value is used in virtqueue.c to
> allocate an array of 'struct vring_desc' entries, each 16 bytes, and
> there is some next-power-of-2 rounding going on, so we can make it
> bigger without any real additional memory usage.
> 
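For a rough sense of why that rounding leaves headroom (the 16-byte
descriptor size comes from the paragraph above; the exact allocation
call in virtqueue.c and malloc(9)'s power-of-two buckets are my
assumption here):

    sizeof(struct vring_desc)        = 16 bytes
    34 indirect descriptors          = 34 * 16 = 544 bytes requested
    next power-of-2 bucket           = 1024 bytes
    descriptors that bucket holds    = 1024 / 16 = 64

so anything up to roughly 64 TX segments should fit in the same
per-descriptor-list allocation.
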
> But also note I do put a MAX_TX_SEGS-sized array of 'struct
> sglist_seg' on the stack, so it cannot be made too big. Even what is
> currently there is probably already pushing what's a Good Idea to put
> on the stack anyway (especially since it is near the bottom of a
> typically pretty deep call stack). I've been meaning to move that to
> hang off 'struct vtnet_txq' instead.
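
A minimal sketch of what hanging those segments off the txq might look
like (an illustration only; the struct name, field names, the constant
value and vtnet_txq_sg_init() are placeholders, not the driver's actual
code):

    #include <sys/param.h>
    #include <sys/sglist.h>

    #define VTNET_MAX_TX_SEGS   64      /* illustrative value */

    struct vtnet_txq_sketch {
            struct sglist     vtntx_sg;                       /* sglist header */
            struct sglist_seg vtntx_segs[VTNET_MAX_TX_SEGS];  /* was on the stack */
            /* ... existing vtnet_txq members ... */
    };

    /*
     * Point the sglist at the embedded segment array once at queue
     * setup, instead of building a fresh on-stack copy per packet.
     */
    static void
    vtnet_txq_sg_init(struct vtnet_txq_sketch *txq)
    {
            sglist_init(&txq->vtntx_sg, VTNET_MAX_TX_SEGS, txq->vtntx_segs);
    }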
> 
Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
increasing it from 34 to 35 would be all it takes. However, see below.
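
For concreteness, one way those counts can be accounted for (my
reconstruction; the thread itself only gives the 34/35 totals):

    64K NFS I/O / 2K mbuf clusters        = 32 data mbufs
    + RPC/NFS header mbufs (2 here)       = 34 mbufs handed to TCP
    + 1 mbuf prepended for TCP/IP headers = 35 segments seen by the driver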

> I think all TSO capable drivers that use m_collapse(..., 32) (and
> don't set if_hw_tsomax) are broken - there look to be several. I was
> slightly on top of my game by using 33, since it appears m_collapse()
> does not touch the pkthdr mbuf (I think that was my thinking 3 years
> ago, and a quick glance at the code seems to confirm it). I think
> drivers using m_defrag(..., 32) are OK, but that function can be
> much, much more expensive.
> 
Well, even m_defrag(..M_NOWAIT..) can fail, and then it means a TCP
layer timeout/retransmit. If the allocator is constipated, this could
be pretty much a train wreck, I think.

I also agree that m_defrag() adds a lot of overhead, but calling
m_collapse() frequently adds quite a bit of overhead as well. (Also,
I don't think that m_collapse() is more likely to fail, since it
only copies data into the previous mbuf when the entire mbuf that
follows will fit and the copy is allowed. I'd assume that a
ref-count-copied mbuf cluster doesn't allow this copy, or things
would be badly broken.)

Bottom line, I think calling either m_collapse() or m_defrag()
should be considered a "last resort".
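
As a sketch of that ordering (placeholder txq/sg names and segment
limit; not vtnet's actual transmit path, and a real driver would also
handle the busdma/virtqueue side):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/errno.h>
    #include <sys/mbuf.h>
    #include <sys/sglist.h>

    #define MAX_TX_SEGS     64              /* placeholder limit */

    struct txq {
            struct sglist   *sg;            /* pre-sized to MAX_TX_SEGS entries */
    };

    /*
     * Build the scatter/gather list straight from the mbuf chain and
     * only compact the chain when it genuinely has too many segments.
     */
    static int
    txq_load_mbuf(struct txq *txq, struct mbuf **m_head)
    {
            struct mbuf *m;

            sglist_reset(txq->sg);
            if (sglist_append_mbuf(txq->sg, *m_head) == 0)
                    return (0);             /* common case: chain already fits */

            /* Cheap fallback: merge adjacent mbufs in place. */
            m = m_collapse(*m_head, M_NOWAIT, MAX_TX_SEGS - 1);
            if (m != NULL) {
                    *m_head = m;
                    sglist_reset(txq->sg);
                    if (sglist_append_mbuf(txq->sg, m) == 0)
                            return (0);
            }

            /* Last resort: full copy into fresh clusters. */
            m = m_defrag(*m_head, M_NOWAIT);
            if (m == NULL) {
                    m_freem(*m_head);       /* give up; TCP will retransmit */
                    *m_head = NULL;
                    return (ENOBUFS);
            }
            *m_head = m;
            sglist_reset(txq->sg);
            return (sglist_append_mbuf(txq->sg, m));
    }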

Maybe the driver could reduce if_hw_tsomax whenever it finds it needs
to call one of these functions, to try to avoid a recurrence?
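
A tiny sketch of that idea, assuming the if_hw_tsomax ifnet field
mentioned earlier in the thread (the back-off step and the helper name
are invented, and whether established connections pick the new value
up promptly is a separate question the thread doesn't settle):

    #include <sys/param.h>
    #include <sys/socket.h>
    #include <net/if.h>
    #include <net/if_var.h>

    /*
     * After a transmit that needed m_collapse()/m_defrag(), shrink the
     * advertised TSO limit by one 2K cluster so TCP hands down slightly
     * shorter chains next time.
     */
    static void
    xx_tso_backoff(struct ifnet *ifp)
    {
            if (ifp->if_hw_tsomax > 2 * MCLBYTES)
                    ifp->if_hw_tsomax -= MCLBYTES;
    }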

rick

> 
> > However, I do suspect we'll be putting a refined version of the
> > patch in head someday (maybe April; sooner would have to be
> > committed by someone else). I suspect that Garrett's code for
> > server read will work well, and I'll cobble something together
> > for server readdir and client write.
> >
> > > The performance is also improved:
> > >
> > > Test     Before     After
> > > SeqWr      1506      7461
> > > SeqRd       566    192015
> > > RndRd       602    218730
> > > RndWr        44     13972
> > >
> > > All numbers are in KiB/sec.
> > >
> > If you get the chance, you can try a few tunables on the server.
> > vfs.nfsd.fha.enable=0
> > - ken@ found that FHA was necessary for ZFS exports, to keep
> >   out-of-order reads from confusing ZFS's sequential read heuristic.
> > However, FHA also means that all readaheads for a file are serialized
> > with the reads for that file (same fh -> same nfsd thread). Somehow,
> > it seems to me that doing reads concurrently in the server (given
> > shared vnode locks) could be a good thing.
> > --> I wonder what the story is for UFS?
> > So, it would be interesting to see what disabling FHA does for the
> > sequential read test.
> >
> > I think I already mentioned the DRC cache ones:
> > vfs.nfsd.tcphighwater=100000
> > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300)
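
Those two are plain sysctls, so besides sysctl(8) they can also be set
from a small program if someone wants to script the comparison (a
sketch; the knob names and values are the ones listed above, and both
are assumed to be int-sized):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <err.h>

    int
    main(void)
    {
            int highwater = 100000;         /* vfs.nfsd.tcphighwater */
            int timeo = 600;                /* vfs.nfsd.tcpcachetimeo (Garrett: 300) */

            if (sysctlbyname("vfs.nfsd.tcphighwater", NULL, NULL,
                &highwater, sizeof(highwater)) == -1)
                    err(1, "vfs.nfsd.tcphighwater");
            if (sysctlbyname("vfs.nfsd.tcpcachetimeo", NULL, NULL,
                &timeo, sizeof(timeo)) == -1)
                    err(1, "vfs.nfsd.tcpcachetimeo");
            return (0);
    }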
> >
> > Good to see some progress, rick
> > ps: Daniel reports that he will be able to test the patch this
> >     weekend, to see if it fixes his problem that required TSO
> >     to be disabled, so we'll wait and see.
> >
> > > There were initially still some problems with lousy hostcache
> > > values on the client after the test, which is what caused the
> > > iperf performance to tank after the NFS test, but after a reboot
> > > of both sides and a fresh retest, I haven't reproduced that
> > > again. If it comes back, I'll try to figure out what's going on.
> > >
> > Hopefully a networking type might know what is going on, because
> > this
> > is way out of my area of expertise.
> >
> > > But this definitely looks like a move in the right direction.
> > >
> > > Thanks!
> >


