Date: Wed, 29 Jan 2014 22:31:08 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Bryan Venteicher <bryanv@freebsd.org>
Cc: freebsd-net@freebsd.org, J David <j.david.lists@gmail.com>, Garrett Wollman <wollman@freebsd.org>
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <1879662319.18746958.1391052668182.JavaMail.root@uoguelph.ca>
In-Reply-To: <CAGaYwLcDVMA3=1x4hXXVvRojCBewWFZUyZfdiup=jo685+51+A@mail.gmail.com>
Bryan Venteicher wrote:
> On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
> > J David wrote:
> > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem <rmacklem@uoguelph.ca>
> > > wrote:
> > > > Hopefully Garrett and/or you will be able to do some testing of it
> > > > and report back w.r.t. performance gains, etc.
> > >
> > > OK, it has seen light testing.
> > >
> > > As predicted the vtnet drops are eliminated and CPU load is reduced.
> > >
> > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the
> > driver feasible?
> >
>
> I've been busy the last few days, and won't be able to get to any code
> until the weekend.
>
> The current MAX_TX_SEGS value is mostly arbitrary - the implicit limit
> is VIRTIO_MAX_INDIRECT. This value is used in virtqueue.c to allocate
> an array of 'struct vring_desc', which is 16 bytes each, and there is
> some next-power-of-2 rounding going on, so we can make it bigger
> without any real additional memory use.
>
> But also note I do put a MAX_TX_SEGS-sized array of 'struct sglist_seg'
> on the stack, so it cannot be made too big. Even what is currently
> there is probably already pushing what's a Good Idea to put on the
> stack anyway (especially since it is near the bottom of a typically
> pretty deep call stack). I've been meaning to move that to hanging off
> the 'struct vtnet_txq' instead.
>
Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
increasing it from 34 to 35 would be all it takes. However, see below.

> I think all TSO-capable drivers that use m_collapse(..., 32) (and don't
> set if_hw_tsomax) are broken - there look to be several. I was slightly
> on top of my game by using 33, since it appears m_collapse() does not
> touch the pkthdr mbuf (I think that was my thinking 3 years ago, and it
> seems to be the case from a quick glance at the code). I think drivers
> using m_defrag(..., 32) are OK, but that function can be much, much
> more expensive.
>
Well, even m_defrag(..M_NOWAIT..) can fail, and then it means a TCP
layer timeout/retransmit. If the allocator is constipated, this could
be pretty much a trainwreck, I think.

I also agree that m_defrag() adds a lot of overhead, but calling
m_collapse() a lot will be quite a bit of overhead as well. (Also, I
don't think that m_collapse() is more likely to fail, since it only
copies data into the previous mbuf when the entire mbuf that follows
will fit and the copy is allowed. I'd assume that a ref-count-copied
mbuf cluster doesn't allow this copy, or things would be badly broken.)

Bottom line, I think calling either m_collapse() or m_defrag() should
be considered a "last resort". Maybe the driver could reduce the size
of if_hw_tsomax whenever it finds it needs to call one of these
functions, to try and avoid a re-occurrence? (A rough sketch of what I
mean is at the end of this message.)

rick

> > However, I do suspect we'll be putting a refined version of the patch
> > in head someday (maybe April, sooner would have to be committed by
> > someone else). I suspect that Garrett's code for server read will
> > work well and I'll cobble something together for server readdir and
> > client write.
> >
> > > The performance is also improved:
> > >
> > > Test     Before      After
> > > SeqWr      1506       7461
> > > SeqRd       566     192015
> > > RndRd       602     218730
> > > RndWr        44      13972
> > >
> > > All numbers in kiB/sec.
> > >
> > If you get the chance, you can try a few tunables on the server.
> > vfs.nfsd.fha.enable=0
> > - ken@ found that FHA was necessary for ZFS exports, to avoid out of
> > order reads from confusing ZFS's sequential reading heuristic.
> > However, FHA also means that all readaheads for a file are serialized
> > with the reads for the file (same fh -> same nfsd thread). Somehow,
> > it seems to me that doing reads concurrently in the server (given
> > shared vnode locks) could be a good thing.
> > --> I wonder what the story is for UFS?
> > So, it would be interesting to see what disabling FHA does for the
> > sequential read test.
> >
> > I think I already mentioned the DRC cache ones:
> > vfs.nfsd.tcphighwater=100000
> > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300)
> >
> > Good to see some progress, rick
> > ps: Daniel reports that he will be able to test the patch this
> > weekend, to see if it fixes his problem that required TSO to be
> > disabled, so we'll wait and see.
> >
> > > There were initially still some problems with lousy hostcache
> > > values on the client after the test, which is what causes the iperf
> > > performance to tank after the NFS test, but after a reboot of both
> > > sides and fresh retest, I haven't reproduced that again. If it
> > > comes back, I'll try to figure out what's going on.
> > >
> > Hopefully a networking type might know what is going on, because this
> > is way out of my area of expertise.
> >
> > > But this definitely looks like a move in the right direction.
> > >
> > > Thanks!
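
ps: Bryan, just so I'm sure I followed your point about the
'struct sglist_seg' array, below is a rough, untested sketch of what I
picture "hanging it off the 'struct vtnet_txq'" to look like. The
struct, field and function names here are only guesses for
illustration; the sglist(9) calls are the only part taken from the
real API.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <sys/sglist.h>

struct vtnet_txq_sketch {
	/* ... the existing per-queue members would stay here ... */
	struct sglist	*vtntx_sg;	/* preallocated segment array */
};

/* Done once at queue setup time instead of on every transmit. */
static int
vtnet_txq_sg_alloc(struct vtnet_txq_sketch *txq, int max_tx_segs)
{

	txq->vtntx_sg = sglist_alloc(max_tx_segs, M_NOWAIT);
	return (txq->vtntx_sg == NULL ? ENOMEM : 0);
}

/* Per packet: reuse the preallocated array rather than a stack array. */
static int
vtnet_txq_sg_load(struct vtnet_txq_sketch *txq, struct mbuf *m)
{

	sglist_reset(txq->vtntx_sg);
	return (sglist_append_mbuf(txq->vtntx_sg, m));
}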
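
And here is the rough sketch I referred to above for treating
m_collapse()/m_defrag() as a last resort and backing off if_hw_tsomax
afterwards. Again untested and only illustrative; the function name,
the backoff amount and the assumption that the driver can simply write
if_hw_tsomax directly are all mine, not anything taken from an existing
driver.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>
#include <netinet/in.h>
#include <netinet/in_systm.h>
#include <netinet/ip.h>

static int
drv_tx_prepare(struct ifnet *ifp, struct mbuf **m_head, int max_segs)
{
	struct mbuf *m, *n;
	int nsegs;

	/* Count the mbufs; each usually contributes one segment. */
	nsegs = 0;
	for (n = *m_head; n != NULL; n = n->m_next)
		nsegs++;
	if (nsegs <= max_segs)
		return (0);	/* common case: no copying at all */

	/* Last resort: try the cheaper m_collapse() before m_defrag(). */
	m = m_collapse(*m_head, M_NOWAIT, max_segs);
	if (m == NULL)
		m = m_defrag(*m_head, M_NOWAIT);
	if (m == NULL) {
		m_freem(*m_head);
		*m_head = NULL;
		return (ENOBUFS);	/* TCP will time out and retransmit */
	}
	*m_head = m;

	/*
	 * Back off the advertised TSO limit a little so the stack is
	 * less likely to hand us such a long chain again (the amount
	 * is an arbitrary guess, purely for illustration).
	 */
	if (ifp->if_hw_tsomax == 0)
		ifp->if_hw_tsomax = IP_MAXPACKET;
	if (ifp->if_hw_tsomax > IP_MAXPACKET / 2)
		ifp->if_hw_tsomax -= MCLBYTES;

	return (0);
}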
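
pps: For anyone who wants to try the server-side knobs mentioned above,
here they are in one place. The values are simply the ones from this
thread; depending on the release, some of them may need to go in
/etc/sysctl.conf or /boot/loader.conf rather than being set live.

vfs.nfsd.fha.enable=0          # disable file-handle affinity for the read test
vfs.nfsd.tcphighwater=100000   # DRC: let the cache grow before trimming
vfs.nfsd.tcpcachetimeo=600     # DRC: seconds (Garrett apparently uses 300)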