Date: Wed, 29 Jan 2014 22:31:08 -0500 (EST)
From: Rick Macklem
To: Bryan Venteicher
Cc: freebsd-net@freebsd.org, J David, Garrett Wollman
Subject: Re: Terrible NFS performance under 9.2-RELEASE?

Bryan Venteicher wrote:
> On Wed, Jan 29, 2014 at 5:01 PM, Rick Macklem wrote:
>
> > J David wrote:
> > > On Tue, Jan 28, 2014 at 7:32 PM, Rick Macklem wrote:
> > > > Hopefully Garrett and/or you will be able to do some testing of
> > > > it and report back w.r.t. performance gains, etc.
> > >
> > > OK, it has seen light testing.
> > >
> > > As predicted, the vtnet drops are eliminated and CPU load is
> > > reduced.
> > >
> > Ok, that's good news. Bryan, is increasing VTNET_MAX_TX_SEGS in the
> > driver feasible?
> >
>
> I've been busy the last few days, and won't be able to get to any
> code until the weekend.
>
> The current MAX_TX_SEGS value is mostly arbitrary - the implicit
> limit is VIRTIO_MAX_INDIRECT. This value is used in virtqueue.c to
> allocate an array of 'struct vring_desc', which is 16 bytes, and
> there is some next-power-of-2 rounding going on, so we can make it
> bigger without any real additional memory usage.
>
> But also note I do put a MAX_TX_SEGS sized array of 'struct
> sglist_segs' on the stack, so it cannot be made too big. Even what is
> currently there is probably already pushing what's a Good Idea to put
> on the stack anyways (especially since it is near the bottom of a
> typically pretty deep call stack). I've been meaning to move that to
> hanging off the 'struct vtnet_txq' instead.
>
Well, NFS hands TCP a list of 34 mbufs. If TCP only adds one, then
increasing it from 34 to 35 would be all it takes. However, see below.
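(Just to make the "hang it off the queue" idea concrete, an untested
sketch follows. The names here - txq_sketch, txs_sg, TXQ_SKETCH_MAX_SEGS
- are made up for illustration and are not the actual vtnet code; the
sglist(9) calls are just one way it could be done.)

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/errno.h>
        #include <sys/malloc.h>
        #include <sys/mbuf.h>
        #include <sys/sglist.h>

        /* Example value; still implicitly bounded by VIRTIO_MAX_INDIRECT. */
        #define TXQ_SKETCH_MAX_SEGS     64

        struct txq_sketch {                     /* stand-in for 'struct vtnet_txq' */
                struct sglist   *txs_sg;        /* allocated once at attach time */
                /* ... the queue's existing fields ... */
        };

        static int
        txq_sketch_attach(struct txq_sketch *txq)
        {
                /* sglist_alloc() sizes the segment array; M_WAITOK is fine at attach. */
                txq->txs_sg = sglist_alloc(TXQ_SKETCH_MAX_SEGS, M_WAITOK);
                return (0);
        }

        static int
        txq_sketch_encap(struct txq_sketch *txq, struct mbuf *m)
        {
                sglist_reset(txq->txs_sg);      /* reuse the preallocated segments */
                /* Returns EFBIG if the chain needs more than TXQ_SKETCH_MAX_SEGS segments. */
                return (sglist_append_mbuf(txq->txs_sg, m));
        }

Allocating the sglist at attach time keeps both the large array and any
per-packet allocation out of the hot (and already deep) transmit path.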
> I think all TSO capable drivers that use m_collapse(..., 32) (and
> don't set if_hw_tsomax) are broken - there look to be several. I was
> slightly on top of my game by using 33, since it appears m_collapse()
> does not touch the pkthdr mbuf (I think that was my thinking 3 years
> ago, and it seems to be the case from a quick glance at the code). I
> think drivers using m_defrag(..., 32) are OK, but that function can
> be much, much more expensive.
>
Well, even m_defrag(..M_NOWAIT..) can fail, and then it means a TCP
layer timeout/retransmit. If the allocator is constipated, this could
be pretty much a trainwreck, I think.

I also agree that m_defrag() adds a lot of overhead, but calling
m_collapse() a lot will be quite a bit of overhead as well. (Also, I
don't think that m_collapse() is more likely to fail, since it only
copies data to the previous mbuf when the entire mbuf that follows
will fit and it's allowed. I'd assume that a ref-count-copied mbuf
cluster doesn't allow this copy, or things would be badly broken.)

Bottom line, I think calling either m_collapse() or m_defrag() should
be considered a "last resort". Maybe the driver could reduce the size
of if_hw_tsomax whenever it finds it needs to call one of these
functions, to try to avoid a recurrence?
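(A rough, untested sketch of that "last resort" flow with the
if_hw_tsomax back-off bolted on. The function name, the decrement, and
the floor are invented for illustration, and 'tsomaxp' merely stands in
for a pointer to the interface's if_hw_tsomax.)

        #include <sys/param.h>
        #include <sys/systm.h>
        #include <sys/errno.h>
        #include <sys/malloc.h>
        #include <sys/mbuf.h>
        #include <sys/sglist.h>

        static int
        tso_encap_sketch(struct sglist *sg, struct mbuf **m_head, int max_segs,
            u_int *tsomaxp)
        {
                struct mbuf *m = *m_head;
                int error;

                sglist_reset(sg);
                error = sglist_append_mbuf(sg, m);
                if (error != EFBIG)
                        return (error);

                /* Too many segments: try the cheaper m_collapse() first. */
                m = m_collapse(*m_head, M_NOWAIT, max_segs);
                if (m == NULL) {
                        /* Both leave the original chain intact on failure. */
                        m = m_defrag(*m_head, M_NOWAIT);
                }
                if (m == NULL)
                        return (ENOBUFS);  /* caller frees/requeues; TCP retransmits */
                *m_head = m;

                /*
                 * Shrink the advertised TSO limit a little so the stack hands
                 * us shorter chains next time (arbitrary decrement and floor).
                 */
                if (*tsomaxp > 32 * 1024)
                        *tsomaxp -= 4096;

                sglist_reset(sg);
                return (sglist_append_mbuf(sg, m));
        }

How aggressively to walk if_hw_tsomax down (and whether to ever walk it
back up) is a separate question; the point is just that the copy
routines stay off the common path.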
rick

> > However, I do suspect we'll be putting a refined version of the
> > patch in head someday (maybe April; sooner would have to be
> > committed by someone else). I suspect that Garrett's code for
> > server read will work well, and I'll cobble something together for
> > server readdir and client write.
> >
> > > The performance is also improved:
> > >
> > > Test    Before     After
> > > SeqWr     1506      7461
> > > SeqRd      566    192015
> > > RndRd      602    218730
> > > RndWr       44     13972
> > >
> > > All numbers in kiB/sec.
> > >
> > If you get the chance, you can try a few tunables on the server.
> >
> > vfs.nfsd.fha.enable=0
> > - ken@ found that FHA was necessary for ZFS exports, to avoid out
> >   of order reads from confusing ZFS's sequential reading heuristic.
> >   However, FHA also means that all readaheads for a file are
> >   serialized with the reads for the file (same fh -> same nfsd
> >   thread). Somehow, it seems to me that doing reads concurrently in
> >   the server (given shared vnode locks) could be a good thing.
> >   --> I wonder what the story is for UFS?
> > So, it would be interesting to see what disabling FHA does for the
> > sequential read test.
> >
> > I think I already mentioned the DRC cache ones:
> > vfs.nfsd.tcphighwater=100000
> > vfs.nfsd.tcpcachetimeo=600 (actually I think Garrett uses 300)
> >
> > Good to see some progress, rick
> >
> > ps: Daniel reports that he will be able to test the patch this
> >     weekend, to see if it fixes his problem that required TSO to be
> >     disabled, so we'll wait and see.
> >
> > > There were initially still some problems with lousy hostcache
> > > values on the client after the test, which is what causes the
> > > iperf performance to tank after the NFS test, but after a reboot
> > > of both sides and a fresh retest, I haven't reproduced that
> > > again. If it comes back, I'll try to figure out what's going on.
> > >
> > Hopefully a networking type might know what is going on, because
> > this is way out of my area of expertise.
> >
> > > But this definitely looks like a move in the right direction.
> > >
> > > Thanks!