From owner-freebsd-net@FreeBSD.ORG Tue Jan 28 02:14:55 2014
Date: Mon, 27 Jan 2014 18:14:50 -0800
From: John-Mark Gurney
To: Rick Macklem
Cc: freebsd-net@freebsd.org, Adam McDougall
Subject: Re: Terrible NFS performance under 9.2-RELEASE?
Message-ID: <20140128021450.GY13704@funkthat.com>
In-Reply-To: <1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca>
References: <20140128002826.GU13704@funkthat.com>
	<1415339672.17282775.1390872779067.JavaMail.root@uoguelph.ca>

Rick Macklem wrote this message on Mon, Jan 27, 2014 at 20:32 -0500:
> John-Mark Gurney wrote:
> > Rick Macklem wrote this message on Mon, Jan 27, 2014 at 18:47 -0500:
> > > John-Mark Gurney wrote:
> > > > Rick Macklem wrote this message on Sun, Jan 26, 2014 at 21:16
> > > > -0500:
> > > > > Btw, thanks go to Garrett Wollman for suggesting the change to
> > > > > MJUMPAGESIZE clusters.
> > > > >
> > > > > rick
> > > > > ps: If the attachment doesn't make it through and you want the
> > > > > patch, just email me and I'll send you a copy.
> > > >
> > > > The patch looks good, but we probably shouldn't change _readlink..
> > > > The chances of a link being >2k are pretty slim, and the chances
> > > > of the link being >32k are even smaller...
> > >
> > > Yea, I already thought of that, actually. However, see below w.r.t.
> > > NFSv4.
> > >
> > > However, at this point I mostly want to find out if it is the long
> > > mbuf chain that causes problems for TSO-enabled network interfaces.
> >
> > I agree, though a long mbuf chain is more of a driver issue than an
> > NFS issue...
>
> Yes, if my hunch is correct, it is. If my hunch gets verified, I will
> be posting w.r.t. how best to deal with the problem. I suspect a patch
> like this one might serve as a useful work-around while the drivers
> get fixed, if the hunch is correct.

It would be nice to have a way to force such a segment to go out to the
drivers to make debugging/testing drivers easier... I'm not sure the
best way to handle that though...
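In the meantime, even a dumb chain walker would tell us how long the
chains the server actually hands down are... Untested sketch
(mchain_count() is a made-up name; nothing NFS-specific here, just
sys/mbuf.h):

    static int
    mchain_count(const struct mbuf *m)
    {
            int cnt;

            for (cnt = 0; m != NULL; m = m->m_next)
                    cnt++;
            return (cnt);
    }

A driver could printf() that from its transmit path while chasing your
hunch, or m_defrag() the chain when the count exceeds its DMA segment
limit (the limit's name varies per driver, so I won't guess at one).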
> > > > In fact, we might want to switch _readlink to MGET (could be
> > > > conditional upon cnt) so that if it fits in an mbuf we don't
> > > > allocate a cluster for it...
> > >
> > > For NFSv4, what was an RPC for NFSv3 becomes one of several Ops. in
> > > a compound RPC. As such, there is no way to know how much additional
> > > RPC message there will be. So, although the readlink reply won't use
> > > much of the 4K allocation, replies for subsequent Ops. in the
> > > compound certainly could. (Is it more efficient to allocate 4K now
> > > and use part of it for subsequent message reply stuff, or to
> > > allocate additional mbuf clusters later for subsequent stuff, as
> > > required? On a small memory-constrained machine, I suspect the
> > > latter is correct, but for the kind of hardware that has TSO
> > > scatter/gather enabled network interfaces, I'm not so sure. At this
> > > point, I wouldn't even say that using 4K clusters is going to be a
> > > win, and my hunch is that any win wouldn't apply to small
> > > memory-constrained machines.)
> >
> > Though the code that was patched wasn't using any partial buffers,
> > it was always allocating a new buffer... If the code in
> > _read/_readlinks starts using a previous mbuf chain, then obviously
> > things are different and I'd agree, always allocating a 2k/4k
> > cluster makes sense...
>
> Yes, but nd_mb and nd_bpos are set, which means subsequent replies can
> use the remainder of the cluster.

Couldn't we scan the list of replies, find out how much data we need,
and m_getm the space for it all (which will use 4k clusters as
necessary)? Sketch below, after the portability bit.

> Why does it always allocate a new cluster? Well, because the code is
> OLD. It was written for OpenBSD 2.6 and, at that time, I tried to make
> it portable across the BSDen. I'm not so concerned w.r.t. its
> portability now, since no one else is porting it and I don't plan to,
> but I still think it would be nice if it were portable to other BSDen.
> Back when I wrote it, I believe that MCLBYTES was 1K and an entire
> cluster was needed. (To be honest, I found out that FreeBSD's MCLBYTES
> is 2K about 2 days ago, when I started looking at this stuff.)
>
> Could it now look to see if enough bytes (a little over 1K) were
> available in the current cluster and use that? Yes, but it would
> reduce the portability of the code and I don't think it would make a
> measurable difference performance-wise.

Are you sure it would reduce the portability? I can't think of a way it
would... Some code will always need to be written for portability..
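Roughly this, say (untested; assumes m_len is kept current and the
nd_bpos bookkeeping stays in the caller, as the existing code does, and
"len" is the worst-case size of the next reply):

    static struct mbuf *
    reply_space(struct mbuf *mb, int len)
    {
            struct mbuf *m;

            if (M_TRAILINGSPACE(mb) >= len)
                    return (mb);    /* reuse the current cluster */
            /* mbuf + cluster from the paired zone, one allocation */
            m = m_getcl(M_WAITOK, MT_DATA, 0);
            mb->m_next = m;
            return (m);
    }

M_TRAILINGSPACE() exists on all the BSDen as far as I know, so I don't
see where the portability problem comes in.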
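And for the scan-the-replies idea above, something like this (untested;
alloc_reply_chain() and opsizes[] are made up, standing in for whatever
walking the compound's ops would actually produce):

    static struct mbuf *
    alloc_reply_chain(const int *opsizes, int nops)
    {
            int i, totlen;

            for (totlen = 0, i = 0; i < nops; i++)
                    totlen += opsizes[i];   /* worst-case per-op reply */
            /* m_getm2() picks 4k clusters itself when len justifies it,
             * and with M_WAITOK it won't hand back NULL */
            return (m_getm2(NULL, totlen, M_WAITOK, MT_DATA, 0));
    }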
> > > My test server has 256Mbytes of ram and it certainly doesn't show
> > > any improvement (big surprise;-), but it also doesn't show any
> > > degradation for the limited testing I've done.
> >
> > I'm not too surprised, unless you're on a heavy server pushing
> > >200MB/sec, the allocation cost is probably cheap enough that it
> > doesn't show up... going to 4k means immediately half as many mbufs
> > are needed/allocated, and as they are page sized, they don't have the
> > problems of physical memory fragmentation, nor do they have to do an
> > IPI/tlb shoot-down in the case of multipage allocations... (I'm
> > dealing w/ this for geli.)
>
> Yes, Garrett Wollman proposed this and I suspect there might be a
> performance gain for larger systems. He has a more involved patch.
> To be honest, if Garrett is convinced that his patch is of benefit
> performance-wise, I will do a separate posting w.r.t. it and whether
> or not it is appropriate to be committed to head, etc.
>
> > > Again, my main interest at this point is whether reducing the
> > > number of mbufs in the chain fixes the TSO issues. I think
> > > the question of whether or not 4K clusters are a performance
> > > improvement in general is an interesting one that comes later.
> >
> > Another thing I noticed is that we are getting an mbuf and then
> > allocating a cluster... Is there a reason we aren't using something
> > like m_getm or m_getcl? We have a special uma zone that has an
> > mbuf and mbuf cluster already paired, meaning we save some lock
> > operations for each segment allocated...
>
> See above w.r.t. OLD portable code. There was a time when MCLGET()
> wasn't guaranteed to succeed even when M_WAITOK is specified.
> This is also why there is that weird loop in the NFSMCLGET() macro.

Correct, but as you wrapped them in NFS* macros, it doesn't mean you
can't merge the MCLGET w/ NFSMCLGET into a new function that combines
the two... It's just another (not too difficult) wrapper that the
porter has to write... Though apparently portability has been given up
since you use MCLGET directly in nfsserver/nfs_nfsdport.c instead of
NFSMCLGET... Sounds like nfsport.h needs some updating....

> (I think there was a time in FreeBSD's past when allocation was never
> guaranteed and the rest of the code doesn't tolerate a NULL mbuf ptr.
> Something like M_TRYWAIT in old versions of FreeBSD?)

Correct, there was a time when an M_WAITOK allocation could still
return NULL, but that was many years and many releases ago...

> Btw, Garrett Wollman's patch uses m_getm2() to get the mbuf list.

Interestingly, m_getm2 will use 4k clusters as necessary, and in the
_readlink case, do the correct thing... Hmmm...

m_getm2 isn't documented... It was added by andre almost 7 years ago...
It does appear to be a public interface, as ofed, sctp, iscsi and
ng(_tty) all use it, though only sctp appears to use it any differently
than m_getm.. The rest could simply use m_getm instead of m_getm2...
Considering it was committed the day before SCTP was committed, I'm not
too surprised...

P.S. If someone wants to submit a patch to mbuf.9 to update the docs,
that would be helpful... I'll review and commit... m_append is also
undocumented...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."