From owner-freebsd-hackers@FreeBSD.ORG Wed Oct 10 22:09:15 2012
Date: Wed, 10 Oct 2012 18:09:07 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Nikolay Denev
Cc: rmacklem@freebsd.org, Garrett Wollman, freebsd-hackers@freebsd.org
Subject: Re: NFS server bottlenecks

Nikolay Denev wrote:
> On Oct 10, 2012, at 3:18 AM, Rick Macklem wrote:
>
> > Nikolay Denev wrote:
> >> On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:
> >>
> >>> Garrett Wollman wrote:
> >>>> < said:
> >>>>
> >>>>>> Simple: just use a separate mutex for each list that a cache entry
> >>>>>> is on, rather than a global lock for everything. This would reduce
> >>>>>> the mutex contention, but I'm not sure how significantly since I
> >>>>>> don't have the means to measure it yet.
> >>>>>>
> >>>>> Well, since the cache trimming is removing entries from the lists, I
> >>>>> don't see how that can be done with a global lock for list updates?
> >>>>
> >>>> Well, the global lock is what we have now, but the cache trimming
> >>>> process only looks at one list at a time, so not locking the list that
> >>>> isn't being iterated over probably wouldn't hurt, unless there's some
> >>>> mechanism (that I didn't see) for entries to move from one list to
> >>>> another. Note that I'm considering each hash bucket a separate
> >>>> "list". (One issue to worry about in that case would be cache-line
> >>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
> >>>> ought to be increased to reduce that.)
> >>>>
> >>> Yea, a separate mutex for each hash list might help. There is also the
> >>> LRU list that all entries end up on, that gets used by the trimming code.
> >>> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> >>> it in a while.)
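
Just to make the per-bucket idea concrete, here's a rough sketch of the
sort of thing being discussed (the structure and function names below are
made up for illustration; this is not the actual nfsrvcache code):

/* Illustration only: per-bucket locking for the DRC hash table. */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define EX_HASHSIZE  500              /* cf. NFSRVCACHE_HASHSIZE, maybe larger */

struct ex_entry {
        LIST_ENTRY(ex_entry) e_hash;  /* per-bucket chain */
        uint32_t             e_xid;
        /* ... rest of the cache entry ... */
};

/*
 * One mutex per bucket, with the bucket padded out to a cache line to
 * address the cache-line contention concern mentioned above.
 */
struct ex_bucket {
        struct mtx            b_lock; /* protects b_head only */
        LIST_HEAD(, ex_entry) b_head;
} __aligned(CACHE_LINE_SIZE);

static struct ex_bucket ex_table[EX_HASHSIZE];

/* Look up an entry; only the one bucket's lock is held. */
static struct ex_entry *
ex_lookup(uint32_t xid)
{
        struct ex_bucket *b = &ex_table[xid % EX_HASHSIZE];
        struct ex_entry *e;

        mtx_lock(&b->b_lock);         /* mtx_init()'d at cache setup time */
        LIST_FOREACH(e, &b->b_head, e_hash)
                if (e->e_xid == xid)
                        break;
        /* Real code would keep the lock (or a reference) while using "e". */
        mtx_unlock(&b->b_lock);
        return (e);
}

The piece this sketch doesn't cover is the global LRU list that the
trimming code walks; that would need either its own lock or to become
per-bucket as well.
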
> >>>
> >>> Also, increasing the hash table size is probably a good idea,
> >>> especially if you reduce how aggressively the cache is trimmed.
> >>>
> >>>>> Only doing it once/sec would result in a very large cache when
> >>>>> bursts of traffic arrive.
> >>>>
> >>>> My servers have 96 GB of memory so that's not a big deal for me.
> >>>>
> >>> This code was originally "production tested" on a server with 1Gbyte,
> >>> so times have changed a bit;-)
> >>>
> >>>>> I'm not sure I see why doing it as a separate thread will improve
> >>>>> things. There are N nfsd threads already (N can be bumped up to 256
> >>>>> if you wish) and having a bunch more "cache trimming threads" would
> >>>>> just increase contention, wouldn't it?
> >>>>
> >>>> Only one cache-trimming thread. The cache trim holds the (global)
> >>>> mutex for much longer than any individual nfsd service thread has any
> >>>> need to, and having N threads doing that in parallel is why it's so
> >>>> heavily contended. If there's only one thread doing the trim, then
> >>>> the nfsd service threads aren't spending time contending on the
> >>>> mutex (it will be held less frequently and for shorter periods).
> >>>>
> >>> I think the little drc2.patch, which will keep the nfsd threads from
> >>> acquiring the mutex and doing the trimming most of the time, might be
> >>> sufficient. I still don't see why a separate trimming thread will be
> >>> an advantage. I'd also be worried that the one cache trimming thread
> >>> won't get the job done soon enough.
> >>>
> >>> When I did production testing on a 1Gbyte server that saw a peak
> >>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> >>> (Although I'd be tempted to say that a server with 1Gbyte is no
> >>> longer relevant, I recently recall someone trying to run FreeBSD
> >>> on an i486, although I doubt they wanted to run the nfsd on it.)
> >>>
> >>>>> The only negative effect I can think of w.r.t. having the nfsd
> >>>>> threads doing it would be a (I believe negligible) increase in RPC
> >>>>> response times (the time the nfsd thread spends trimming the cache).
> >>>>> As noted, I think this time would be negligible compared to disk I/O
> >>>>> and network transit times in the total RPC response time?
> >>>>
> >>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
> >>>> network connectivity, spinning on a contended mutex takes a
> >>>> significant amount of CPU time. (For the current design of the NFS
> >>>> server, it may actually be a win to turn off adaptive mutexes -- I
> >>>> should give that a try once I'm able to do more testing.)
> >>>>
> >>> Have fun with it. Let me know when you have what you think is a good
> >>> patch.
> >>>
> >>> rick
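
Since drc2.patch keeps coming up: the gist of what it tries to do is
roughly the following (again, only a sketch with made-up names, not the
actual patch):

/* Sketch of "skip the global lock unless the cache is over a limit". */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx ex_cachelock;        /* the existing global DRC mutex */
static int ex_cachesize;               /* #entries, kept under ex_cachelock */
static int ex_highwater = 4096;        /* cf. vfs.nfsd.tcphighwater */

/* Removes one LRU entry and decrements ex_cachesize; not shown here. */
static void ex_remove_lru_entry(void);

/*
 * Called from each nfsd thread after it handles an RPC. The common case
 * is an unlocked read of the size that skips the mutex entirely; a stale
 * read just means the trim happens slightly later.
 */
static void
ex_trimcache(void)
{

        if (ex_cachesize <= ex_highwater)
                return;

        mtx_lock(&ex_cachelock);
        while (ex_cachesize > ex_highwater)
                ex_remove_lru_entry();
        mtx_unlock(&ex_cachelock);
}

With the highwater mark set large enough, the nfsd threads fall through
the size check almost every time and don't touch the mutex for trimming,
which is most of what the patch is after.
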
> >>>
> >>>> -GAWollman
> >>>>
> >>
> >> My quest for IOPS over NFS continues :)
> >> So far I'm not able to achieve more than about 3000 8K read requests
> >> over NFS, while the server locally gives much more.
> >> And this is all from a file that is completely in ARC cache, no disk
> >> IO involved.
> >>
> > Just out of curiosity, why do you use 8K reads instead of 64K reads?
> > Since the RPC overhead (including the DRC functions) is per RPC, doing
> > fewer larger RPCs should usually work better. (Sometimes large
> > rsize/wsize values generate too large a burst of traffic for a network
> > interface to handle and then the rsize/wsize has to be decreased to
> > avoid this issue.)
> >
> > And, although this experiment seems useful for testing patches that
> > try and reduce DRC CPU overheads, most "real" NFS servers will be
> > doing disk I/O.
> >
> This is the default blocksize that Oracle and probably most databases
> use. It also uses larger blocks, but for small random reads in OLTP
> applications this is what is used.
>
If the client is doing 8K reads, you could increase the read ahead
("readahead=N", N up to 16) to try and increase the bandwidth. (But if
the CPU is 99% busy, then I don't think it will matter.)
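
For example, from a FreeBSD client the mount options would be something
along these lines (the sizes, hostname and paths are just placeholders):

  # mount -t nfs -o nfsv3,tcp,rsize=65536,wsize=65536,readahead=8 server:/export /mnt

(That's a FreeBSD client mount option; a Linux client tunes its NFS
read-ahead differently, so for your Linux box you'd have to check its
own knobs.)
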
>
> >> I've snatched some sample DTrace script from the net:
> >> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
> >>
> >> And modified it for our new NFS server:
> >>
> >> #!/usr/sbin/dtrace -qs
> >>
> >> fbt:kernel:nfsrvd_*:entry
> >> {
> >>         self->ts = timestamp;
> >>         @counts[probefunc] = count();
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 /
> >> {
> >>         this->delta = (timestamp-self->ts)/1000000;
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 && this->delta > 100 /
> >> {
> >>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 /
> >> {
> >>         @dist[probefunc, "ms"] = quantize(this->delta);
> >>         self->ts = 0;
> >> }
> >>
> >> END
> >> {
> >>         printf("\n");
> >>         printa("function %-20s %@10d\n", @counts);
> >>         printf("\n");
> >>         printa("function %s(), time in %s:%@d\n", @dist);
> >>         printf("\n");
> >>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> >> }
> >>
> >> And here's a sample output from one or two minutes during the run of
> >> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on
> >> NFS mount over 10G ethernet:
> >>
> >> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> >> ^C
> >>
> >> function nfsrvd_access                4
> >> function nfsrvd_statfs               10
> >> function nfsrvd_getattr              14
> >> function nfsrvd_commit               76
> >> function nfsrvd_sentcache        110048
> >> function nfsrvd_write            110048
> >> function nfsrvd_read             283648
> >> function nfsrvd_dorpc            393800
> >> function nfsrvd_getcache         393800
> >> function nfsrvd_rephead          393800
> >> function nfsrvd_updatecache      393800
> >>
> >> function nfsrvd_access(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> >>                1 |                                         0
> >>
> >> function nfsrvd_statfs(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> >>                1 |                                         0
> >>
> >> function nfsrvd_getattr(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> >>                1 |                                         0
> >>
> >> function nfsrvd_sentcache(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> >>                1 |                                         0
> >>
> >> function nfsrvd_rephead(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>                1 |                                         0
> >>
> >> function nfsrvd_updatecache(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>                1 |                                         0
> >>
> >> function nfsrvd_getcache(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> >>                1 |                                         1
> >>                2 |                                         0
> >>                4 |                                         1
> >>                8 |                                         0
> >>
> >> function nfsrvd_write(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> >>                1 |                                         5
> >>                2 |                                         4
> >>                4 |                                         0
> >>
> >> function nfsrvd_read(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> >>                1 |                                         19
> >>                2 |                                         3
> >>                4 |                                         2
> >>                8 |                                         0
> >>               16 |                                         1
> >>               32 |                                         0
> >>               64 |                                         0
> >>              128 |                                         0
> >>              256 |                                         1
> >>              512 |                                         0
> >>
> >> function nfsrvd_commit(), time in ms:
> >>            value  ------------- Distribution ------------- count
> >>               -1 |                                         0
> >>                0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
> >>                1 |@@@@@@@                                  14
> >>                2 |                                         0
> >>                4 |@                                        1
> >>                8 |@                                        1
> >>               16 |                                         0
> >>               32 |@@@@@@@                                  14
> >>               64 |@                                        2
> >>              128 |                                         0
> >>
> >> function nfsrvd_commit(), time in ms for >= 100 ms:
> >>            value  ------------- Distribution ------------- count
> >>            < 100 |                                         0
> >>              100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>              150 |                                         0
> >>
> >> function nfsrvd_read(), time in ms for >= 100 ms:
> >>            value  ------------- Distribution ------------- count
> >>              250 |                                         0
> >>              300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>              350 |                                         0
> >>
> >> Looks like the nfs server cache functions are quite fast, but
> >> extremely frequently called.
> >>
> > Yep, they are called for every RPC.
> >
> > I may try coding up a patch that replaces the single mutex with
> > one for each hash bucket, for TCP.
> >
> > I'll post if/when I get this patch to a testing/review stage, rick
> >
> Cool.
>
> I've readjusted the precision of the dtrace script a bit, and I can now
> see that the following three functions take most of the time:
> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>
> This was recorded during an Oracle benchmark run called SLOB, which
> caused 99% CPU load on the NFS server.
>
Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
(Assuming the mounts are TCP ones.)

Have fun with it, rick

> >> I hope someone can find this information useful.
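
P.S. With drc2.patch applied, bumping the highwater mark is just a
sysctl, something like the following (the value here is only an example;
the right number depends on how much memory you are willing to let the
DRC use):

  # sysctl vfs.nfsd.tcphighwater=100000
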