Date: Wed, 10 Oct 2012 17:42:15 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject: Re: NFS server bottlenecks
Message-ID: <B2CD757D-25D8-4353-8487-B3583EEC57FC@gmail.com>
In-Reply-To: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca>
References: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca>
On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> Garrett Wollman wrote:
>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>> <rmacklem@uoguelph.ca> said:
>>>>
>>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>>> is on, rather than a global lock for everything. This would reduce
>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>> don't have the means to measure it yet.
>>>>>>
>>>>> Well, since the cache trimming is removing entries from the lists, I
>>>>> don't see how that can be done with a global lock for list updates?
>>>>
>>>> Well, the global lock is what we have now, but the cache trimming
>>>> process only looks at one list at a time, so not locking the list that
>>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>>> mechanism (that I didn't see) for entries to move from one list to
>>>> another. Note that I'm considering each hash bucket a separate
>>>> "list". (One issue to worry about in that case would be cache-line
>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>>> ought to be increased to reduce that.)
>>>>
>>> Yea, a separate mutex for each hash list might help. There is also the
>>> LRU list that all entries end up on, that gets used by the trimming
>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
>>> looked at it in a while.)
>>>
>>> Also, increasing the hash table size is probably a good idea,
>>> especially if you reduce how aggressively the cache is trimmed.
>>>
>>>>> Only doing it once/sec would result in a very large cache when
>>>>> bursts of traffic arrive.
>>>>
>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>
>>> This code was originally "production tested" on a server with 1Gbyte,
>>> so times have changed a bit;-)
>>>
>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>> things. There are N nfsd threads already (N can be bumped up to 256
>>>>> if you wish) and having a bunch more "cache trimming threads" would
>>>>> just increase contention, wouldn't it?
>>>>
>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>> mutex for much longer than any individual nfsd service thread has any
>>>> need to, and having N threads doing that in parallel is why it's so
>>>> heavily contended. If there's only one thread doing the trim, then
>>>> the nfsd service threads aren't spending time contending on the
>>>> mutex (it will be held less frequently and for shorter periods).
>>>>
>>> I think the little drc2.patch, which will keep the nfsd threads from
>>> acquiring the mutex and doing the trimming most of the time, might be
>>> sufficient. I still don't see why a separate trimming thread would be
>>> an advantage. I'd also be worried that the one cache trimming thread
>>> won't get the job done soon enough.
>>>
>>> When I did production testing on a 1Gbyte server that saw a peak
>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>> longer relevant, I recall someone recently trying to run FreeBSD
>>> on an i486, although I doubt they wanted to run the nfsd on it.)
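
For readers who haven't seen drc2.patch: the behaviour described above
(nfsd threads skipping the mutex and the trim most of the time) can be
modelled in userland roughly as in the sketch below. This is only an
illustration of the idea, not the patch itself; every name and number in
it is made up.

    /*
     * Userland model of "only trim when over a highwater mark".
     * All names and limits here are invented for illustration; this is
     * not the FreeBSD NFS server code.
     */
    #include <pthread.h>
    #include <stdatomic.h>

    #define CACHE_HIGHWATER 4096

    static pthread_mutex_t cache_mtx = PTHREAD_MUTEX_INITIALIZER;
    static atomic_int cache_size;   /* approximate number of cached replies */

    static void
    cache_trim(void)
    {
            /* Placeholder: walk the LRU list and free timed-out entries. */
    }

    /*
     * In this model, called by every service thread at the end of an RPC.
     * The unlocked check means the global mutex is only taken when the
     * cache has actually grown past the highwater mark.
     */
    static void
    maybe_trim(void)
    {
            /* Cheap unlocked check; a slightly stale value is fine here. */
            if (atomic_load_explicit(&cache_size, memory_order_relaxed) <=
                CACHE_HIGHWATER)
                    return;

            pthread_mutex_lock(&cache_mtx);
            cache_trim();
            pthread_mutex_unlock(&cache_mtx);
    }

    int
    main(void)
    {
            atomic_store(&cache_size, CACHE_HIGHWATER + 1);
            maybe_trim();   /* takes the lock and trims */
            atomic_store(&cache_size, 0);
            maybe_trim();   /* returns without touching the mutex */
            return (0);
    }

The common case then costs a single unlocked comparison per RPC, and the
global mutex is only contended when the cache has really grown past the
threshold.
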
>>>
>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>> response times (the time the nfsd thread spends trimming the cache).
>>>>> As noted, I think this time would be negligible compared to disk I/O
>>>>> and network transit times in the total RPC response time?
>>>>
>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>> network connectivity, spinning on a contended mutex takes a
>>>> significant amount of CPU time. (For the current design of the NFS
>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>> should give that a try once I'm able to do more testing.)
>>>>
>>> Have fun with it. Let me know when you have what you think is a good
>>> patch.
>>>
>>> rick
>>>
>>>> -GAWollman
>>
>> My quest for IOPS over NFS continues :)
>> So far I'm not able to achieve more than about 3000 8K read requests
>> per second over NFS, while the server locally gives much more.
>> And this is all from a file that is completely in the ARC cache, no
>> disk IO involved.
>>
> Just out of curiosity, why do you use 8K reads instead of 64K reads?
> Since the RPC overhead (including the DRC functions) is per RPC, doing
> fewer larger RPCs should usually work better. (Sometimes large
> rsize/wsize values generate too large a burst of traffic for a network
> interface to handle, and then the rsize/wsize has to be decreased to
> avoid this issue.)
>
> And, although this experiment seems useful for testing patches that try
> to reduce DRC CPU overheads, most "real" NFS servers will be doing disk
> I/O.
>
This is the default block size that Oracle and probably most other
databases use. They also use larger blocks, but for small random reads
in OLTP applications this is what is used.
>> I've snatched a sample DTrace script from the net:
>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>
>> And modified it for our new NFS server:
>>
>> #!/usr/sbin/dtrace -qs
>>
>> fbt:kernel:nfsrvd_*:entry
>> {
>>     self->ts = timestamp;
>>     @counts[probefunc] = count();
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>>     this->delta = (timestamp - self->ts) / 1000000;
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 && this->delta > 100 /
>> {
>>     @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>>     @dist[probefunc, "ms"] = quantize(this->delta);
>>     self->ts = 0;
>> }
>>
>> END
>> {
>>     printf("\n");
>>     printa("function %-20s %@10d\n", @counts);
>>     printf("\n");
>>     printa("function %s(), time in %s:%@d\n", @dist);
>>     printf("\n");
>>     printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>> }
>>
>> And here's a sample output from one or two minutes during a run of
>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on
>> an NFS mount over 10G ethernet:
>>
>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>> ^C
>>
>> function nfsrvd_access                4
>> function nfsrvd_statfs               10
>> function nfsrvd_getattr              14
>> function nfsrvd_commit               76
>> function nfsrvd_sentcache        110048
>> function nfsrvd_write            110048
>> function nfsrvd_read             283648
>> function nfsrvd_dorpc            393800
>> function nfsrvd_getcache         393800
>> function nfsrvd_rephead          393800
>> function nfsrvd_updatecache      393800
>>
>> function nfsrvd_access(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>              1 |                                         0
>>
>> function nfsrvd_statfs(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>              1 |                                         0
>>
>> function nfsrvd_getattr(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>              1 |                                         0
>>
>> function nfsrvd_sentcache(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>              1 |                                         0
>>
>> function nfsrvd_rephead(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>              1 |                                         0
>>
>> function nfsrvd_updatecache(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>              1 |                                         0
>>
>> function nfsrvd_getcache(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>              1 |                                         1
>>              2 |                                         0
>>              4 |                                         1
>>              8 |                                         0
>>
>> function nfsrvd_write(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>              1 |                                         5
>>              2 |                                         4
>>              4 |                                         0
>>
>> function nfsrvd_read(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>              1 |                                         19
>>              2 |                                         3
>>              4 |                                         2
>>              8 |                                         0
>>             16 |                                         1
>>             32 |                                         0
>>             64 |                                         0
>>            128 |                                         0
>>            256 |                                         1
>>            512 |                                         0
>>
>> function nfsrvd_commit(), time in ms:
>>          value  ------------- Distribution ------------- count
>>             -1 |                                         0
>>              0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>>              1 |@@@@@@@                                  14
>>              2 |                                         0
>>              4 |@                                        1
>>              8 |@                                        1
>>             16 |                                         0
>>             32 |@@@@@@@                                  14
>>             64 |@                                        2
>>            128 |                                         0
>>
>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>          value  ------------- Distribution ------------- count
>>          < 100 |                                         0
>>            100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>            150 |                                         0
>>
>> function nfsrvd_read(), time in ms for >= 100 ms:
>>          value  ------------- Distribution ------------- count
>>            250 |                                         0
>>            300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>            350 |                                         0
>>
>> Looks like the nfs server cache functions are quite fast, but
>> extremely frequently called.
>>
> Yep, they are called for every RPC.
>
> I may try coding up a patch that replaces the single mutex with
> one for each hash bucket, for TCP.
>
> I'll post if/when I get this patch to a testing/review stage, rick
>
Cool.

I've readjusted the precision of the dtrace script a bit, and I can now
see the following three functions taking most of the time:
nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().

This was recorded during an Oracle benchmark run called SLOB, which
caused 99% CPU load on the NFS server.

>> I hope someone can find this information useful.
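
To make the shape of that per-hash-bucket change concrete, here is a rough
userland sketch, assuming a cache keyed on the RPC transaction ID (xid).
All structure and function names below are invented for illustration; the
real change would be to the kernel DRC code using mutex(9) locks, and it
would also have to deal with the shared LRU list used by the trimming code,
which this sketch ignores.

    /*
     * Userland sketch of per-hash-bucket locking for a request cache.
     * One mutex per hash chain replaces the single global mutex, so
     * lookups that hash to different buckets no longer contend with
     * each other.  Names and sizes are hypothetical.
     */
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/queue.h>

    #define HASHSIZE 512    /* made-up; cf. the NFSRVCACHE_HASHSIZE discussion */

    struct cache_entry {
            LIST_ENTRY(cache_entry) ce_hash;
            uint32_t ce_xid;
            /* ... cached reply, timestamps, state flags ... */
    };

    struct cache_bucket {
            pthread_mutex_t cb_mtx;            /* protects cb_head only */
            LIST_HEAD(, cache_entry) cb_head;
            /* padding each bucket to a cache line would reduce false sharing */
    } buckets[HASHSIZE];

    static struct cache_entry *
    cache_lookup(uint32_t xid)
    {
            struct cache_bucket *b = &buckets[xid % HASHSIZE];
            struct cache_entry *ce;

            pthread_mutex_lock(&b->cb_mtx);
            LIST_FOREACH(ce, &b->cb_head, ce_hash)
                    if (ce->ce_xid == xid)
                            break;
            /*
             * In real code the entry would need a reference, or the caller
             * would keep the bucket locked while using it; this sketch only
             * shows the locking granularity.
             */
            pthread_mutex_unlock(&b->cb_mtx);
            return (ce);
    }

    int
    main(void)
    {
            for (int i = 0; i < HASHSIZE; i++) {
                    pthread_mutex_init(&buckets[i].cb_mtx, NULL);
                    LIST_INIT(&buckets[i].cb_head);
            }
            printf("lookup(42) -> %p\n", (void *)cache_lookup(42));
            return (0);
    }

Padding each bucket to a cache-line boundary (or simply increasing the
bucket count) would also address the false-sharing concern Garrett raised
about the bucket array earlier in the thread.
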