Date:      Thu, 11 Oct 2012 08:46:49 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
In-Reply-To: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>
References:  <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>
On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> Nikolay Denev wrote:
>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>>
>>>>> Garrett Wollman wrote:
>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>> <rmacklem@uoguelph.ca> said:
>>>>>>
>>>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>>>> entry is on, rather than a global lock for everything. This
>>>>>>>> would reduce the mutex contention, but I'm not sure how
>>>>>>>> significantly, since I don't have the means to measure it yet.
>>>>>>>>
>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>> lists, I don't see how that can be done with a global lock for
>>>>>>> list updates?
>>>>>>
>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>> process only looks at one list at a time, so not locking the list
>>>>>> that isn't being iterated over probably wouldn't hurt, unless
>>>>>> there's some mechanism (that I didn't see) for entries to move
>>>>>> from one list to another. Note that I'm considering each hash
>>>>>> bucket a separate "list". (One issue to worry about in that case
>>>>>> would be cache-line contention in the array of hash buckets;
>>>>>> perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
>>>>>>
>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>> the LRU list that all entries end up on, that gets used by the
>>>>> trimming code. (I think? I wrote this stuff about 8 years ago, so
>>>>> I haven't looked at it in a while.)
>>>>>
>>>>> Also, increasing the hash table size is probably a good idea,
>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>
>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>> bursts of traffic arrive.
>>>>>>
>>>>>> My servers have 96 GB of memory, so that's not a big deal for me.
>>>>>>
>>>>> This code was originally "production tested" on a server with
>>>>> 1Gbyte, so times have changed a bit;-)
>>>>>
>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>> improve things. There are N nfsd threads already (N can be
>>>>>>> bumped up to 256 if you wish) and having a bunch more "cache
>>>>>>> trimming threads" would just increase contention, wouldn't it?
>>>>>>
>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>> mutex for much longer than any individual nfsd service thread has
>>>>>> any need to, and having N threads doing that in parallel is why
>>>>>> it's so heavily contended. If there's only one thread doing the
>>>>>> trim, then the nfsd service threads aren't spending time
>>>>>> contending on the mutex (it will be held less frequently and for
>>>>>> shorter periods).
>>>>>>
>>>>> I think the little drc2.patch, which will keep the nfsd threads
>>>>> from acquiring the mutex and doing the trimming most of the time,
>>>>> might be sufficient. I still don't see why a separate trimming
>>>>> thread would be an advantage. I'd also be worried that the one
>>>>> cache trimming thread won't get the job done soon enough.
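Interjecting here: just to check that I understand the per-bucket
proposal, below is a rough userland sketch I put together. It is
untested; all the names (drc_entry, drc_bucket, drc_lookup, ...) are
invented for illustration and are not the ones in the actual NFS server
sources, an in-kernel version would of course use mtx(9), and the LRU
list the trimmer walks would still need its own lock (or per-bucket LRU
lists):

/*
 * Sketch only: one mutex per hash bucket instead of one global lock.
 * Names are hypothetical, not from sys/fs/nfsserver.
 */
#include <sys/queue.h>
#include <pthread.h>
#include <stdint.h>

#define DRC_HASHSIZE 500        /* a bigger table also spreads contention */

struct drc_entry {
        LIST_ENTRY(drc_entry) de_hash;
        uint32_t de_xid;        /* RPC transaction id */
};

static struct drc_bucket {
        pthread_mutex_t db_mtx; /* one lock per bucket, not one global one */
        LIST_HEAD(, drc_entry) db_head;
} drc_table[DRC_HASHSIZE];

static void
drc_init(void)
{
        for (int i = 0; i < DRC_HASHSIZE; i++) {
                pthread_mutex_init(&drc_table[i].db_mtx, NULL);
                LIST_INIT(&drc_table[i].db_head);
        }
}

/*
 * Lookup takes only the one bucket's lock, so nfsd threads whose RPCs
 * hash to different buckets never contend with each other.  (A real
 * version would have to ref-count or copy the entry before dropping
 * the lock, of course.)
 */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
        struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
        struct drc_entry *e;

        pthread_mutex_lock(&b->db_mtx);
        LIST_FOREACH(e, &b->db_head, de_hash) {
                if (e->de_xid == xid)
                        break;
        }
        pthread_mutex_unlock(&b->db_mtx);
        return (e);
}

If I read it right, with something like this only threads whose RPCs
hash to the same bucket ever contend, and a larger DRC_HASHSIZE spreads
them out further.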
>>>>>
>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>
>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>> threads doing it would be a (I believe negligible) increase in
>>>>>>> RPC response times (the time the nfsd thread spends trimming the
>>>>>>> cache). As noted, I think this time would be negligible compared
>>>>>>> to disk I/O and network transit times in the total RPC response
>>>>>>> time?
>>>>>>
>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
>>>>>> 10G network connectivity, spinning on a contended mutex takes a
>>>>>> significant amount of CPU time. (For the current design of the
>>>>>> NFS server, it may actually be a win to turn off adaptive
>>>>>> mutexes -- I should give that a try once I'm able to do more
>>>>>> testing.)
>>>>>>
>>>>> Have fun with it. Let me know when you have what you think is a
>>>>> good patch.
>>>>>
>>>>> rick
>>>>>
>>>>>> -GAWollman
>>>>>> _______________________________________________
>>>>>> freebsd-hackers@freebsd.org mailing list
>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>>> To unsubscribe, send any mail to
>>>>>> "freebsd-hackers-unsubscribe@freebsd.org"
>>>>> _______________________________________________
>>>>> freebsd-fs@freebsd.org mailing list
>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>> To unsubscribe, send any mail to
>>>>> "freebsd-fs-unsubscribe@freebsd.org"
>>>>
>>>> My quest for IOPS over NFS continues :)
>>>> So far I'm not able to achieve more than about 3000 8K read
>>>> requests over NFS, while the server locally gives much more.
>>>> And this is all from a file that is completely in the ARC cache,
>>>> with no disk IO involved.
>>>>
>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>> Since the RPC overhead (including the DRC functions) is per RPC,
>>> doing fewer, larger RPCs should usually work better. (Sometimes
>>> large rsize/wsize values generate too large a burst of traffic for a
>>> network interface to handle, and then the rsize/wsize has to be
>>> decreased to avoid this issue.)
>>>
>>> And, although this experiment seems useful for testing patches that
>>> try to reduce DRC CPU overheads, most "real" NFS servers will be
>>> doing disk I/O.
>>>
>>
>> This is the default block size that Oracle, and probably most
>> databases, use. They also use larger blocks, but for small random
>> reads in OLTP applications this is what is used.
>>
> If the client is doing 8K reads, you could increase the read ahead
> ("readahead=N", N up to 16) to try and increase the bandwidth.
> (But if the CPU is 99% busy, then I don't think it will matter.)

I'll try to check whether this can be set, as we are testing not only
with the Linux NFS client, but also with Oracle's built-in, so-called
DirectNFS client that is built into the app.
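(For reference, on a FreeBSD client I believe that would be mount
options along the lines of

    mount -t nfs -o readahead=16,rsize=8192 server:/export /mnt

where the hostname and paths are just placeholders. How to do the
equivalent for DirectNFS I still need to find out.)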
>
>>
>>>> I've snatched some sample DTrace script from the net:
>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>>
>>>> And modified it for our new NFS server:
>>>>
>>>> #!/usr/sbin/dtrace -qs
>>>>
>>>> fbt:kernel:nfsrvd_*:entry
>>>> {
>>>>     self->ts = timestamp;
>>>>     @counts[probefunc] = count();
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>>     this->delta = (timestamp - self->ts) / 1000000;
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 && this->delta > 100 /
>>>> {
>>>>     @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>>     @dist[probefunc, "ms"] = quantize(this->delta);
>>>>     self->ts = 0;
>>>> }
>>>>
>>>> END
>>>> {
>>>>     printf("\n");
>>>>     printa("function %-20s %@10d\n", @counts);
>>>>     printf("\n");
>>>>     printa("function %s(), time in %s:%@d\n", @dist);
>>>>     printf("\n");
>>>>     printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>> }
>>>>
>>>> And here's a sample output from one or two minutes during the run
>>>> of Oracle's ORION benchmark tool from a Linux machine, on a 32G
>>>> file on an NFS mount over 10G ethernet:
>>>>
>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>> ^C
>>>>
>>>> function nfsrvd_access                4
>>>> function nfsrvd_statfs              10
>>>> function nfsrvd_getattr             14
>>>> function nfsrvd_commit              76
>>>> function nfsrvd_sentcache       110048
>>>> function nfsrvd_write           110048
>>>> function nfsrvd_read            283648
>>>> function nfsrvd_dorpc           393800
>>>> function nfsrvd_getcache        393800
>>>> function nfsrvd_rephead         393800
>>>> function nfsrvd_updatecache     393800
>>>>
>>>> function nfsrvd_access(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_statfs(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_getattr(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_sentcache(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_rephead(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_updatecache(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>        1 |                                         0
>>>>
>>>> function nfsrvd_getcache(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>        1 |                                         1
>>>>        2 |                                         0
>>>>        4 |                                         1
>>>>        8 |                                         0
>>>>
>>>> function nfsrvd_write(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>        1 |                                         5
>>>>        2 |                                         4
>>>>        4 |                                         0
>>>>
>>>> function nfsrvd_read(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>        1 |                                         19
>>>>        2 |                                         3
>>>>        4 |                                         2
>>>>        8 |                                         0
>>>>       16 |                                         1
>>>>       32 |                                         0
>>>>       64 |                                         0
>>>>      128 |                                         0
>>>>      256 |                                         1
>>>>      512 |                                         0
>>>>
>>>> function nfsrvd_commit(), time in ms:
>>>>    value  ------------- Distribution ------------- count
>>>>       -1 |                                         0
>>>>        0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>>>>        1 |@@@@@@@                                  14
>>>>        2 |                                         0
>>>>        4 |@                                        1
>>>>        8 |@                                        1
>>>>       16 |                                         0
>>>>       32 |@@@@@@@                                  14
>>>>       64 |@                                        2
>>>>      128 |                                         0
>>>>
>>>>
>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>    value  ------------- Distribution ------------- count
>>>>    < 100 |                                         0
>>>>      100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>      150 |                                         0
>>>>
>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>    value  ------------- Distribution ------------- count
>>>>      250 |                                         0
>>>>      300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>      350 |                                         0
>>>>
>>>>
>>>> Looks like the nfs server cache functions are quite fast, but
>>>> extremely frequently called.
>>>>
>>> Yep, they are called for every RPC.
>>>
>>> I may try coding up a patch that replaces the single mutex with
>>> one for each hash bucket, for TCP.
>>>
>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>
>>
>> Cool.
>>
>> I've readjusted the precision of the dtrace script a bit, and I can
>> now see the following three functions as taking most of the time:
>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>>
>> This was recorded during an Oracle benchmark run called SLOB, which
>> caused 99% CPU load on the NFS server.
>>
> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
> (Assuming the mounts are TCP ones.)
>
> Have fun with it, rick
>

I had upped it, but probably not enough. I'm now running with
vfs.nfsd.tcphighwater set to some ridiculous number, and with
NFSRVCACHE_HASHSIZE set to 500 (see the P.S. below for how these are
set).

So far it looks like a good improvement, as those functions no longer
show up in the dtrace script output. I'll run some more benchmarks and
testing today.

Thanks!

>>
>>>> I hope someone can find this information useful.
>>>>
>>>> _______________________________________________
>>>> freebsd-hackers@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>> To unsubscribe, send any mail to
>>>> "freebsd-hackers-unsubscribe@freebsd.org"
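P.S. For the archives, the two knobs are set differently. The
highwater mark is a plain sysctl, e.g.:

    sysctl vfs.nfsd.tcphighwater=100000

(the value here is only a placeholder, not the exact number I used).
NFSRVCACHE_HASHSIZE, on the other hand, is a compile-time constant (in
sys/fs/nfs/nfsrvcache.h, if I remember the path right), so bumping it
to 500 meant rebuilding the kernel rather than twiddling a sysctl.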