Date: Thu, 11 Oct 2012 17:04:28 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Nikolay Denev <ndenev@gmail.com>
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject: Re: NFS server bottlenecks
Message-ID: <1517976814.2112914.1349989468096.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
Nikolay Denev wrote:
> On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:
> 
> > On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> > 
> >> Nikolay Denev wrote:
> >>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >>> 
> >>>> Nikolay Denev wrote:
> >>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >>>>> 
> >>>>>> Garrett Wollman wrote:
> >>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem <rmacklem@uoguelph.ca> said:
> >>>>>>> 
> >>>>>>>>> Simple: just use a separate mutex for each list that a cache entry
> >>>>>>>>> is on, rather than a global lock for everything. This would reduce
> >>>>>>>>> the mutex contention, but I'm not sure how significantly, since I
> >>>>>>>>> don't have the means to measure it yet.
> >>>>>>>>> 
> >>>>>>>> Well, since the cache trimming is removing entries from the lists, I
> >>>>>>>> don't see how that can be done with a global lock for list updates?
> >>>>>>> 
> >>>>>>> Well, the global lock is what we have now, but the cache trimming
> >>>>>>> process only looks at one list at a time, so not locking the list that
> >>>>>>> isn't being iterated over probably wouldn't hurt, unless there's some
> >>>>>>> mechanism (that I didn't see) for entries to move from one list to
> >>>>>>> another. Note that I'm considering each hash bucket a separate
> >>>>>>> "list". (One issue to worry about in that case would be cache-line
> >>>>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
> >>>>>>> ought to be increased to reduce that.)
> >>>>>>> 
> >>>>>> Yea, a separate mutex for each hash list might help. There is also the
> >>>>>> LRU list that all entries end up on, that gets used by the trimming
> >>>>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
> >>>>>> looked at it in a while.)
> >>>>>> 
> >>>>>> Also, increasing the hash table size is probably a good idea,
> >>>>>> especially if you reduce how aggressively the cache is trimmed.
> >>>>>> 
> >>>>>>>> Only doing it once/sec would result in a very large cache when
> >>>>>>>> bursts of traffic arrive.
> >>>>>>> 
> >>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
> >>>>>>> 
> >>>>>> This code was originally "production tested" on a server with 1Gbyte,
> >>>>>> so times have changed a bit;-)
> >>>>>> 
> >>>>>>>> I'm not sure I see why doing it as a separate thread will improve
> >>>>>>>> things. There are N nfsd threads already (N can be bumped up to 256
> >>>>>>>> if you wish) and having a bunch more "cache trimming threads" would
> >>>>>>>> just increase contention, wouldn't it?
> >>>>>>> 
> >>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
> >>>>>>> mutex for much longer than any individual nfsd service thread has any
> >>>>>>> need to, and having N threads doing that in parallel is why it's so
> >>>>>>> heavily contended.
> >>>>>>> If there's only one thread doing the trim, then the nfsd service
> >>>>>>> threads aren't spending time either contending on the mutex (it will
> >>>>>>> be held less frequently and for shorter periods).
> >>>>>>> 
> >>>>>> I think the little drc2.patch, which will keep the nfsd threads from
> >>>>>> acquiring the mutex and doing the trimming most of the time, might be
> >>>>>> sufficient. I still don't see why a separate trimming thread will be
> >>>>>> an advantage. I'd also be worried that the one cache trimming thread
> >>>>>> won't get the job done soon enough.
> >>>>>> 
> >>>>>> When I did production testing on a 1Gbyte server that saw a peak load
> >>>>>> of about 100RPCs/sec, it was necessary to trim aggressively. (Although
> >>>>>> I'd be tempted to say that a server with 1Gbyte is no longer relevant,
> >>>>>> I recall someone recently trying to run FreeBSD on an i486, although I
> >>>>>> doubt they wanted to run the nfsd on it.)
> >>>>>> 
> >>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
> >>>>>>>> threads doing it would be a (I believe negligible) increase in RPC
> >>>>>>>> response times (the time the nfsd thread spends trimming the cache).
> >>>>>>>> As noted, I think this time would be negligible compared to disk I/O
> >>>>>>>> and network transit times in the total RPC response time?
> >>>>>>> 
> >>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
> >>>>>>> network connectivity, spinning on a contended mutex takes a
> >>>>>>> significant amount of CPU time. (For the current design of the NFS
> >>>>>>> server, it may actually be a win to turn off adaptive mutexes -- I
> >>>>>>> should give that a try once I'm able to do more testing.)
> >>>>>>> 
> >>>>>> Have fun with it. Let me know when you have what you think is a good
> >>>>>> patch.
> >>>>>> 
> >>>>>> rick
> >>>>>> 
> >>>>>>> -GAWollman
> >>>>>> 
> >>>>> My quest for IOPS over NFS continues :)
> >>>>> So far I'm not able to achieve more than about 3000 8K read requests
> >>>>> over NFS, while the server locally gives much more.
> >>>>> And this is all from a file that is completely in ARC cache, no disk
> >>>>> IO involved.
> >>>>> 
> >>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
> >>>> Since the RPC overhead (including the DRC functions) is per RPC, doing
> >>>> fewer larger RPCs should usually work better. (Sometimes large
> >>>> rsize/wsize values generate too large a burst of traffic for a network
> >>>> interface to handle and then the rsize/wsize has to be decreased to
> >>>> avoid this issue.)
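To make the drc2.patch idea mentioned a few paragraphs up more concrete (the nfsd threads skip the trimming work unless the cache has actually grown too large), here is a minimal C11 sketch of that kind of gate. It is not the actual patch: drc_maybe_trim(), drc_added_entry(), drc_cachesize and drc_tcphighwater are names invented for illustration, and the real server code would use kernel primitives rather than <stdatomic.h>.

#include <stdatomic.h>

static atomic_int drc_cachesize;                    /* entries currently cached */
static atomic_flag drc_trimming = ATOMIC_FLAG_INIT; /* set while someone trims */
static int drc_tcphighwater = 100000;               /* cf. vfs.nfsd.tcphighwater */

/* Stand-in for the expensive, lock-heavy trimming pass. */
static void
drc_trimcache(void)
{
        /* walk the LRU list / hash chains and free stale entries ... */
}

/* The cache-update path would bump the count whenever it adds an entry. */
void
drc_added_entry(void)
{
        atomic_fetch_add(&drc_cachesize, 1);
}

/* Called by each nfsd thread after it services an RPC. */
void
drc_maybe_trim(void)
{
        if (atomic_load(&drc_cachesize) <= drc_tcphighwater)
                return;         /* fast path: no lock taken, nothing trimmed */
        if (atomic_flag_test_and_set(&drc_trimming))
                return;         /* another thread is already trimming */
        drc_trimcache();
        atomic_flag_clear(&drc_trimming);
}

With a gate like this, raising the highwater mark directly reduces how often any nfsd thread enters the contended trimming path, which is consistent with the improvement reported further down once vfs.nfsd.tcphighwater was increased.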
> >>>> 
> >>>> And, although this experiment seems useful for testing patches that
> >>>> try and reduce DRC CPU overheads, most "real" NFS servers will be
> >>>> doing disk I/O.
> >>>> 
> >>> This is the default block size Oracle and probably most databases use.
> >>> They also use larger blocks, but for small random reads in OLTP
> >>> applications this is what is used.
> >>> 
> >> If the client is doing 8K reads, you could increase the read ahead
> >> "readahead=N" (N up to 16), to try and increase the bandwidth.
> >> (But if the CPU is 99% busy, then I don't think it will matter.)
> > 
> > I'll try to check whether this can be set, as we are testing not only
> > with the Linux NFS client, but also with Oracle's so-called DirectNFS
> > client that is built into the app.
> > 
> >>>>> I've snatched some sample DTrace script from the net:
> >>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
> >>>>> 
> >>>>> And modified it for our new NFS server:
> >>>>> 
> >>>>> #!/usr/sbin/dtrace -qs
> >>>>> 
> >>>>> fbt:kernel:nfsrvd_*:entry
> >>>>> {
> >>>>>         self->ts = timestamp;
> >>>>>         @counts[probefunc] = count();
> >>>>> }
> >>>>> 
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 /
> >>>>> {
> >>>>>         this->delta = (timestamp-self->ts)/1000000;
> >>>>> }
> >>>>> 
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 && this->delta > 100 /
> >>>>> {
> >>>>>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> >>>>> }
> >>>>> 
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 /
> >>>>> {
> >>>>>         @dist[probefunc, "ms"] = quantize(this->delta);
> >>>>>         self->ts = 0;
> >>>>> }
> >>>>> 
> >>>>> END
> >>>>> {
> >>>>>         printf("\n");
> >>>>>         printa("function %-20s %@10d\n", @counts);
> >>>>>         printf("\n");
> >>>>>         printa("function %s(), time in %s:%@d\n", @dist);
> >>>>>         printf("\n");
> >>>>>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> >>>>> }
> >>>>> 
> >>>>> And here's a sample output from one or two minutes during the run of
> >>>>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on
> >>>>> NFS mount over 10G ethernet:
> >>>>> 
> >>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> >>>>> ^C
> >>>>> 
> >>>>> function nfsrvd_access            4
> >>>>> function nfsrvd_statfs           10
> >>>>> function nfsrvd_getattr          14
> >>>>> function nfsrvd_commit           76
> >>>>> function nfsrvd_sentcache    110048
> >>>>> function nfsrvd_write        110048
> >>>>> function nfsrvd_read         283648
> >>>>> function nfsrvd_dorpc        393800
> >>>>> function nfsrvd_getcache     393800
> >>>>> function nfsrvd_rephead      393800
> >>>>> function nfsrvd_updatecache  393800
> >>>>> 
> >>>>> function nfsrvd_access(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_statfs(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_getattr(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_sentcache(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_rephead(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_updatecache(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>>       1 |                                         0
> >>>>> 
> >>>>> function nfsrvd_getcache(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> >>>>>       1 |                                         1
> >>>>>       2 |                                         0
> >>>>>       4 |                                         1
> >>>>>       8 |                                         0
> >>>>> 
> >>>>> function nfsrvd_write(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> >>>>>       1 |                                         5
> >>>>>       2 |                                         4
> >>>>>       4 |                                         0
> >>>>> 
> >>>>> function nfsrvd_read(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> >>>>>       1 |                                         19
> >>>>>       2 |                                         3
> >>>>>       4 |                                         2
> >>>>>       8 |                                         0
> >>>>>      16 |                                         1
> >>>>>      32 |                                         0
> >>>>>      64 |                                         0
> >>>>>     128 |                                         0
> >>>>>     256 |                                         1
> >>>>>     512 |                                         0
> >>>>> 
> >>>>> function nfsrvd_commit(), time in ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>      -1 |                                         0
> >>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
> >>>>>       1 |@@@@@@@                                  14
> >>>>>       2 |                                         0
> >>>>>       4 |@                                        1
> >>>>>       8 |@                                        1
> >>>>>      16 |                                         0
> >>>>>      32 |@@@@@@@                                  14
> >>>>>      64 |@                                        2
> >>>>>     128 |                                         0
> >>>>> 
> >>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>   < 100 |                                         0
> >>>>>     100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>>     150 |                                         0
> >>>>> 
> >>>>> function nfsrvd_read(), time in ms for >= 100 ms:
> >>>>>   value  ------------- Distribution ------------- count
> >>>>>     250 |                                         0
> >>>>>     300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>>     350 |                                         0
> >>>>> 
> >>>>> Looks like the nfs server cache functions are quite fast, but
> >>>>> extremely frequently called.
> >>>>> 
> >>>> Yep, they are called for every RPC.
> >>>> 
> >>>> I may try coding up a patch that replaces the single mutex with
> >>>> one for each hash bucket, for TCP.
> >>>> 
> >>>> I'll post if/when I get this patch to a testing/review stage, rick
> >>>> 
> >>> Cool.
> >>> 
> >>> I've readjusted the precision of the dtrace script a bit, and I can now
> >>> see the following three functions taking most of the time:
> >>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache()
> >>> 
> >>> This was recorded during an Oracle benchmark run called SLOB, which
> >>> caused 99% CPU load on the NFS server.
> >>> 
> >> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
> >> (Assuming the mounts are TCP ones.)
> >> 
> >> Have fun with it, rick
> >> 
> > I had upped it, but probably not enough. I'm now running with
> > vfs.nfsd.tcphighwater set to some ridiculous number, and
> > NFSRVCACHE_HASHSIZE set to 500.
> > So far it looks like a good improvement, as those functions no longer
> > show up in the dtrace script output.
> > I'll run some more benchmarks and testing today.
> > 
> > Thanks!
> > 
> >>>>> I hope someone can find this information useful.
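For readers following along, here is a minimal userland C sketch of the per-hash-bucket locking mentioned above (one mutex per hash chain, with a larger bucket count along the lines of the NFSRVCACHE_HASHSIZE of 500 used here). It is not the FreeBSD DRC code: the type and function names (drc_bucket, drc_lookup_or_insert()) are invented for illustration, and pthread mutexes stand in for kernel mutexes.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 500                    /* cf. bumping NFSRVCACHE_HASHSIZE */

struct drc_entry {
        struct drc_entry *next;
        uint32_t xid;                   /* RPC transaction id, used as the hash key */
        /* ... the cached reply would live here ... */
};

struct drc_bucket {
        pthread_mutex_t lock;           /* one mutex per hash chain */
        struct drc_entry *head;
};

static struct drc_bucket buckets[NBUCKETS];

void
drc_init(void)
{
        for (int i = 0; i < NBUCKETS; i++)
                pthread_mutex_init(&buckets[i].lock, NULL);
}

/* Find the cache entry for an RPC, creating one on a miss. */
struct drc_entry *
drc_lookup_or_insert(uint32_t xid)
{
        struct drc_bucket *b = &buckets[xid % NBUCKETS];
        struct drc_entry *e;

        pthread_mutex_lock(&b->lock);   /* contends only with this chain */
        for (e = b->head; e != NULL; e = e->next)
                if (e->xid == xid)
                        break;
        if (e == NULL && (e = calloc(1, sizeof(*e))) != NULL) {
                e->xid = xid;
                e->next = b->head;
                b->head = e;
        }
        pthread_mutex_unlock(&b->lock);
        return (e);
}

Two nfsd threads now contend only when their RPCs hash to the same chain. The part this sketch does not cover is the global LRU list used by the trimming code, which would still need its own lock or a per-bucket trimming scheme.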
> I haven't had the opportunity today to run more DB tests over NFS, as the
> DBA was busy with something else; however, I tested the large file
> transfers a bit. And I'm seeing something strange that is probably not
> only NFS related, but also ZFS and ARC related.
> 
> When I first tested the drc2.patch I reported a huge bandwidth
> improvement, but now I think that this was probably because the machine
> was freshly rebooted, rather than just because of the patch. The patch
> surely improved things, especially CPU utilization combined with the
> increased cache. But today I again have a file completely cached in ZFS's
> ARC cache, which when transferred over NFS reaches about 300MB/s, when in
> some tests it reached 900+MB/s (as reported in my first email). The file
> can be read locally at about 3GB/s, as reported by dd.
> 
> Local:
> [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
> 
> Over NFS:
> [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
> 
> The machines are almost idle during this transfer and I can't see a
> reason why it can't reach the full bandwidth when it's just reading it
> from RAM.
> 
Well, one thing is that you must fill the network pipe with bits. That
means that "rsize * readahead" must be >= "data rate * transit delay"
(both in the same units, such as bytes).

For example (given a 10Gb/s network):
  65536 * 16 = 1Mbyte - a 10Gbps rate is about 1200Mbytes/sec
  --> this should fill the network "pipe" so long as the RPC transit delay
      is less than 1/1200th of a sec (or somewhat less than 1msec)

If you were using rsize=8192 and the default readahead=2:
  8192 * 2 = 16Kbytes - a 10Gbps rate is about 1200000Kbytes/sec
  --> the network pipe would only be full if the RPC transit delay is less
      than 16/1200000 of a sec (or a little over 10usec, unlikely for any
      real network, I think?)

I don't know what your RPC transit delay is (half of a NULL RPC RTT, but I
don't know that, either;-).

That was why I suggested "readahead=16". I'd also use "rsize=65536", but
if that worked very poorly, I'd try decreasing rsize by half (32768,
16384, ...) to see what was optimal.

rick
ps: I'm assuming your dd runs are from a FreeBSD client and not the
Oracle one.
pss: Did I actually get the above calculations correct without a
calculator;-)

> I've tried tracing again with DTrace to see what's happening, using this
> script:
> 
> fbt:kernel:nfs*:entry
> {
>         this->ts = timestamp;
>         @counts[probefunc] = count();
> }
> 
> fbt:kernel:nfs*:return
> / this->ts > 0 /
> {
>         @time[probefunc] = avg(timestamp - this->ts);
> }
> 
> END
> {
>         trunc(@counts, 10);
>         trunc(@time, 10);
>         printf("Top 10 called functions\n\n");
>         printa(@counts);
>         printf("\n\nTop 10 slowest functions\n\n");
>         printa(@time);
> }
> 
> And here's the result (several seconds during the dd test):
> 
> Top 10 called functions
> 
>   nfsrc_freecache       88849
>   nfsrc_wanted          88849
>   nfsrv_fillattr        88849
>   nfsrv_postopattr      88849
>   nfsrvd_read           88849
>   nfsrvd_rephead        88849
>   nfsrvd_updatecache    88849
>   nfsvno_testexp        88849
>   nfsrc_trimcache      177697
>   nfsvno_getattr       177698
> 
> Top 10 slowest functions
> 
>   nfsd_excred            5673
>   nfsrc_freecache        5674
>   nfsrv_postopattr       5970
>   nfsrv_servertimer      6327
>   nfssvc_nfscommon       6596
>   nfsd_fhtovp            8000
>   nfsrvd_read            8380
>   nfssvc_program        92752
>   nfsvno_read          124979
>   nfsvno_fhtovp       1789523
> 
> I might try now to trace what nfsvno_fhtovp() is doing and where it is
> spending its time.
> 
> Any other ideas are welcome :)
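As a quick double-check of the pipe-fill arithmetic in the reply above, the same numbers can be recomputed with a few lines of C. This is only a sketch using the thread's rough figure of 1200Mbytes/sec for a 10Gbps link; it is not part of any NFS code.

#include <stdio.h>

int
main(void)
{
        double rate = 1200e6;           /* rough 10GbE payload rate, bytes/sec */
        struct { int rsize, readahead; } c[] = {
                { 65536, 16 },          /* the suggested mount settings */
                { 8192, 2 },            /* 8K reads with the default readahead */
        };

        for (int i = 0; i < 2; i++) {
                double outstanding = (double)c[i].rsize * c[i].readahead;
                /* The pipe stays full only while the RPC transit delay is
                 * below outstanding / rate. */
                printf("rsize=%d readahead=%d: %.0f bytes outstanding, "
                    "max transit delay %.1f usec\n",
                    c[i].rsize, c[i].readahead, outstanding,
                    outstanding / rate * 1e6);
        }
        return (0);
}

It prints a maximum tolerable RPC transit delay of roughly 874 usec for rsize=65536 with readahead=16, and about 13.7 usec for rsize=8192 with readahead=2, which matches the "somewhat less than 1msec" and "a little over 10usec" figures above.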