Date: Tue, 9 Oct 2012 18:23:37 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: NFS server bottlenecks
Message-ID: <59B593A0-DA96-4395-B6B9-196F73A1415C@gmail.com>
In-Reply-To: <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>
References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca> <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>
On Oct 9, 2012, at 5:12 PM, Nikolay Denev <ndenev@gmail.com> wrote:

> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
>> Garrett Wollman wrote:
>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>> <rmacklem@uoguelph.ca> said:
>>>
>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>> is on, rather than a global lock for everything.  This would reduce
>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>> don't have the means to measure it yet.
>>>>>
>>>> Well, since the cache trimming is removing entries from the lists, I
>>>> don't see how that can be done with a global lock for list updates?
>>>
>>> Well, the global lock is what we have now, but the cache trimming
>>> process only looks at one list at a time, so not locking the list that
>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>> mechanism (that I didn't see) for entries to move from one list to
>>> another.  Note that I'm considering each hash bucket a separate
>>> "list".  (One issue to worry about in that case would be cache-line
>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>> ought to be increased to reduce that.)
>>>
>> Yea, a separate mutex for each hash list might help.  There is also the
>> LRU list that all entries end up on, that gets used by the trimming code.
>> (I think?  I wrote this stuff about 8 years ago, so I haven't looked at
>> it in a while.)
>>
>> Also, increasing the hash table size is probably a good idea, especially
>> if you reduce how aggressively the cache is trimmed.
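To make the per-list locking concrete, a minimal sketch of a DRC-like cache with one mutex per hash chain and a separate mutex for the LRU list used by the trimmer might look like the fragment below. All names are made up and entry lifetime, refcounting and lock initialization are elided; this illustrates the idea only and is not the actual sys/fs/nfsserver code.

/*
 * Hypothetical sketch of per-bucket locking for a DRC-like cache.
 * Each hash chain has its own mutex; the LRU list used for trimming
 * has a separate one, so a lookup only contends with other RPCs that
 * hash to the same bucket (or with LRU maintenance).
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define	DRC_HASHSIZE	500		/* e.g. a larger NFSRVCACHE_HASHSIZE */

struct drc_entry {
	LIST_ENTRY(drc_entry)	de_hash;	/* per-bucket hash chain */
	TAILQ_ENTRY(drc_entry)	de_lru;		/* global LRU, for trimming */
	uint32_t		de_xid;
	/* ... cached reply, timestamps, etc. ... */
};

struct drc_bucket {
	struct mtx		db_lock;	/* protects db_head only */
	LIST_HEAD(, drc_entry)	db_head;
} __aligned(CACHE_LINE_SIZE);			/* limit false sharing of buckets */

static struct drc_bucket	drc_table[DRC_HASHSIZE];
static struct mtx		drc_lru_lock;	/* protects drc_lru only */
static TAILQ_HEAD(, drc_entry)	drc_lru = TAILQ_HEAD_INITIALIZER(drc_lru);

static struct drc_entry *
drc_lookup(uint32_t xid)
{
	struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
	struct drc_entry *e;

	mtx_lock(&b->db_lock);			/* one bucket, not the whole cache */
	LIST_FOREACH(e, &b->db_head, de_hash)
		if (e->de_xid == xid)
			break;
	mtx_unlock(&b->db_lock);

	if (e != NULL) {
		/*
		 * LRU maintenance under its own lock.  (A real version
		 * would need a reference to keep the entry from being
		 * freed once the bucket lock is dropped.)
		 */
		mtx_lock(&drc_lru_lock);
		TAILQ_REMOVE(&drc_lru, e, de_lru);
		TAILQ_INSERT_TAIL(&drc_lru, e, de_lru);
		mtx_unlock(&drc_lru_lock);
	}
	return (e);
}

The trimmer would then walk drc_lru under drc_lru_lock and take only the victim's bucket lock to unhash each entry, so an nfsd service thread never waits on a cache-wide lock.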
>>>> Only doing it once/sec would result in a very large cache when bursts
>>>> of traffic arrive.
>>>
>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>
>> This code was originally "production tested" on a server with 1Gbyte,
>> so times have changed a bit ;-)
>>
>>>> I'm not sure I see why doing it as a separate thread will improve things.
>>>> There are N nfsd threads already (N can be bumped up to 256 if you wish)
>>>> and having a bunch more "cache trimming threads" would just increase
>>>> contention, wouldn't it?
>>>
>>> Only one cache-trimming thread.  The cache trim holds the (global)
>>> mutex for much longer than any individual nfsd service thread has any
>>> need to, and having N threads doing that in parallel is why it's so
>>> heavily contended.  If there's only one thread doing the trim, then
>>> the nfsd service threads aren't spending time contending on the
>>> mutex (it will be held less frequently and for shorter periods).
>>>
>> I think the little drc2.patch, which will keep the nfsd threads from
>> acquiring the mutex and doing the trimming most of the time, might be
>> sufficient.  I still don't see why a separate trimming thread will be
>> an advantage.  I'd also be worried that the one cache trimming thread
>> won't get the job done soon enough.
>>
>> When I did production testing on a 1Gbyte server that saw a peak
>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>> (Although I'd be tempted to say that a server with 1Gbyte is no
>> longer relevant, I recently recall someone trying to run FreeBSD
>> on an i486, although I doubt they wanted to run the nfsd on it.)
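For readers who don't have drc2.patch handy, the general shape of "keep the nfsd threads from acquiring the mutex and doing the trimming most of the time" could look roughly like the sketch below. This is a guess at the approach, not the actual patch; the names and numbers are invented.

/*
 * Hypothetical illustration only (not drc2.patch).  Every RPC still
 * calls drc_trim(), but the cache mutex is taken only when a cheap
 * unlocked check says the cache has grown past a highwater mark.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx	drc_lock;		/* the existing global cache mutex */
static volatile u_int	drc_cachesize;		/* entry count, updated under drc_lock */
static u_int		drc_highwater = 4096;	/* would be a sysctl tunable */

static void
drc_trim(void)
{
	/*
	 * Unlocked read: it may be slightly stale, but the worst case is
	 * trimming one RPC earlier or later than ideal.  The common case
	 * avoids the mutex entirely.
	 */
	if (drc_cachesize <= drc_highwater)
		return;

	mtx_lock(&drc_lock);
	if (drc_cachesize > drc_highwater) {
		/*
		 * Re-checked under the lock: evict entries from the head
		 * of the LRU until the count drops to a lower watermark,
		 * decrementing drc_cachesize as each entry is freed.
		 * (The eviction loop itself is elided in this sketch.)
		 */
	}
	mtx_unlock(&drc_lock);
}

The unlocked comparison is racy, but that is harmless here because trimming is only a heuristic; the point is that the common per-RPC path no longer touches the mutex at all.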
>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>> response times (the time the nfsd thread spends trimming the cache).
>>>> As noted, I think this time would be negligible compared to disk I/O
>>>> and network transit times in the total RPC response time?
>>>
>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>> network connectivity, spinning on a contended mutex takes a
>>> significant amount of CPU time.  (For the current design of the NFS
>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>> should give that a try once I'm able to do more testing.)
>>>
>> Have fun with it.  Let me know when you have what you think is a good patch.
>>
>> rick
>>
>>> -GAWollman
>
> My quest for IOPS over NFS continues :)
> So far I'm not able to achieve more than about 3000 8K read requests over NFS,
> while the server locally gives much more.
> And this is all from a file that is completely in ARC cache, no disk IO involved.
>
> I've snatched some sample DTrace script from the net :
> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>
> And modified it for our new NFS server :
>
> #!/usr/sbin/dtrace -qs
>
> fbt:kernel:nfsrvd_*:entry
> {
>         self->ts = timestamp;
>         @counts[probefunc] = count();
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         this->delta = (timestamp - self->ts) / 1000000;
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 && this->delta > 100 /
> {
>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         @dist[probefunc, "ms"] = quantize(this->delta);
>         self->ts = 0;
> }
>
> END
> {
>         printf("\n");
>         printa("function %-20s %@10d\n", @counts);
>         printf("\n");
>         printa("function %s(), time in %s:%@d\n", @dist);
>         printf("\n");
>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> }
>
> And here's a sample output from one or two minutes during the run of Oracle's ORION benchmark
> tool from a Linux machine, on a 32G file on NFS mount over 10G ethernet:
>
> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> ^C
>
> function nfsrvd_access                4
> function nfsrvd_statfs               10
> function nfsrvd_getattr              14
> function nfsrvd_commit               76
> function nfsrvd_sentcache        110048
> function nfsrvd_write            110048
> function nfsrvd_read             283648
> function nfsrvd_dorpc            393800
> function nfsrvd_getcache         393800
> function nfsrvd_rephead          393800
> function nfsrvd_updatecache      393800
>
> function nfsrvd_access(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>                1 |                                         0
>
> function nfsrvd_statfs(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>                1 |                                         0
>
> function nfsrvd_getattr(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>                1 |                                         0
>
> function nfsrvd_sentcache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>                1 |                                         0
>
> function nfsrvd_rephead(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>                1 |                                         0
>
> function nfsrvd_updatecache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>                1 |                                         0
>
> function nfsrvd_getcache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>                1 |                                         1
>                2 |                                         0
>                4 |                                         1
>                8 |                                         0
>
> function nfsrvd_write(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>                1 |                                         5
>                2 |                                         4
>                4 |                                         0
>
> function nfsrvd_read(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>                1 |                                         19
>                2 |                                         3
>                4 |                                         2
>                8 |                                         0
>               16 |                                         1
>               32 |                                         0
>               64 |                                         0
>              128 |                                         0
>              256 |                                         1
>              512 |                                         0
>
> function nfsrvd_commit(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>                1 |@@@@@@@                                  14
>                2 |                                         0
>                4 |@                                        1
>                8 |@                                        1
>               16 |                                         0
>               32 |@@@@@@@                                  14
>               64 |@                                        2
>              128 |                                         0
>
> function nfsrvd_commit(), time in ms for >= 100 ms:
>            value  ------------- Distribution ------------- count
>            < 100 |                                         0
>              100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>              150 |                                         0
>
> function nfsrvd_read(), time in ms for >= 100 ms:
>            value  ------------- Distribution ------------- count
>              250 |                                         0
>              300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>              350 |                                         0
>
> Looks like the nfs server cache functions are quite fast, but extremely frequently called.
>
> I hope someone can find this information useful.

Here's another quick one :

#!/usr/sbin/dtrace -qs

#pragma D option quiet

fbt:kernel:nfsrvd_*:entry
{
        self->trace = 1;
}

fbt:kernel:nfsrvd_*:return
/ self->trace /
{
        @calls[probefunc] = count();
}

tick-1sec
{
        printf("%40s | %s\n", "function", "calls per second");
        printa("%40s %10@d\n", @calls);
        clear(@calls);
        printf("\n");
}

Showing the number of calls per second to the nfsrvd_* functions.
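Since the fbt numbers above only show that the cache functions are hot, not how much the global lock itself costs, a small userland toy (nothing FreeBSD-specific, all names invented) can give a rough feel for the contention Garrett describes: several threads hammer a hash table protected either by one global pthread mutex or by one mutex per bucket, and the two run times are compared.

/*
 * Userland toy, not kernel code.  NTHREADS threads each do NOPS random
 * "lookups" against NBUCKETS counters, taking either one global mutex
 * or the bucket's own mutex.  Compare the wall-clock time of the two
 * modes to see what the single lock costs under parallel load.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	NBUCKETS	500		/* think NFSRVCACHE_HASHSIZE */
#define	NTHREADS	8		/* think nfsd service threads */
#define	NOPS		1000000		/* lookups per thread */

static pthread_mutex_t	global_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t	bucket_lock[NBUCKETS];
static unsigned long	counter[NBUCKETS];
static int		use_global;

static void *
worker(void *arg)
{
	unsigned int seed = (unsigned int)(uintptr_t)arg;
	int i, b;

	for (i = 0; i < NOPS; i++) {
		b = rand_r(&seed) % NBUCKETS;
		pthread_mutex_t *l = use_global ? &global_lock : &bucket_lock[b];

		pthread_mutex_lock(l);
		counter[b]++;		/* stand-in for a cache hit/update */
		pthread_mutex_unlock(l);
	}
	return (NULL);
}

int
main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	int i;

	use_global = (argc > 1 && strcmp(argv[1], "global") == 0);
	for (i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&bucket_lock[i], NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)(uintptr_t)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("%s locking: done\n", use_global ? "global" : "per-bucket");
	return (0);
}

Build with something like "cc -O2 -o lockbench lockbench.c -pthread" and compare "time ./lockbench global" against "time ./lockbench".  A userland pthread mutex is not a kernel adaptive mutex, so treat the difference as a hint, not a measurement.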