Date: Tue, 9 Oct 2012 17:12:43 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: NFS server bottlenecks
Message-ID: <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>
In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>
References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>
On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Garrett Wollman wrote:
>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>> <rmacklem@uoguelph.ca> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I don't
>>> see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, which gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit ;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve things.
>>> There are N nfsd threads already (N can be bumped up to 256 if you wish)
>>> and having a bunch more "cache trimming threads" would just increase
>>> contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
> I think the little drc2.patch, which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread would be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
>
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recall someone recently trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman
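On the per-bucket locking being discussed above: just so we are all picturing the same
shape of change, here is a rough sketch of how I read the idea. All names, fields and
sizes below are made up for illustration; this is not the actual sys/fs/nfsserver code,
and it deliberately ignores the global LRU list Rick mentions:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

/*
 * Sketch only: a hypothetical DRC hash table where every bucket has its
 * own mutex, so nfsd threads hitting different buckets do not contend on
 * one global lock.  mtx_init() calls and entry insertion are omitted.
 */
#define DRC_HASHSIZE    500                     /* assumed: larger than today's NFSRVCACHE_HASHSIZE */

struct drc_entry {                              /* stand-in for the real cache entry */
        LIST_ENTRY(drc_entry)   e_hash;         /* hash chain linkage */
        uint32_t                e_xid;          /* RPC transaction id */
};

struct drc_bucket {
        struct mtx              b_lock;         /* protects b_chain only */
        LIST_HEAD(, drc_entry)  b_chain;        /* entries hashing to this bucket */
};

static struct drc_bucket drc_table[DRC_HASHSIZE];

/*
 * Look up an entry while holding only its bucket lock.  A real
 * implementation would keep the lock (or take a reference) while the
 * entry is in use, and still needs a story for the LRU list that the
 * trimming code walks.
 */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
        struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
        struct drc_entry *e;

        mtx_lock(&b->b_lock);
        LIST_FOREACH(e, &b->b_chain, e_hash) {
                if (e->e_xid == xid)
                        break;
        }
        mtx_unlock(&b->b_lock);
        return (e);
}

The part that still looks hard to me is exactly what Rick points out: the LRU list and
the trimming pass, which is where the global lock gets held the longest.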
My quest for IOPS over NFS continues :)

So far I'm not able to achieve more than about 3000 8K read requests per second over NFS,
while the server locally gives much more.
And this is all from a file that is completely in the ARC cache, no disk I/O involved.

I've snatched a sample DTrace script from the net
[ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
and modified it for our new NFS server:

#!/usr/sbin/dtrace -qs

/* Count calls to every nfsrvd_* function and note the entry timestamp. */
fbt:kernel:nfsrvd_*:entry
{
        self->ts = timestamp;
        @counts[probefunc] = count();
}

/* Elapsed time for this call, converted from nanoseconds to milliseconds. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        this->delta = (timestamp - self->ts) / 1000000;
}

/* Finer-grained histogram for calls that took longer than 100 ms. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
        @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}

/* Power-of-two histogram of all call durations. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        @dist[probefunc, "ms"] = quantize(this->delta);
        self->ts = 0;
}

END
{
        printf("\n");
        printa("function %-20s %@10d\n", @counts);
        printf("\n");
        printa("function %s(), time in %s:%@d\n", @dist);
        printf("\n");
        printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}

And here's a sample output from one or two minutes during a run of Oracle's ORION
benchmark tool on a Linux machine, against a 32G file on an NFS mount over 10G Ethernet:

[16:01]root@goliath:/home/ndenev# ./nfsrvd.d
^C

function nfsrvd_access               4
function nfsrvd_statfs              10
function nfsrvd_getattr             14
function nfsrvd_commit              76
function nfsrvd_sentcache       110048
function nfsrvd_write           110048
function nfsrvd_read            283648
function nfsrvd_dorpc           393800
function nfsrvd_getcache        393800
function nfsrvd_rephead         393800
function nfsrvd_updatecache     393800

function nfsrvd_access(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
               1 |                                         0

function nfsrvd_statfs(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
               1 |                                         0

function nfsrvd_getattr(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
               1 |                                         0

function nfsrvd_sentcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
               1 |                                         0

function nfsrvd_rephead(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0

function nfsrvd_updatecache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0
function nfsrvd_getcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
               1 |                                         1
               2 |                                         0
               4 |                                         1
               8 |                                         0

function nfsrvd_write(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
               1 |                                         5
               2 |                                         4
               4 |                                         0

function nfsrvd_read(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
               1 |                                         19
               2 |                                         3
               4 |                                         2
               8 |                                         0
              16 |                                         1
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         1
             512 |                                         0

function nfsrvd_commit(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
               1 |@@@@@@@                                  14
               2 |                                         0
               4 |@                                        1
               8 |@                                        1
              16 |                                         0
              32 |@@@@@@@                                  14
              64 |@                                        2
             128 |                                         0

function nfsrvd_commit(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
           < 100 |                                         0
             100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             150 |                                         0

function nfsrvd_read(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
             250 |                                         0
             300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             350 |                                         0

Looks like the NFS server cache functions are quite fast, but extremely frequently called.
I hope someone can find this information useful.
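One more thought on the trimming side: nfsrvd_getcache() and nfsrvd_updatecache() were
each called 393800 times in those one or two minutes, so whatever global lock sits on
that path is being taken several thousand times per second. That is why the idea behind
drc2.patch sounds right to me. Purely as an illustration of the shape I have in mind
(made-up names and numbers, not the actual patch), something like:

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/*
 * Sketch only: let nfsd threads skip both the trim work and the DRC
 * mutex unless the cache has actually grown past a highwater mark.
 * All names and the threshold value are hypothetical.
 */
static struct mtx drc_lock;                     /* stands in for the existing global DRC mutex */
static u_int      drc_numentries;               /* entry count, updated under drc_lock */
static u_int      drc_hiwater = 40000;          /* hypothetical trim threshold */

static void       drc_trim_one_locked(void);    /* hypothetical: evict one LRU entry */

/*
 * Called from the nfsd RPC path after each request.  The unlocked read
 * of drc_numentries may be slightly stale, which is fine: it is only
 * used to decide whether trimming is worth taking the lock for at all.
 */
static void
drc_trim_maybe(void)
{
        if (drc_numentries <= drc_hiwater)
                return;                         /* common case: no lock, no trimming */

        mtx_lock(&drc_lock);
        while (drc_numentries > drc_hiwater)
                drc_trim_one_locked();
        mtx_unlock(&drc_lock);
}

If the common case really does stay off the mutex, that should be easy to confirm with
the same kind of DTrace probes as above.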