Date:      Wed, 10 Oct 2012 17:42:15 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <B2CD757D-25D8-4353-8487-B3583EEC57FC@gmail.com>
In-Reply-To: <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca>
References:  <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca>


On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
>> wrote:
>>
>>> Garrett Wollman wrote:
>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>> <rmacklem@uoguelph.ca> said:
>>>>
>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>> entry
>>>>>> is on, rather than a global lock for everything. This would
>>>>>> reduce
>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>> don't have the means to measure it yet.
>>>>>>
>>>>> Well, since the cache trimming is removing entries from the lists,
>>>>> I
>>>>> don't
>>>>> see how that can be done with a global lock for list updates?
>>>>
>>>> Well, the global lock is what we have now, but the cache trimming
>>>> process only looks at one list at a time, so not locking the list
>>>> that
>>>> isn't being iterated over probably wouldn't hurt, unless there's
>>>> some
>>>> mechanism (that I didn't see) for entries to move from one list to
>>>> another. Note that I'm considering each hash bucket a separate
>>>> "list". (One issue to worry about in that case would be cache-line
>>>> contention in the array of hash buckets; perhaps
>>>> NFSRVCACHE_HASHSIZE
>>>> ought to be increased to reduce that.)
>>>>
>>> Yea, a separate mutex for each hash list might help. There is also
>>> the
>>> LRU list that all entries end up on, that gets used by the trimming
>>> code.
>>> (I think? I wrote this stuff about 8 years ago, so I haven't looked
>>> at
>>> it in a while.)
>>>
>>> Also, increasing the hash table size is probably a good idea,
>>> especially
>>> if you reduce how aggressively the cache is trimmed.
>>>
>>>>> Only doing it once/sec would result in a very large cache when
>>>>> bursts of
>>>>> traffic arrives.
>>>>
>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>
>>> This code was originally "production tested" on a server with
>>> 1Gbyte,
>>> so times have changed a bit;-)
>>>
>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>> things.
>>>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>>>> wish)
>>>>> and having a bunch more "cache trimming threads" would just
>>>>> increase
>>>>> contention, wouldn't it?
>>>>
>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>> mutex for much longer than any individual nfsd service thread has
>>>> any
>>>> need to, and having N threads doing that in parallel is why it's so
>>>> heavily contended. If there's only one thread doing the trim, then
>>>> the nfsd service threads aren't spending time either contending on
>>>> the
>>>> mutex (it will be held less frequently and for shorter periods).
>>>>
>>> I think the little drc2.patch which will keep the nfsd threads from
>>> acquiring the mutex and doing the trimming most of the time, might
>>> be
>>> sufficient. I still don't see why a separate trimming thread will be
>>> an advantage. I'd also be worried that the one cache trimming thread
>>> won't get the job done soon enough.
>>>
>>> When I did production testing on a 1Gbyte server that saw a peak
>>> load of about 100RPCs/sec, it was necessary to trim aggressively.
>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>> longer relevant, I recently recall someone trying to run FreeBSD
>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>
>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>> response times (the time the nfsd thread spends trimming the
>>>>> cache).
>>>>> As noted, I think this time would be negligible compared to disk
>>>>> I/O
>>>>> and network transit times in the total RPC response time?
>>>>
>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>> network connectivity, spinning on a contended mutex takes a
>>>> significant amount of CPU time. (For the current design of the NFS
>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>> should give that a try once I'm able to do more testing.)
>>>>
>>> Have fun with it. Let me know when you have what you think is a good
>>> patch.
>>>
>>> rick
>>>
>>>> -GAWollman
>>
>> My quest for IOPS over NFS continues :)
>> So far I'm not able to achieve more than about 3000 8K read requests
>> over NFS,
>> while the server locally gives much more.
>> And this is all from a file that is completely in ARC cache, no disk
>> IO involved.
>>
> Just out of curiosity, why do you use 8K reads instead of 64K reads?
> Since the RPC overhead (including the DRC functions) is per RPC, doing
> fewer larger RPCs should usually work better. (Sometimes large rsize/wsize
> values generate too large a burst of traffic for a network interface to
> handle and then the rsize/wsize has to be decreased to avoid this issue.)
>
> And, although this experiment seems useful for testing patches that try
> and reduce DRC CPU overheads, most "real" NFS servers will be doing disk
> I/O.
>

This is the default block size that Oracle, and probably most other databases, use.
Oracle also uses larger blocks, but for small random reads in OLTP workloads this 8K size is what gets issued.
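
(To put the per-RPC point in rough numbers: reading the whole 32G test file
sequentially would take about 4.2 million READ RPCs at 8K versus about 524
thousand at 64K, assuming one RPC per I/O, so larger reads would cut the trips
through the DRC code by about 8x. The random 8K OLTP pattern can't really be
batched that way, though, which is why the per-RPC cache cost shows up so
clearly here.)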


>> I've snatched some sample DTrace script from the net: [
>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
>> ]
>>
>> And modified it for our new NFS server:
>>
>> #!/usr/sbin/dtrace -qs
>>
>> fbt:kernel:nfsrvd_*:entry
>> {
>> self->ts = timestamp;
>> @counts[probefunc] = count();
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>> this->delta = (timestamp-self->ts)/1000000;
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 && this->delta > 100 /
>> {
>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>> }
>>
>> fbt:kernel:nfsrvd_*:return
>> / self->ts > 0 /
>> {
>> @dist[probefunc, "ms"] = quantize(this->delta);
>> self->ts = 0;
>> }
>>
>> END
>> {
>> printf("\n");
>> printa("function %-20s %@10d\n", @counts);
>> printf("\n");
>> printa("function %s(), time in %s:%@d\n", @dist);
>> printf("\n");
>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>> }
>>
>> And here's a sample output from one or two minutes during the run of
>> Oracle's ORION benchmark
>> tool from a Linux machine, on a 32G file on NFS mount over 10G
>> ethernet:
>>
>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>> ^C
>>
>> function nfsrvd_access 4
>> function nfsrvd_statfs 10
>> function nfsrvd_getattr 14
>> function nfsrvd_commit 76
>> function nfsrvd_sentcache 110048
>> function nfsrvd_write 110048
>> function nfsrvd_read 283648
>> function nfsrvd_dorpc 393800
>> function nfsrvd_getcache 393800
>> function nfsrvd_rephead 393800
>> function nfsrvd_updatecache 393800
>>
>> function nfsrvd_access(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>> 1 | 0
>>
>> function nfsrvd_statfs(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>> 1 | 0
>>
>> function nfsrvd_getattr(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>> 1 | 0
>>
>> function nfsrvd_sentcache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>> 1 | 0
>>
>> function nfsrvd_rephead(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>> 1 | 0
>>
>> function nfsrvd_updatecache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>> 1 | 0
>>
>> function nfsrvd_getcache(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>> 1 | 1
>> 2 | 0
>> 4 | 1
>> 8 | 0
>>
>> function nfsrvd_write(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>> 1 | 5
>> 2 | 4
>> 4 | 0
>>
>> function nfsrvd_read(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>> 1 | 19
>> 2 | 3
>> 4 | 2
>> 8 | 0
>> 16 | 1
>> 32 | 0
>> 64 | 0
>> 128 | 0
>> 256 | 1
>> 512 | 0
>>
>> function nfsrvd_commit(), time in ms:
>> value ------------- Distribution ------------- count
>> -1 | 0
>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>> 1 |@@@@@@@ 14
>> 2 | 0
>> 4 |@ 1
>> 8 |@ 1
>> 16 | 0
>> 32 |@@@@@@@ 14
>> 64 |@ 2
>> 128 | 0
>>
>>
>> function nfsrvd_commit(), time in ms for >= 100 ms:
>> value ------------- Distribution ------------- count
>> < 100 | 0
>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>> 150 | 0
>>
>> function nfsrvd_read(), time in ms for >= 100 ms:
>> value ------------- Distribution ------------- count
>> 250 | 0
>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>> 350 | 0
>>
>>
>> Looks like the nfs server cache functions are quite fast, but
>> extremely frequently called.
>>
> Yep, they are called for every RPC.
>
> I may try coding up a patch that replaces the single mutex with
> one for each hash bucket, for TCP.
>
> I'll post if/when I get this patch to a testing/review stage, rick
>

Cool.

I've readjusted the precision of the DTrace script a bit, and I can now see
the following three functions taking most of the time:
nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().

This was recorded during a run of an Oracle benchmark called SLOB, which
caused 99% CPU load on the NFS server.
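
To make sure I understand the idea, here is a rough userland sketch of the
per-bucket locking scheme being discussed. This is not the actual
sys/fs/nfsserver code and not Rick's patch: the names (drc_bucket, drc_entry),
the hash size, the age-based trim policy, and the use of pthread mutexes
instead of the kernel's struct mtx are all just assumptions for illustration,
and it ignores the separate TCP/UDP LRU lists the real cache keeps.

/*
 * Illustrative sketch only: per-bucket locking for a DRC-like request
 * cache.  Names, sizes and the trim policy are invented; real kernel
 * code would use struct mtx and the sys/queue.h macros instead.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define DRC_HASHSIZE 500        /* a bigger table also means shorter chains */
#define DRC_MAX_AGE  120        /* seconds before an entry may be trimmed */

struct drc_entry {
        struct drc_entry *next;
        uint32_t xid;           /* RPC transaction id */
        time_t stamp;           /* last time the entry was touched */
        /* the cached reply would hang off here */
};

struct drc_bucket {
        pthread_mutex_t lock;   /* protects only this bucket's chain */
        struct drc_entry *head;
};

static struct drc_bucket drc_table[DRC_HASHSIZE];

void
drc_init(void)
{
        int i;

        for (i = 0; i < DRC_HASHSIZE; i++)
                pthread_mutex_init(&drc_table[i].lock, NULL);
}

/* Look up (or insert) an entry while holding only one bucket lock. */
struct drc_entry *
drc_getcache(uint32_t xid)
{
        struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
        struct drc_entry *e;

        pthread_mutex_lock(&b->lock);
        for (e = b->head; e != NULL; e = e->next)
                if (e->xid == xid)
                        break;
        if (e == NULL) {
                e = calloc(1, sizeof(*e));      /* error handling omitted */
                e->xid = xid;
                e->next = b->head;
                b->head = e;
        }
        e->stamp = time(NULL);
        pthread_mutex_unlock(&b->lock);
        return (e);
}

/*
 * Trim old entries one bucket at a time, so concurrent lookups only
 * contend on the bucket currently being walked, never on a global lock.
 */
void
drc_trimcache(void)
{
        time_t now = time(NULL);
        struct drc_entry **ep, *e;
        int i;

        for (i = 0; i < DRC_HASHSIZE; i++) {
                struct drc_bucket *b = &drc_table[i];

                pthread_mutex_lock(&b->lock);
                ep = &b->head;
                while ((e = *ep) != NULL) {
                        if (now - e->stamp > DRC_MAX_AGE) {
                                *ep = e->next;
                                free(e);
                        } else
                                ep = &e->next;
                }
                pthread_mutex_unlock(&b->lock);
        }
}

The point of the sketch is just that a lookup and a trim pass touching
different buckets never block each other, unlike the current single-mutex DRC;
whether that actually helps under the SLOB load is what Rick's patch should
show.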


>> I hope someone can find this information useful.
>>



