Date:      Thu, 11 Oct 2012 08:46:49 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
In-Reply-To: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>
References:  <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>


On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
>> wrote:
>>
>>> Nikolay Denev wrote:
>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
>>>> wrote:
>>>>
>>>>> Garrett Wollman wrote:
>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>> <rmacklem@uoguelph.ca> said:
>>>>>>
>>>>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>>>>> is on, rather than a global lock for everything. This would reduce
>>>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>>>> don't have the means to measure it yet.
>>>>>>>>
>>>>>>> Well, since the cache trimming is removing entries from the lists, I
>>>>>>> don't see how that can be done with a global lock for list updates?
>>>>>>
>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>> process only looks at one list at a time, so not locking the list that
>>>>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>>>>> mechanism (that I didn't see) for entries to move from one list to
>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>> "list". (One issue to worry about in that case would be cache-line
>>>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>>>>> ought to be increased to reduce that.)
>>>>>>
>>>>> Yea, a separate mutex for each hash list might help. There is also the
>>>>> LRU list that all entries end up on, that gets used by the trimming
>>>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>> looked at it in a while.)
>>>>>
>>>>> Also, increasing the hash table size is probably a good idea,
>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>
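[For illustration only: a minimal C sketch of the per-bucket locking idea
being discussed here. The names and layout are made up for the example
(they are not the actual sys/fs/nfsserver structures); the point is just
one mutex per hash chain, padded to dodge the cache-line contention
Garrett mentions above.]

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define DRC_HASHSIZE    500             /* stand-in for NFSRVCACHE_HASHSIZE */

struct drc_entry {                      /* stand-in for a cached reply */
        TAILQ_ENTRY(drc_entry) de_hash; /* linkage on its hash chain */
        uint32_t        de_xid;
};

struct drc_bucket {
        struct mtx      db_lock;        /* protects db_chain only */
        TAILQ_HEAD(, drc_entry) db_chain;
} __aligned(CACHE_LINE_SIZE);           /* one bucket per cache line */

static struct drc_bucket drc_table[DRC_HASHSIZE];

/*
 * Each db_lock would be mtx_init()'d at module load.  An nfsd thread
 * then only serializes against threads that hash to the same bucket.
 */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
        struct drc_bucket *db = &drc_table[xid % DRC_HASHSIZE];
        struct drc_entry *de;

        mtx_lock(&db->db_lock);
        TAILQ_FOREACH(de, &db->db_chain, de_hash)
                if (de->de_xid == xid)
                        break;
        mtx_unlock(&db->db_lock);       /* real code would take a ref first */
        return (de);
}

[The global LRU list that the trimming code walks would still need its own
lock, or a per-bucket LRU, which is the part that keeps this from being
completely trivial.]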
>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>> bursts of traffic arrive.
>>>>>>
>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>
>>>>> This code was originally "production tested" on a server with 1Gbyte,
>>>>> so times have changed a bit;-)
>>>>>
>>>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>>>> things. There are N nfsd threads already (N can be bumped up to 256
>>>>>>> if you wish) and having a bunch more "cache trimming threads" would
>>>>>>> just increase contention, wouldn't it?
>>>>>>
>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>> mutex for much longer than any individual nfsd service thread has any
>>>>>> need to, and having N threads doing that in parallel is why it's so
>>>>>> heavily contended. If there's only one thread doing the trim, then
>>>>>> the nfsd service threads aren't spending time contending on the mutex
>>>>>> (it will be held less frequently and for shorter periods).
>>>>>>
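[Again purely for illustration, a bare-bones sketch of the single-trimmer
idea Garrett describes, using the FreeBSD kproc API; the drc_* names are
hypothetical:]

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/kthread.h>

static void
drc_trim_once(void)
{
        /* Walk the LRU list and free entries beyond the cache limit. */
}

/*
 * One dedicated thread does all of the trimming, so the nfsd service
 * threads never take the trim path (and its locking) themselves.
 */
static void
drc_trimmer(void *arg __unused)
{
        for (;;) {
                drc_trim_once();
                pause("drctrim", hz);   /* sleep roughly one second */
        }
}

/*
 * Started once at initialization with something like:
 *      kproc_create(drc_trimmer, NULL, NULL, 0, 0, "nfsdrctrim");
 */

[Whether a once-a-second pass can keep up with a burst of traffic is
exactly the concern Rick raises above.]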
>>>>> I think the little drc2.patch, which will keep the nfsd threads from
>>>>> acquiring the mutex and doing the trimming most of the time, might be
>>>>> sufficient. I still don't see why a separate trimming thread will be
>>>>> an advantage. I'd also be worried that the one cache trimming thread
>>>>> won't get the job done soon enough.
>>>>>
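[I haven't seen drc2.patch itself, but from Rick's description the general
shape sounds something like the sketch below: a cheap size check so the
nfsd threads skip the trim, and the mutex it needs, unless the cache is
actually over a limit. All names here are made up:]

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx       drc_lock;       /* existing global mutex, mtx_init()'d at load */
static int              drc_cachesize;  /* current number of cached entries */
static int              drc_highwater = 100000; /* a vfs.nfsd.tcphighwater-like knob */

static void
drc_trim_locked(void)
{
        /* With drc_lock held, walk the LRU and free the oldest entries. */
}

static void
drc_maybe_trim(void)
{
        /*
         * Unlocked read: it may be stale by a few entries, but it lets
         * most RPCs return here without ever touching drc_lock.
         */
        if (drc_cachesize <= drc_highwater)
                return;

        mtx_lock(&drc_lock);
        drc_trim_locked();
        mtx_unlock(&drc_lock);
}

[If that is roughly right, it would also explain why a large
vfs.nfsd.tcphighwater helps: the check almost never trips, so the
getcache/updatecache paths stay short.]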
>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>
>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>>>> response times (the time the nfsd thread spends trimming the cache).
>>>>>>> As noted, I think this time would be negligible compared to disk I/O
>>>>>>> and network transit times in the total RPC response time?
>>>>>>
>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>>>> network connectivity, spinning on a contended mutex takes a
>>>>>> significant amount of CPU time. (For the current design of the NFS
>>>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>>>> should give that a try once I'm able to do more testing.)
>>>>>>
>>>>> Have fun with it. Let me know when you have what you think is a good
>>>>> patch.
>>>>>
>>>>> rick
>>>>>
>>>>>> -GAWollman
>>>>
>>>> My quest for IOPS over NFS continues :)
>>>> So far I'm not able to achieve more than about 3000 8K read requests
>>>> over NFS, while the server locally gives much more.
>>>> And this is all from a file that is completely in ARC cache, no disk
>>>> IO involved.
>>>>
>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>> Since the RPC overhead (including the DRC functions) is per RPC, doing
>>> fewer, larger RPCs should usually work better. (Sometimes large
>>> rsize/wsize values generate too large a burst of traffic for a network
>>> interface to handle and then the rsize/wsize has to be decreased to
>>> avoid this issue.)
>>>
>>> And, although this experiment seems useful for testing patches that
>>> try to reduce DRC CPU overheads, most "real" NFS servers will be doing
>>> disk I/O.
>>>
>>
>> This is the default block size that Oracle and probably most databases
>> use. They also use larger blocks, but for small random reads in OLTP
>> applications this is what is used.
>>
> If the client is doing 8K reads, you could increase the read ahead
> "readahead=N" (N up to 16), to try and increase the bandwidth.
> (But if the CPU is 99% busy, then I don't think it will matter.)

I'll try to check if this can be set, as we are testing not only with the
Linux NFS client, but also with Oracle's built-in, so-called DirectNFS
client that is part of the application.

>
>>
>>>> I've snatched some sample DTrace script from the net : [
>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
>>>> ]
>>>>
>>>> And modified it for our new NFS server :
>>>>
>>>> #!/usr/sbin/dtrace -qs
>>>>
>>>> fbt:kernel:nfsrvd_*:entry
>>>> {
>>>> self->ts = timestamp;
>>>> @counts[probefunc] = count();
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>> this->delta = (timestamp-self->ts)/1000000;
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 && this->delta > 100 /
>>>> {
>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>> }
>>>>
>>>> fbt:kernel:nfsrvd_*:return
>>>> / self->ts > 0 /
>>>> {
>>>> @dist[probefunc, "ms"] = quantize(this->delta);
>>>> self->ts = 0;
>>>> }
>>>>
>>>> END
>>>> {
>>>> printf("\n");
>>>> printa("function %-20s %@10d\n", @counts);
>>>> printf("\n");
>>>> printa("function %s(), time in %s:%@d\n", @dist);
>>>> printf("\n");
>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>> }
>>>>
>>>> And here's a sample output from one or two minutes during the run of
>>>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on
>>>> an NFS mount over 10G Ethernet:
>>>>
>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>> ^C
>>>>
>>>> function nfsrvd_access 4
>>>> function nfsrvd_statfs 10
>>>> function nfsrvd_getattr 14
>>>> function nfsrvd_commit 76
>>>> function nfsrvd_sentcache 110048
>>>> function nfsrvd_write 110048
>>>> function nfsrvd_read 283648
>>>> function nfsrvd_dorpc 393800
>>>> function nfsrvd_getcache 393800
>>>> function nfsrvd_rephead 393800
>>>> function nfsrvd_updatecache 393800
>>>>
>>>> function nfsrvd_access(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>> 1 | 0
>>>>
>>>> function nfsrvd_statfs(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>> 1 | 0
>>>>
>>>> function nfsrvd_getattr(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>> 1 | 0
>>>>
>>>> function nfsrvd_sentcache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>> 1 | 0
>>>>
>>>> function nfsrvd_rephead(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>> 1 | 0
>>>>
>>>> function nfsrvd_updatecache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>> 1 | 0
>>>>
>>>> function nfsrvd_getcache(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>> 1 | 1
>>>> 2 | 0
>>>> 4 | 1
>>>> 8 | 0
>>>>
>>>> function nfsrvd_write(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>> 1 | 5
>>>> 2 | 4
>>>> 4 | 0
>>>>
>>>> function nfsrvd_read(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>> 1 | 19
>>>> 2 | 3
>>>> 4 | 2
>>>> 8 | 0
>>>> 16 | 1
>>>> 32 | 0
>>>> 64 | 0
>>>> 128 | 0
>>>> 256 | 1
>>>> 512 | 0
>>>>
>>>> function nfsrvd_commit(), time in ms:
>>>> value ------------- Distribution ------------- count
>>>> -1 | 0
>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>>>> 1 |@@@@@@@ 14
>>>> 2 | 0
>>>> 4 |@ 1
>>>> 8 |@ 1
>>>> 16 | 0
>>>> 32 |@@@@@@@ 14
>>>> 64 |@ 2
>>>> 128 | 0
>>>>=20
>>>>
>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>> value ------------- Distribution ------------- count
>>>> < 100 | 0
>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>> 150 | 0
>>>>
>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>> value ------------- Distribution ------------- count
>>>> 250 | 0
>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>> 350 | 0
>>>>=20
>>>>
>>>> Looks like the NFS server cache functions are quite fast, but
>>>> called extremely frequently.
>>>>
>>> Yep, they are called for every RPC.
>>>
>>> I may try coding up a patch that replaces the single mutex with
>>> one for each hash bucket, for TCP.
>>>
>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>
>>
>> Cool.
>>
>> I've readjusted the precision of the DTrace script a bit, and I can now
>> see that the following three functions take most of the time:
>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>>
>> This was recorded during an Oracle benchmark run called SLOB, which
>> caused 99% CPU load on the NFS server.
>>
> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
> (Assuming the mounts are TCP ones.)
>
> Have fun with it, rick
>

I had upped it, but probably not enough. I'm now running with
vfs.nfsd.tcphighwater set to some ridiculously large number, and
NFSRVCACHE_HASHSIZE set to 500.
So far it looks like a good improvement, as those functions no longer show
up in the DTrace script output.
I'll run some more benchmarks and testing today.

Thanks!

>>
>>>> I hope someone can find this information useful.
>>>>



