Date:      Thu, 11 Oct 2012 19:47:12 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject:   Re: NFS server bottlenecks
Message-ID:  <E211199E-026C-41C8-AFA4-0DB5A787391D@gmail.com>
In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
References:  <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>


On Oct 11, 2012, at 7:20 PM, Nikolay Denev <ndenev@gmail.com> wrote:

> On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:
>
>>
>> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>
>>> Nikolay Denev wrote:
>>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
>>>> wrote:
>>>>
>>>>> Nikolay Denev wrote:
>>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Garrett Wollman wrote:
>>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>>>> <rmacklem@uoguelph.ca> said:
>>>>>>>>
>>>>>>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>>>>>>> is on, rather than a global lock for everything. This would reduce
>>>>>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>>>>>> don't have the means to measure it yet.
>>>>>>>>>>
>>>>>>>>> Well, since the cache trimming is removing entries from the lists, I
>>>>>>>>> don't see how that can be done with a global lock for list updates?
>>>>>>>>
>>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>>> process only looks at one list at a time, so not locking the list that
>>>>>>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>>>>>>> mechanism (that I didn't see) for entries to move from one list to
>>>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>>>> "list". (One issue to worry about in that case would be cache-line
>>>>>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>>>>>>> ought to be increased to reduce that.)
>>>>>>>>
>>>>>>> Yea, a separate mutex for each hash list might help. There is also the
>>>>>>> LRU list that all entries end up on, that gets used by the trimming
>>>>>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>>>> looked at it in a while.)
>>>>>>>
>>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>>>
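As a rough illustration of the per-bucket locking being discussed: the names and layout below are made up for the example (they are not the eventual patch), but the idea is simply that two nfsd threads whose requests hash to different buckets never touch the same mutex.

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#ifndef NFSRVCACHE_HASHSIZE
#define	NFSRVCACHE_HASHSIZE	500	/* illustrative size only */
#endif

struct nfsrvcache;			/* the existing DRC entry type */

struct nfsrc_bucket {
	struct mtx		 mtx;	/* protects only this chain */
	LIST_HEAD(, nfsrvcache)	 chain;	/* entries hashing to this bucket */
};

static struct nfsrc_bucket nfsrc_table[NFSRVCACHE_HASHSIZE];

/* Map an RPC xid to its bucket; each mutex is mtx_init()'d at load time. */
static __inline struct nfsrc_bucket *
nfsrc_bucket(uint32_t xid)
{

	return (&nfsrc_table[xid % NFSRVCACHE_HASHSIZE]);
}

/*
 * An nfsd thread then locks only the bucket its request hashes to:
 *
 *	struct nfsrc_bucket *b = nfsrc_bucket(xid);
 *	mtx_lock(&b->mtx);
 *	... search/insert on b->chain ...
 *	mtx_unlock(&b->mtx);
 *
 * The separate LRU list mentioned above would still need its own lock, or
 * would have to be split per bucket as well.
 */
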
>>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>>> bursts of traffic arrive.
>>>>>>>>
>>>>>>>> My servers have 96 GB of memory, so that's not a big deal for me.
>>>>>>>>
>>>>>>> This code was originally "production tested" on a server with 1Gbyte,
>>>>>>> so times have changed a bit;-)
>>>>>>>
>>>>>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>>>>>> things. There are N nfsd threads already (N can be bumped up to 256
>>>>>>>>> if you wish) and having a bunch more "cache trimming threads" would
>>>>>>>>> just increase contention, wouldn't it?
>>>>>>>>
>>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>>> mutex for much longer than any individual nfsd service thread has any
>>>>>>>> need to, and having N threads doing that in parallel is why it's so
>>>>>>>> heavily contended. If there's only one thread doing the trim, then
>>>>>>>> the nfsd service threads aren't spending time contending on the mutex
>>>>>>>> (it will be held less frequently and for shorter periods).
>>>>>>>>
>>>>>>> I think the little drc2.patch, which will keep the nfsd threads from
>>>>>>> acquiring the mutex and doing the trimming most of the time, might be
>>>>>>> sufficient. I still don't see why a separate trimming thread will be
>>>>>>> an advantage. I'd also be worried that the one cache trimming thread
>>>>>>> won't get the job done soon enough.
>>>>>>>
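A minimal sketch of the behaviour described here, assuming a simple highwater check; this illustrates the idea only and is not the literal drc2.patch (names are borrowed from the thread, details are invented):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx	nfsrc_mtx;		/* existing global DRC mutex */
static volatile int	nfsrc_tcpcachesize;	/* current number of TCP cache entries */
static int		nfsrc_tcphighwater;	/* mirrors vfs.nfsd.tcphighwater */

static void
nfsrc_trimcache_sketch(void)
{

	/*
	 * Cheap unlocked test first: while the cache is below the highwater
	 * mark this returns immediately, so the per-RPC call never touches
	 * the global mutex, and trimming (plus the contention it causes)
	 * only happens once the cache has actually grown too large.
	 */
	if (nfsrc_tcpcachesize <= nfsrc_tcphighwater)
		return;

	mtx_lock(&nfsrc_mtx);
	/* ... walk the LRU list and free entries past their timeout ... */
	mtx_unlock(&nfsrc_mtx);
}
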
>>>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>>>
>>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>>>>>> response times (the time the nfsd thread spends trimming the cache).
>>>>>>>>> As noted, I think this time would be negligible compared to disk I/O
>>>>>>>>> and network transit times in the total RPC response time?
>>>>>>>>
>>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>>>>>> network connectivity, spinning on a contended mutex takes a
>>>>>>>> significant amount of CPU time. (For the current design of the NFS
>>>>>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>>>>>> should give that a try once I'm able to do more testing.)
>>>>>>>>
>>>>>>> Have fun with it. Let me know when you have what you think is a good
>>>>>>> patch.
>>>>>>>
>>>>>>> rick
>>>>>>>
>>>>>>>> -GAWollman
>>>>>>
>>>>>> My quest for IOPS over NFS continues :)
>>>>>> So far I'm not able to achieve more than about 3000 8K read requests
>>>>>> over NFS, while the server locally gives much more.
>>>>>> And this is all from a file that is completely in ARC cache, no disk
>>>>>> IO involved.
>>>>>>
>>>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>>>> Since the RPC overhead (including the DRC functions) is per RPC, doing
>>>>> fewer larger RPCs should usually work better. (Sometimes large
>>>>> rsize/wsize values generate too large a burst of traffic for a network
>>>>> interface to handle and then the rsize/wsize has to be decreased to
>>>>> avoid this issue.)
>>>>>
>>>>> And, although this experiment seems useful for testing patches that try
>>>>> and reduce DRC CPU overheads, most "real" NFS servers will be doing
>>>>> disk I/O.
>>>>>
>>>>
>>>> This is the default block size that Oracle and probably most databases
>>>> use. It also uses larger blocks, but for small random reads in OLTP
>>>> applications this is what is used.
>>>>
>>> If the client is doing 8K reads, you could increase the read ahead
>>> "readahead=3DN" (N up to 16), to try and increase the bandwidth.
>>> (But if the CPU is 99% busy, then I don't think it will matter.)
>>
>> I'll try to check whether this can be set, as we are testing not only with
>> the Linux NFS client, but also with Oracle's so-called DirectNFS client
>> that is built into the app.
>>
>>>
>>>>
>>>>>> I've snatched a sample DTrace script from the net:
>>>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>>>>
>>>>>> And modified it for our new NFS server:
>>>>>>
>>>>>> #!/usr/sbin/dtrace -qs
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:entry
>>>>>> {
>>>>>> self->ts = timestamp;
>>>>>> @counts[probefunc] = count();
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>> this->delta = (timestamp-self->ts)/1000000;
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 && this->delta > 100 /
>>>>>> {
>>>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>> @dist[probefunc, "ms"] = quantize(this->delta);
>>>>>> self->ts = 0;
>>>>>> }
>>>>>>
>>>>>> END
>>>>>> {
>>>>>> printf("\n");
>>>>>> printa("function %-20s %@10d\n", @counts);
>>>>>> printf("\n");
>>>>>> printa("function %s(), time in %s:%@d\n", @dist);
>>>>>> printf("\n");
>>>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>>> }
>>>>>>
>>>>>> And here's a sample output from one or two minutes during the run of
>>>>>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on an
>>>>>> NFS mount over 10G ethernet:
>>>>>>
>>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>>>> ^C
>>>>>>
>>>>>> function nfsrvd_access 4
>>>>>> function nfsrvd_statfs 10
>>>>>> function nfsrvd_getattr 14
>>>>>> function nfsrvd_commit 76
>>>>>> function nfsrvd_sentcache 110048
>>>>>> function nfsrvd_write 110048
>>>>>> function nfsrvd_read 283648
>>>>>> function nfsrvd_dorpc 393800
>>>>>> function nfsrvd_getcache 393800
>>>>>> function nfsrvd_rephead 393800
>>>>>> function nfsrvd_updatecache 393800
>>>>>>
>>>>>> function nfsrvd_access(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_statfs(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_getattr(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_sentcache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_rephead(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_updatecache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>> 1 | 0
>>>>>>
>>>>>> function nfsrvd_getcache(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>>> 1 | 1
>>>>>> 2 | 0
>>>>>> 4 | 1
>>>>>> 8 | 0
>>>>>>
>>>>>> function nfsrvd_write(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>>> 1 | 5
>>>>>> 2 | 4
>>>>>> 4 | 0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>>> 1 | 19
>>>>>> 2 | 3
>>>>>> 4 | 2
>>>>>> 8 | 0
>>>>>> 16 | 1
>>>>>> 32 | 0
>>>>>> 64 | 0
>>>>>> 128 | 0
>>>>>> 256 | 1
>>>>>> 512 | 0
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> -1 | 0
>>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>>>>>> 1 |@@@@@@@ 14
>>>>>> 2 | 0
>>>>>> 4 |@ 1
>>>>>> 8 |@ 1
>>>>>> 16 | 0
>>>>>> 32 |@@@@@@@ 14
>>>>>> 64 |@ 2
>>>>>> 128 | 0
>>>>>>
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> < 100 | 0
>>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>> 150 | 0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>>> value ------------- Distribution ------------- count
>>>>>> 250 | 0
>>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>> 350 | 0
>>>>>>
>>>>>>
>>>>>> Looks like the nfs server cache functions are quite fast, but
>>>>>> extremely frequently called.
>>>>>>
>>>>> Yep, they are called for every RPC.
>>>>>
>>>>> I may try coding up a patch that replaces the single mutex with
>>>>> one for each hash bucket, for TCP.
>>>>>
>>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>>>
>>>>
>>>> Cool.
>>>>
>>>> I've readjusted the precision of the dtrace script a bit, and I can now
>>>> see the following three functions as taking most of the time:
>>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache()
>>>>
>>>> This was recorded during an Oracle benchmark run called SLOB, which
>>>> caused 99% CPU load on the NFS server.
>>>>
>>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
>>> (Assuming the mounts are TCP ones.)
>>>
>>> Have fun with it, rick
>>>
>>
>> I had upped it, but probably not enough. I'm now running with
>> vfs.nfsd.tcphighwater set to some ridiculous number, and
>> NFSRVCACHE_HASHSIZE set to 500.
>> So far it looks like a good improvement, as those functions no longer
>> show up in the dtrace script output.
>> I'll run some more benchmarks and testing today.
>>
>> Thanks!
>>
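For anyone wanting to reproduce this tuning: vfs.nfsd.tcphighwater is a run-time sysctl added by the patch, while NFSRVCACHE_HASHSIZE is a compile-time constant, so using the 500 mentioned above means editing the header and rebuilding the kernel or the nfsd module, roughly:

/*
 * sys/fs/nfs/nfsrvcache.h (path from memory) -- the stock value is much
 * smaller; 500 is the size used in the test above.
 */
#define	NFSRVCACHE_HASHSIZE	500
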
>>>>
>>>>>> I hope someone can find this information useful.
>>>>>>
>>
>
> I haven't had the opportunity today to run more DB tests over NFS, as the
> DBA was busy with something else; however, I tested the large file
> transfers a bit. And I'm seeing something strange, probably not only NFS
> but also ZFS and ARC related.
>
> When I first tested the drc2.patch I reported a huge bandwidth improvement,
> but now I think that was probably because the machine was freshly rebooted,
> not just because of the patch.
> The patch surely improved things, especially CPU utilization combined with
> the increased cache.
> But today I'm again having a file completely cached in ZFS's ARC cache,
> which when transferred over NFS reaches about 300MB/s, whereas in some
> tests it reached 900+MB/s (as reported in my first email).
> The file locally can be read at about 3GB/s as reported by dd.
>
> Local:
> [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
>
> Over NFS:
> [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
>
> The machines are almost idle during this transfer and I can't see a reason
> why it can't reach the full bandwidth when it's just reading it from RAM.
>
> I've tried again tracing with DTrace to see what's happening, with this
> script:
>
> fbt:kernel:nfs*:entry
> {
> 	self->ts = timestamp;
> 	@counts[probefunc] = count();
> }
>
> fbt:kernel:nfs*:return
> / self->ts > 0 /
> {
> 	@time[probefunc] = avg(timestamp - self->ts);
> 	self->ts = 0;
> }
>
> END
> {
> 	trunc(@counts, 10);
> 	trunc(@time, 10);
> 	printf("Top 10 called functions\n\n");
> 	printa(@counts);
> 	printf("\n\nTop 10 slowest functions\n\n");
> 	printa(@time);
> }
>
> And here's the result (several seconds during the dd test):
>
> Top 10 called functions
>  nfsrc_freecache                                               88849
>  nfsrc_wanted                                                  88849
>  nfsrv_fillattr                                                88849
>  nfsrv_postopattr                                              88849
>  nfsrvd_read                                                   88849
>  nfsrvd_rephead                                                88849
>  nfsrvd_updatecache                                            88849
>  nfsvno_testexp                                                88849
>  nfsrc_trimcache                                              177697
>  nfsvno_getattr                                               177698
>
> Top 10 slowest functions
>  nfsd_excred                                                    5673
>  nfsrc_freecache                                                5674
>  nfsrv_postopattr                                               5970
>  nfsrv_servertimer                                              6327
>  nfssvc_nfscommon                                               6596
>  nfsd_fhtovp                                                    8000
>  nfsrvd_read                                                    8380
>  nfssvc_program                                                92752
>  nfsvno_read                                                  124979
>  nfsvno_fhtovp                                               1789523
>
> I might now try to trace what nfsvno_fhtovp() is doing and where it is
> spending its time.
>
> Any other ideas are welcome :)
>

To take the network out of the equation, I redid the test by mounting the
same filesystem over NFS on the server:

[18:23]root@goliath:~# mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt
[18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
[18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)

During the first run I saw several nfsd threads in top, along with dd, and
again zero disk I/O.
There was an increase in memory usage because of the double buffering
ARC->buffercache.
The second run was with all of the nfsd threads totally idle, and the data
was read directly from the buffercache.




