Date:      Thu, 11 Oct 2012 19:20:53 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject:   Re: NFS server bottlenecks
Message-ID:  <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
In-Reply-To: <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
References:  <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>

On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:

> 
> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
>> Nikolay Denev wrote:
>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
>>> wrote:
>>> 
>>>> Nikolay Denev wrote:
>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
>>>>> wrote:
>>>>> 
>>>>>> Garrett Wollman wrote:
>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>>>>>> <rmacklem@uoguelph.ca> said:
>>>>>>> 
>>>>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>>>>> entry is on, rather than a global lock for everything. This
>>>>>>>>> would reduce the mutex contention, but I'm not sure how
>>>>>>>>> significantly, since I don't have the means to measure it yet.
>>>>>>>>> 
>>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>>> lists, I don't see how that can be done with a global lock for
>>>>>>>> list updates?
>>>>>>> 
>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>> process only looks at one list at a time, so not locking the list
>>>>>>> that isn't being iterated over probably wouldn't hurt, unless
>>>>>>> there's some mechanism (that I didn't see) for entries to move
>>>>>>> from one list to another. Note that I'm considering each hash
>>>>>>> bucket a separate "list". (One issue to worry about in that case
>>>>>>> would be cache-line contention in the array of hash buckets;
>>>>>>> perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
>>>>>>> 
>>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>>> the LRU list that all entries end up on, that gets used by the
>>>>>> trimming code. (I think? I wrote this stuff about 8 years ago, so
>>>>>> I haven't looked at it in a while.)
>>>>>> 
>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>> 
>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>> bursts of traffic arrive.
>>>>>>> 
>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>> 
>>>>>> This code was originally "production tested" on a server with
>>>>>> 1Gbyte, so times have changed a bit;-)
>>>>>> 
>>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>>> improve things. There are N nfsd threads already (N can be
>>>>>>>> bumped up to 256 if you wish) and having a bunch more "cache
>>>>>>>> trimming threads" would just increase contention, wouldn't it?
>>>>>>> 
>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>> mutex for much longer than any individual nfsd service thread has
>>>>>>> any need to, and having N threads doing that in parallel is why
>>>>>>> it's so heavily contended. If there's only one thread doing the
>>>>>>> trim, then the nfsd service threads aren't spending time
>>>>>>> contending on the mutex (it will be held less frequently and for
>>>>>>> shorter periods).
>>>>>>> 
>>>>>> I think the little drc2.patch, which will keep the nfsd threads
>>>>>> from acquiring the mutex and doing the trimming most of the time,
>>>>>> might be sufficient. I still don't see why a separate trimming
>>>>>> thread will be an advantage. I'd also be worried that the one
>>>>>> cache trimming thread won't get the job done soon enough.
>>>>>> 
>>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>> 
>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>>> threads doing it would be a (I believe negligible) increase in
>>>>>>>> RPC response times (the time the nfsd thread spends trimming the
>>>>>>>> cache). As noted, I think this time would be negligible compared
>>>>>>>> to disk I/O and network transit times in the total RPC response
>>>>>>>> time?
>>>>>>> 
>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
>>>>>>> 10G network connectivity, spinning on a contended mutex takes a
>>>>>>> significant amount of CPU time. (For the current design of the
>>>>>>> NFS server, it may actually be a win to turn off adaptive mutexes
>>>>>>> -- I should give that a try once I'm able to do more testing.)
>>>>>>> 
>>>>>> Have fun with it. Let me know when you have what you think is a
>>>>>> good patch.
>>>>>> 
>>>>>> rick
>>>>>> 
>>>>>>> -GAWollman
>>>>>>> _______________________________________________
>>>>>>> freebsd-hackers@freebsd.org mailing list
>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>>>> To unsubscribe, send any mail to
>>>>>>> "freebsd-hackers-unsubscribe@freebsd.org"
>>>>>> _______________________________________________
>>>>>> freebsd-fs@freebsd.org mailing list
>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>>> To unsubscribe, send any mail to
>>>>>> "freebsd-fs-unsubscribe@freebsd.org"
>>>>> 
>>>>> My quest for IOPS over NFS continues :)
>>>>> So far I'm not able to achieve more than about 3000 8K read
>>>>> requests per second over NFS, while the server locally gives much
>>>>> more. And this is all from a file that is completely in ARC cache,
>>>>> no disk IO involved.
>>>>> 
>>>> Just out of curiosity, why do you use 8K reads instead of 64K
>>>> reads? Since the RPC overhead (including the DRC functions) is per
>>>> RPC, doing fewer, larger RPCs should usually work better.
>>>> (Sometimes large rsize/wsize values generate too large a burst of
>>>> traffic for a network interface to handle, and then the rsize/wsize
>>>> has to be decreased to avoid this issue.)
>>>> 
>>>> And, although this experiment seems useful for testing patches that
>>>> try to reduce DRC CPU overheads, most "real" NFS servers will be
>>>> doing disk I/O.
>>>> 
>>> 
>>> This is the default blocksize that Oracle and probably most
>>> databases use. Oracle also uses larger blocks, but for small random
>>> reads in OLTP applications this is what is used.
>>> 
>> If the client is doing 8K reads, you could increase the read ahead
>> ("readahead=N", N up to 16) to try and increase the bandwidth.
>> (But if the CPU is 99% busy, then I don't think it will matter.)
> 
> I'll try to check if this can be set, as we are testing not only with
> the Linux NFS client, but also with Oracle's so-called DirectNFS
> client that is built into the app.
> 
>> 
>>> 
>>>>> I've snatched a sample DTrace script from the net:
>>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>>> 
>>>>> And modified it for our new NFS server:
>>>>> 
>>>>> #!/usr/sbin/dtrace -qs
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:entry
>>>>> {
>>>>> 	self->ts = timestamp;
>>>>> 	@counts[probefunc] = count();
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>> 	this->delta = (timestamp - self->ts) / 1000000;
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 && this->delta > 100 /
>>>>> {
>>>>> 	@slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>> 	@dist[probefunc, "ms"] = quantize(this->delta);
>>>>> 	self->ts = 0;
>>>>> }
>>>>> 
>>>>> END
>>>>> {
>>>>> 	printf("\n");
>>>>> 	printa("function %-20s %@10d\n", @counts);
>>>>> 	printf("\n");
>>>>> 	printa("function %s(), time in %s:%@d\n", @dist);
>>>>> 	printf("\n");
>>>>> 	printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>> }
>>>>> 
>>>>> And here's a sample output from one or two minutes during a run of
>>>>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file
>>>>> on an NFS mount over 10G ethernet:
>>>>> 
>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>>> ^C
>>>>> 
>>>>> function nfsrvd_access 4
>>>>> function nfsrvd_statfs 10
>>>>> function nfsrvd_getattr 14
>>>>> function nfsrvd_commit 76
>>>>> function nfsrvd_sentcache 110048
>>>>> function nfsrvd_write 110048
>>>>> function nfsrvd_read 283648
>>>>> function nfsrvd_dorpc 393800
>>>>> function nfsrvd_getcache 393800
>>>>> function nfsrvd_rephead 393800
>>>>> function nfsrvd_updatecache 393800
>>>>> 
>>>>> function nfsrvd_access(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_statfs(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_getattr(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_sentcache(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_rephead(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_updatecache(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>> 1 | 0
>>>>> 
>>>>> function nfsrvd_getcache(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>> 1 | 1
>>>>> 2 | 0
>>>>> 4 | 1
>>>>> 8 | 0
>>>>> 
>>>>> function nfsrvd_write(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>> 1 | 5
>>>>> 2 | 4
>>>>> 4 | 0
>>>>> 
>>>>> function nfsrvd_read(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>> 1 | 19
>>>>> 2 | 3
>>>>> 4 | 2
>>>>> 8 | 0
>>>>> 16 | 1
>>>>> 32 | 0
>>>>> 64 | 0
>>>>> 128 | 0
>>>>> 256 | 1
>>>>> 512 | 0
>>>>> 
>>>>> function nfsrvd_commit(), time in ms:
>>>>> value ------------- Distribution ------------- count
>>>>> -1 | 0
>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
>>>>> 1 |@@@@@@@ 14
>>>>> 2 | 0
>>>>> 4 |@ 1
>>>>> 8 |@ 1
>>>>> 16 | 0
>>>>> 32 |@@@@@@@ 14
>>>>> 64 |@ 2
>>>>> 128 | 0
>>>>> 
>>>>> 
>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>> value ------------- Distribution ------------- count
>>>>> < 100 | 0
>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>> 150 | 0
>>>>> 
>>>>> function nfsrvd_read(), time in ms for >=3D 100 ms:
>>>>> value ------------- Distribution ------------- count
>>>>> 250 | 0
>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>> 350 | 0
>>>>> 
>>>>> 
>>>>> Looks like the nfs server cache functions are quite fast, but
>>>>> extremely frequently called.
>>>>> 
>>>> Yep, they are called for every RPC.
>>>> 
>>>> I may try coding up a patch that replaces the single mutex with one
>>>> for each hash bucket, for TCP.
>>>> 
>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>> 
>>> 
>>> Cool.
>>> 
>>> I've readjusted the precision of the dtrace script a bit, and I can
>>> now see the following three functions as taking most of the time:
>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>>> 
>>> This was recorded during an Oracle benchmark run called SLOB, which
>>> caused 99% CPU load on the NFS server.
>>> 
>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
>> (Assuming the mounts are TCP ones.)
>> 
>> Have fun with it, rick
>> 
> 
> I had upped it, but probably not enough. I'm now running with
> vfs.nfsd.tcphighwater set to some ridiculous number, and with
> NFSRVCACHE_HASHSIZE set to 500. So far it looks like a good
> improvement, as those functions no longer show up in the dtrace
> script output. I'll run some more benchmarks and testing today.
> 
> Thanks!
> 
>>> 
>>>>> I hope someone can find this information useful.
>>>>> 
> 

I haven't had the opportunity today to run more DB tests over NFS, as
the DBA was busy with something else, but I did test large file
transfers a bit. And I'm seeing something strange that is probably not
only NFS related, but also ZFS and ARC related.

When I first tested the drc2.patch I reported a huge bandwidth
improvement, but now I think that was probably because the machine had
been freshly rebooted, rather than just the patch. The patch surely
improved things, especially CPU utilization combined with the
increased cache. But today I again have a file that is completely
cached in ZFS's ARC, yet when transferred over NFS it reaches only
about 300MB/s, whereas in some earlier tests it reached 900+MB/s (as
reported in my first email). Locally the file can be read at about
3GB/s, as reported by dd.

Local:
[17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)

Over NFS:
[17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s

The machines are almost idle during this transfer, and I can't see a
reason why it can't reach the full bandwidth when it's just reading
from RAM.

I've tried tracing again with DTrace to see what's happening, using
this script:

fbt:kernel:nfs*:entry
{
	self->ts = timestamp;
	@counts[probefunc] = count();
}

fbt:kernel:nfs*:return
/ self->ts > 0 /
{
	@time[probefunc] = avg(timestamp - self->ts);
	self->ts = 0;
}

END
{
	trunc(@counts, 10);
	trunc(@time, 10);
	printf("Top 10 called functions\n\n");
	printa(@counts);
	printf("\n\nTop 10 slowest functions\n\n");
	printa(@time);
}

And here's the result (several seconds during the dd test):

Top 10 called functions
  nfsrc_freecache                                               88849
  nfsrc_wanted                                                  88849
  nfsrv_fillattr                                                88849
  nfsrv_postopattr                                              88849
  nfsrvd_read                                                   88849
  nfsrvd_rephead                                                88849
  nfsrvd_updatecache                                            88849
  nfsvno_testexp                                                88849
  nfsrc_trimcache                                              177697
  nfsvno_getattr                                               177698

Top 10 slowest functions
  nfsd_excred                                                    5673
  nfsrc_freecache                                                5674
  nfsrv_postopattr                                               5970
  nfsrv_servertimer                                              6327
  nfssvc_nfscommon                                               6596
  nfsd_fhtovp                                                    8000
  nfsrvd_read                                                    8380
  nfssvc_program                                                92752
  nfsvno_read                                                  124979
  nfsvno_fhtovp                                               1789523

I might now try to trace what nfsvno_fhtovp() is doing and where it is
spending its time.

Any other ideas are welcome :)



