Date: Thu, 11 Oct 2012 19:20:53 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: NFS server bottlenecks
Message-ID: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
In-Reply-To: <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>
 <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:

> 
> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
> 
>> Nikolay Denev wrote:
>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>> 
>>>> Nikolay Denev wrote:
>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>>>> 
>>>>>> Garrett Wollman wrote:
>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem <rmacklem@uoguelph.ca> said:
>>>>>>> 
>>>>>>>>> Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly since I don't have the means to measure it yet.
>>>>>>>>> 
>>>>>>>> Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?
>>>>>>> 
>>>>>>> Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate "list". (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
>>>>>>> 
>>>>>> Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, that gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.)
>>>>>> 
>>>>>> Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed.
>>>>>> 
>>>>>>>> Only doing it once/sec would result in a very large cache when bursts of traffic arrive.
>>>>>>> 
>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>> 
>>>>>> This code was originally "production tested" on a server with 1Gbyte, so times have changed a bit;-)
>>>>>> 
>>>>>>>> I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more "cache trimming threads" would just increase contention, wouldn't it?
>>>>>>> 
>>>>>>> Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time either contending on the mutex (it will be held less frequently and for shorter periods).
>>>>>>> 
>>>>>> I think the little drc2.patch, which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient.
>>>>>> I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough.
>>>>>> 
>>>>>> When I did production testing on a 1Gbyte server that saw a peak load of about 100RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recently recall someone trying to run FreeBSD on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>> 
>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?
>>>>>>> 
>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)
>>>>>>> 
>>>>>> Have fun with it. Let me know when you have what you think is a good patch.
>>>>>> 
>>>>>> rick
>>>>>> 
>>>>>>> -GAWollman
>>>>> 
>>>>> My quest for IOPS over NFS continues :)
>>>>> So far I'm not able to achieve more than about 3000 8K read requests over NFS, while the server locally gives much more.
>>>>> And this is all from a file that is completely in ARC cache, no disk IO involved.
>>>>> 
>>>> Just out of curiosity, why do you use 8K reads instead of 64K reads? Since the RPC overhead (including the DRC functions) is per RPC, doing fewer larger RPCs should usually work better. (Sometimes large rsize/wsize values generate too large a burst of traffic for a network interface to handle and then the rsize/wsize has to be decreased to avoid this issue.)
>>>> 
>>>> And, although this experiment seems useful for testing patches that try and reduce DRC CPU overheads, most "real" NFS servers will be doing disk I/O.
>>>> 
>>> This is the default block size that Oracle and probably most databases use. They also use larger blocks, but for small random reads in OLTP applications this is what is used.
>>> 
>> If the client is doing 8K reads, you could increase the read ahead ("readahead=N", N up to 16) to try and increase the bandwidth. (But if the CPU is 99% busy, then I don't think it will matter.)
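(Going back to Garrett's point about not yet having a way to measure the mutex contention: something like the DTrace sketch below might show which kernel mutexes the nfsd threads actually spin or block on. It's only a rough idea, not something I've run -- it assumes a kernel built with KDTRACE_HOOKS so the lockstat provider is available, that its adaptive-spin/adaptive-block probes carry the struct mtx pointer in arg0 as on Solaris, and that the server threads show up with execname "nfsd". If the DRC mutex is the one being fought over, its lo_name should float to the top of both aggregations.)

#!/usr/sbin/dtrace -qs

/*
 * Count spin and block events on adaptive mutexes, per lock name,
 * but only for the nfsd threads.
 */

lockstat:::adaptive-spin
/execname == "nfsd"/
{
        /* arg0 is the struct mtx *; key the aggregation by its lo_name. */
        @spins[stringof(((struct mtx *)arg0)->lock_object.lo_name)] = count();
}

lockstat:::adaptive-block
/execname == "nfsd"/
{
        @blocks[stringof(((struct mtx *)arg0)->lock_object.lo_name)] = count();
}

END
{
        printf("mutexes the nfsd threads spun on:\n");
        printa("%-32s %@10d\n", @spins);
        printf("\nmutexes the nfsd threads blocked on:\n");
        printa("%-32s %@10d\n", @blocks);
}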
> 
> I'll try to check whether this can be set, as we are testing not only with the Linux NFS client, but also with Oracle's so-called DirectNFS client that is built into the app.
> 
>> 
>>> 
>>>>> I've snatched some sample DTrace script from the net: [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>>> 
>>>>> And modified it for our new NFS server:
>>>>> 
>>>>> #!/usr/sbin/dtrace -qs
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:entry
>>>>> {
>>>>>         self->ts = timestamp;
>>>>>         @counts[probefunc] = count();
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>>         this->delta = (timestamp - self->ts) / 1000000;
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 && this->delta > 100 /
>>>>> {
>>>>>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>>         @dist[probefunc, "ms"] = quantize(this->delta);
>>>>>         self->ts = 0;
>>>>> }
>>>>> 
>>>>> END
>>>>> {
>>>>>         printf("\n");
>>>>>         printa("function %-20s %@10d\n", @counts);
>>>>>         printf("\n");
>>>>>         printa("function %s(), time in %s:%@d\n", @dist);
>>>>>         printf("\n");
>>>>>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>> }
>>>>> 
>>>>> And here's a sample output from one or two minutes during the run of Oracle's ORION benchmark tool from a Linux machine, on a 32G file on an NFS mount over 10G ethernet:
>>>>> 
>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>>> ^C
>>>>> 
>>>>> function nfsrvd_access                    4
>>>>> function nfsrvd_statfs                   10
>>>>> function nfsrvd_getattr                  14
>>>>> function nfsrvd_commit                   76
>>>>> function nfsrvd_sentcache            110048
>>>>> function nfsrvd_write                110048
>>>>> function nfsrvd_read                 283648
>>>>> function nfsrvd_dorpc                393800
>>>>> function nfsrvd_getcache             393800
>>>>> function nfsrvd_rephead              393800
>>>>> function nfsrvd_updatecache          393800
>>>>> 
>>>>> function nfsrvd_access(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_statfs(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_getattr(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_sentcache(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_rephead(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_updatecache(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>              1 |                                         0
>>>>> 
>>>>> function nfsrvd_getcache(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>>              1 |                                         1
>>>>>              2 |                                         0
>>>>>              4 |                                         1
>>>>>              8 |                                         0
>>>>> 
>>>>> function nfsrvd_write(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>>              1 |                                         5
>>>>>              2 |                                         4
>>>>>              4 |                                         0
>>>>> 
>>>>> function nfsrvd_read(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>>              1 |                                         19
>>>>>              2 |                                         3
>>>>>              4 |                                         2
>>>>>              8 |                                         0
>>>>>             16 |                                         1
>>>>>             32 |                                         0
>>>>>             64 |                                         0
>>>>>            128 |                                         0
>>>>>            256 |                                         1
>>>>>            512 |                                         0
>>>>> 
>>>>> function nfsrvd_commit(), time in ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>             -1 |                                         0
>>>>>              0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>>>>>              1 |@@@@@@@                                  14
>>>>>              2 |                                         0
>>>>>              4 |@                                        1
>>>>>              8 |@                                        1
>>>>>             16 |                                         0
>>>>>             32 |@@@@@@@                                  14
>>>>>             64 |@                                        2
>>>>>            128 |                                         0
>>>>> 
>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>          < 100 |                                         0
>>>>>            100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>            150 |                                         0
>>>>> 
>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>>          value  ------------- Distribution ------------- count
>>>>>            250 |                                         0
>>>>>            300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>            350 |                                         0
>>>>> 
>>>>> Looks like the nfs server cache functions are quite fast, but extremely frequently called.
>>>>> 
>>>> Yep, they are called for every RPC.
>>>> 
>>>> I may try coding up a patch that replaces the single mutex with one for each hash bucket, for TCP.
>>>> 
>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>> 
>>> Cool.
>>> 
>>> I've readjusted the precision of the dtrace script a bit, and I can now see the following three functions as taking most of the time: nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>>> 
>>> This was recorded during an Oracle benchmark run called SLOB, which caused 99% CPU load on the NFS server.
>>> 
>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater? (Assuming the mounts are TCP ones.)
>> 
>> Have fun with it, rick
>> 
> I had upped it, but probably not enough. I'm now running with vfs.nfsd.tcphighwater set to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500.
> So far it looks like a good improvement, as those functions no longer show up in the dtrace script output.
> I'll run some more benchmarks and testing today.
> 
> Thanks!
> 
>>> 
>>>>> I hope someone can find this information useful.
>>>>> 
> 

I haven't had the opportunity today to run more DB tests over NFS, as the DBA was busy with something else; however, I tested the large file transfers a bit.

And I'm seeing something strange, probably related not only to NFS but also to ZFS and the ARC.
When I first tested the drc2.patch I reported a huge bandwidth improvement, but now I think that was probably because the machine had been freshly rebooted, and not just because of the patch. The patch surely improved things, especially CPU utilization combined with the increased cache.
But today I again have a file that is completely cached in ZFS's ARC, and transferring it over NFS reaches about 300MB/s, whereas in some earlier tests it reached 900+MB/s (as reported in my first email).
The file can be read locally at about 3GB/s, as reported by dd.
Local:

[17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)

Over NFS:

[17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s

The machines are almost idle during this transfer, and I can't see a reason why it can't reach the full bandwidth when it's just reading from RAM.

I've tried tracing with DTrace again to see what's happening, using this script:

fbt:kernel:nfs*:entry
{
        /* Thread-local, keyed by function, so nested nfs* calls don't clobber each other. */
        self->ts[probefunc] = timestamp;
        @counts[probefunc] = count();
}

fbt:kernel:nfs*:return
/ self->ts[probefunc] > 0 /
{
        @time[probefunc] = avg(timestamp - self->ts[probefunc]);
        self->ts[probefunc] = 0;
}

END
{
        trunc(@counts, 10);
        trunc(@time, 10);
        printf("Top 10 called functions\n\n");
        printa(@counts);
        printf("\n\nTop 10 slowest functions\n\n");
        printa(@time);
}

And here's the result (several seconds during the dd test; the times are average nanoseconds per call):

Top 10 called functions

  nfsrc_freecache                                       88849
  nfsrc_wanted                                          88849
  nfsrv_fillattr                                        88849
  nfsrv_postopattr                                      88849
  nfsrvd_read                                           88849
  nfsrvd_rephead                                        88849
  nfsrvd_updatecache                                    88849
  nfsvno_testexp                                        88849
  nfsrc_trimcache                                      177697
  nfsvno_getattr                                       177698

Top 10 slowest functions

  nfsd_excred                                            5673
  nfsrc_freecache                                        5674
  nfsrv_postopattr                                       5970
  nfsrv_servertimer                                      6327
  nfssvc_nfscommon                                       6596
  nfsd_fhtovp                                            8000
  nfsrvd_read                                            8380
  nfssvc_program                                        92752
  nfsvno_read                                          124979
  nfsvno_fhtovp                                       1789523

I might now try to trace what nfsvno_fhtovp() is doing and where it is spending its time; a rough script for that is sketched below.

Any other ideas are welcome :)
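Here's roughly what I have in mind for that next step. It's only a sketch -- profile-997 is just an arbitrary sampling rate, and apart from nfsvno_fhtovp() itself nothing NFS-specific is assumed, only the standard fbt and profile providers. It quantizes the wall-clock time of each nfsvno_fhtovp() call and samples kernel stacks whenever a thread is caught inside it, which should show what sits underneath the function when it is slow.

#!/usr/sbin/dtrace -qs

fbt:kernel:nfsvno_fhtovp:entry
{
        /* Remember when this thread entered the function. */
        self->fhts = timestamp;
}

fbt:kernel:nfsvno_fhtovp:return
/ self->fhts /
{
        /* Per-call wall-clock time, in microseconds. */
        @fhtime["nfsvno_fhtovp(), us"] = quantize((timestamp - self->fhts) / 1000);
        self->fhts = 0;
}

profile-997
/ self->fhts /
{
        /* Sample the kernel stack while the thread is inside nfsvno_fhtovp(). */
        @stacks[stack()] = count();
}

END
{
        trunc(@stacks, 5);
        printa(@fhtime);
        printa(@stacks);
}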