Date:      Thu, 11 Oct 2012 17:04:28 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Nikolay Denev <ndenev@gmail.com>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <1517976814.2112914.1349989468096.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>

Nikolay Denev wrote:
> On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:
> 
> >
> > On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca>
> > wrote:
> >
> >> Nikolay Denev wrote:
> >>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
> >>> wrote:
> >>>
> >>>> Nikolay Denev wrote:
> >>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
> >>>>> wrote:
> >>>>>
> >>>>>> Garrett Wollman wrote:
> >>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
> >>>>>>> <rmacklem@uoguelph.ca> said:
> >>>>>>>
> >>>>>>>>> Simple: just use a separate mutex for each list that a cache
> >>>>>>>>> entry
> >>>>>>>>> is on, rather than a global lock for everything. This would
> >>>>>>>>> reduce
> >>>>>>>>> the mutex contention, but I'm not sure how significantly
> >>>>>>>>> since
> >>>>>>>>> I
> >>>>>>>>> don't have the means to measure it yet.
> >>>>>>>>>
> >>>>>>>> Well, since the cache trimming is removing entries from the
> >>>>>>>> lists,
> >>>>>>>> I
> >>>>>>>> don't
> >>>>>>>> see how that can be done with a global lock for list updates?
> >>>>>>>
> >>>>>>> Well, the global lock is what we have now, but the cache
> >>>>>>> trimming
> >>>>>>> process only looks at one list at a time, so not locking the
> >>>>>>> list
> >>>>>>> that
> >>>>>>> isn't being iterated over probably wouldn't hurt, unless
> >>>>>>> there's
> >>>>>>> some
> >>>>>>> mechanism (that I didn't see) for entries to move from one
> >>>>>>> list
> >>>>>>> to
> >>>>>>> another. Note that I'm considering each hash bucket a separate
> >>>>>>> "list". (One issue to worry about in that case would be
> >>>>>>> cache-line
> >>>>>>> contention in the array of hash buckets; perhaps
> >>>>>>> NFSRVCACHE_HASHSIZE
> >>>>>>> ought to be increased to reduce that.)
> >>>>>>>
> >>>>>> Yea, a separate mutex for each hash list might help. There is
> >>>>>> also
> >>>>>> the
> >>>>>> LRU list that all entries end up on, that gets used by the
> >>>>>> trimming
> >>>>>> code.
> >>>>>> (I think? I wrote this stuff about 8 years ago, so I haven't
> >>>>>> looked
> >>>>>> at
> >>>>>> it in a while.)
> >>>>>>
> >>>>>> Also, increasing the hash table size is probably a good idea,
> >>>>>> especially
> >>>>>> if you reduce how aggressively the cache is trimmed.
> >>>>>>
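[To make the per-hash-list locking idea above concrete, here is a rough userspace sketch with pthread mutexes standing in for kernel mtx(9) locks. All of the names (drc_bucket, drc_lookup, cache_entry) are made up for illustration and are not the actual sys/fs/nfsserver identifiers:]

```c
/*
 * Sketch: one mutex per hash bucket instead of one global DRC mutex.
 * Userspace pthread mutexes stand in for kernel mtx(9); all names
 * here are illustrative, not the real sys/fs/nfsserver code.
 */
#include <pthread.h>
#include <stddef.h>

#define NFSRVCACHE_HASHSIZE 500

struct cache_entry {
	struct cache_entry *next;	/* hash chain link */
	unsigned int xid;		/* RPC transaction id */
};

struct drc_bucket {
	pthread_mutex_t mtx;		/* protects only this bucket's chain */
	struct cache_entry *head;
};

/*
 * Static zero-initialization yields valid default pthread mutexes on
 * common platforms; kernel code would use mtx_init() instead.
 */
static struct drc_bucket buckets[NFSRVCACHE_HASHSIZE];

/*
 * A lookup locks only the bucket its xid hashes to, so nfsd threads
 * that hash to different buckets never contend with each other, and
 * the trimming code can walk one bucket at a time without stalling
 * every other thread.
 */
struct cache_entry *
drc_lookup(unsigned int xid)
{
	struct drc_bucket *b = &buckets[xid % NFSRVCACHE_HASHSIZE];
	struct cache_entry *e;

	pthread_mutex_lock(&b->mtx);
	for (e = b->head; e != NULL; e = e->next)
		if (e->xid == xid)
			break;
	pthread_mutex_unlock(&b->mtx);
	return (e);
}
```

[The remaining complication, as noted above, is the global LRU list used by the trimming code, which still needs its own lock or a per-bucket redesign.]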
> >>>>>>>> Only doing it once/sec would result in a very large cache
> >>>>>>>> when
> >>>>>>>> bursts of
> >>>>>>>> traffic arrive.
> >>>>>>>
> >>>>>>> My servers have 96 GB of memory so that's not a big deal for
> >>>>>>> me.
> >>>>>>>
> >>>>>> This code was originally "production tested" on a server with
> >>>>>> 1Gbyte,
> >>>>>> so times have changed a bit;-)
> >>>>>>
> >>>>>>>> I'm not sure I see why doing it as a separate thread will
> >>>>>>>> improve
> >>>>>>>> things.
> >>>>>>>> There are N nfsd threads already (N can be bumped up to 256
> >>>>>>>> if
> >>>>>>>> you
> >>>>>>>> wish)
> >>>>>>>> and having a bunch more "cache trimming threads" would just
> >>>>>>>> increase
> >>>>>>>> contention, wouldn't it?
> >>>>>>>
> >>>>>>> Only one cache-trimming thread. The cache trim holds the
> >>>>>>> (global)
> >>>>>>> mutex for much longer than any individual nfsd service thread
> >>>>>>> has
> >>>>>>> any
> >>>>>>> need to, and having N threads doing that in parallel is why
> >>>>>>> it's
> >>>>>>> so
> >>>>>>> heavily contended. If there's only one thread doing the trim,
> >>>>>>> then
> >>>>>>> the nfsd service threads aren't spending time either
> >>>>>>> contending
> >>>>>>> on
> >>>>>>> the
> >>>>>>> mutex (it will be held less frequently and for shorter
> >>>>>>> periods).
> >>>>>>>
> >>>>>> I think the little drc2.patch which will keep the nfsd threads
> >>>>>> from
> >>>>>> acquiring the mutex and doing the trimming most of the time,
> >>>>>> might
> >>>>>> be
> >>>>>> sufficient. I still don't see why a separate trimming thread
> >>>>>> will
> >>>>>> be
> >>>>>> an advantage. I'd also be worried that the one cache trimming
> >>>>>> thread
> >>>>>> won't get the job done soon enough.
> >>>>>>
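[A hypothetical sketch of the kind of check drc2.patch adds, as described above: each nfsd thread does a cheap unlocked highwater comparison and only takes the expensive trim path (mutex acquisition and all) when the cache has actually grown too big. The names and the default value are illustrative, not the actual patch:]

```c
/*
 * Sketch of the "skip the trim most of the time" idea: a cheap
 * unlocked comparison guards the expensive trim path.  Names and
 * the default are illustrative, not the real drc2.patch code.
 */
static int nfsrc_tcpcachesize;			/* current DRC entry count */
static int nfsrc_tcphighwater = 100000;		/* vfs.nfsd.tcphighwater */

int
nfsrc_should_trim(void)
{
	/*
	 * Racy unlocked read: a slightly stale count only delays (or
	 * triggers) one extra trim, which is harmless, and it keeps
	 * the common per-RPC path from touching the cache mutex at
	 * all.
	 */
	return (nfsrc_tcpcachesize > nfsrc_tcphighwater);
}
```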
> >>>>>> When I did production testing on a 1Gbyte server that saw a
> >>>>>> peak
> >>>>>> load of about 100RPCs/sec, it was necessary to trim
> >>>>>> aggressively.
> >>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
> >>>>>> longer relevant, I recall someone recently trying to run
> >>>>>> FreeBSD
> >>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
> >>>>>>
> >>>>>>>> The only negative effect I can think of w.r.t. having the
> >>>>>>>> nfsd
> >>>>>>>> threads doing it would be a (I believe negligible) increase
> >>>>>>>> in
> >>>>>>>> RPC
> >>>>>>>> response times (the time the nfsd thread spends trimming the
> >>>>>>>> cache).
> >>>>>>>> As noted, I think this time would be negligible compared to
> >>>>>>>> disk
> >>>>>>>> I/O
> >>>>>>>> and network transit times in the total RPC response time?
> >>>>>>>
> >>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
> >>>>>>> 10G
> >>>>>>> network connectivity, spinning on a contended mutex takes a
> >>>>>>> significant amount of CPU time. (For the current design of the
> >>>>>>> NFS
> >>>>>>> server, it may actually be a win to turn off adaptive mutexes
> >>>>>>> --
> >>>>>>> I
> >>>>>>> should give that a try once I'm able to do more testing.)
> >>>>>>>
> >>>>>> Have fun with it. Let me know when you have what you think is a
> >>>>>> good
> >>>>>> patch.
> >>>>>>
> >>>>>> rick
> >>>>>>
> >>>>>>> -GAWollman
> >>>>>>> _______________________________________________
> >>>>>>> freebsd-hackers@freebsd.org mailing list
> >>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> >>>>>>> To unsubscribe, send any mail to
> >>>>>>> "freebsd-hackers-unsubscribe@freebsd.org"
> >>>>>> _______________________________________________
> >>>>>> freebsd-fs@freebsd.org mailing list
> >>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >>>>>> To unsubscribe, send any mail to
> >>>>>> "freebsd-fs-unsubscribe@freebsd.org"
> >>>>>
> >>>>> My quest for IOPS over NFS continues :)
> >>>>> So far I'm not able to achieve more than about 3000 8K read
> >>>>> requests
> >>>>> over NFS,
> >>>>> while the server locally gives much more.
> >>>>> And this is all from a file that is completely in ARC cache, no
> >>>>> disk
> >>>>> IO involved.
> >>>>>
> >>>> Just out of curiosity, why do you use 8K reads instead of 64K
> >>>> reads?
> >>>> Since the RPC overhead (including the DRC functions) is per RPC,
> >>>> doing
> >>>> fewer larger RPCs should usually work better. (Sometimes large
> >>>> rsize/wsize
> >>>> values generate too large a burst of traffic for a network
> >>>> interface
> >>>> to
> >>>> handle and then the rsize/wsize has to be decreased to avoid this
> >>>> issue.)
> >>>>
> >>>> And, although this experiment seems useful for testing patches
> >>>> that
> >>>> try
> >>>> and reduce DRC CPU overheads, most "real" NFS servers will be
> >>>> doing
> >>>> disk
> >>>> I/O.
> >>>>
> >>>
> >>> This is the default blocksize that Oracle and probably most
> >>> databases
> >>> use.
> >>> It also uses larger blocks, but for small random reads in OLTP
> >>> applications this is what is used.
> >>>
> >> If the client is doing 8K reads, you could increase the read ahead
> >> "readahead=N" (N up to 16), to try and increase the bandwidth.
> >> (But if the CPU is 99% busy, then I don't think it will matter.)
> >
> > I'll try to check whether this can be set, as we are testing
> > not only with the Linux NFS client,
> > but also with Oracle's so-called DirectNFS client that
> > is built in to the app.
> >
> >>
> >>>
> >>>>> I've snatched some sample DTrace script from the net : [
> >>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
> >>>>> ]
> >>>>>
> >>>>> And modified it for our new NFS server :
> >>>>>
> >>>>> #!/usr/sbin/dtrace -qs
> >>>>>
> >>>>> fbt:kernel:nfsrvd_*:entry
> >>>>> {
> >>>>> self->ts = timestamp;
> >>>>> @counts[probefunc] = count();
> >>>>> }
> >>>>>
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 /
> >>>>> {
> >>>>> this->delta = (timestamp-self->ts)/1000000;
> >>>>> }
> >>>>>
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 && this->delta > 100 /
> >>>>> {
> >>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> >>>>> }
> >>>>>
> >>>>> fbt:kernel:nfsrvd_*:return
> >>>>> / self->ts > 0 /
> >>>>> {
> >>>>> @dist[probefunc, "ms"] = quantize(this->delta);
> >>>>> self->ts = 0;
> >>>>> }
> >>>>>
> >>>>> END
> >>>>> {
> >>>>> printf("\n");
> >>>>> printa("function %-20s %@10d\n", @counts);
> >>>>> printf("\n");
> >>>>> printa("function %s(), time in %s:%@d\n", @dist);
> >>>>> printf("\n");
> >>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> >>>>> }
> >>>>>
> >>>>> And here's a sample output from one or two minutes during the
> >>>>> run
> >>>>> of
> >>>>> Oracle's ORION benchmark
> >>>>> tool from a Linux machine, on a 32G file on NFS mount over 10G
> >>>>> ethernet:
> >>>>>
> >>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> >>>>> ^C
> >>>>>
> >>>>> function nfsrvd_access 4
> >>>>> function nfsrvd_statfs 10
> >>>>> function nfsrvd_getattr 14
> >>>>> function nfsrvd_commit 76
> >>>>> function nfsrvd_sentcache 110048
> >>>>> function nfsrvd_write 110048
> >>>>> function nfsrvd_read 283648
> >>>>> function nfsrvd_dorpc 393800
> >>>>> function nfsrvd_getcache 393800
> >>>>> function nfsrvd_rephead 393800
> >>>>> function nfsrvd_updatecache 393800
> >>>>>
> >>>>> function nfsrvd_access(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_statfs(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_getattr(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_sentcache(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_rephead(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_updatecache(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>> 1 | 0
> >>>>>
> >>>>> function nfsrvd_getcache(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> >>>>> 1 | 1
> >>>>> 2 | 0
> >>>>> 4 | 1
> >>>>> 8 | 0
> >>>>>
> >>>>> function nfsrvd_write(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> >>>>> 1 | 5
> >>>>> 2 | 4
> >>>>> 4 | 0
> >>>>>
> >>>>> function nfsrvd_read(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> >>>>> 1 | 19
> >>>>> 2 | 3
> >>>>> 4 | 2
> >>>>> 8 | 0
> >>>>> 16 | 1
> >>>>> 32 | 0
> >>>>> 64 | 0
> >>>>> 128 | 0
> >>>>> 256 | 1
> >>>>> 512 | 0
> >>>>>
> >>>>> function nfsrvd_commit(), time in ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> -1 | 0
> >>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
> >>>>> 1 |@@@@@@@ 14
> >>>>> 2 | 0
> >>>>> 4 |@ 1
> >>>>> 8 |@ 1
> >>>>> 16 | 0
> >>>>> 32 |@@@@@@@ 14
> >>>>> 64 |@ 2
> >>>>> 128 | 0
> >>>>>
> >>>>>
> >>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> < 100 | 0
> >>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>> 150 | 0
> >>>>>
> >>>>> function nfsrvd_read(), time in ms for >= 100 ms:
> >>>>> value ------------- Distribution ------------- count
> >>>>> 250 | 0
> >>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>> 350 | 0
> >>>>>
> >>>>>
> >>>>> Looks like the nfs server cache functions are quite fast, but
> >>>>> extremely frequently called.
> >>>>>
> >>>> Yep, they are called for every RPC.
> >>>>
> >>>> I may try coding up a patch that replaces the single mutex with
> >>>> one for each hash bucket, for TCP.
> >>>>
> >>>> I'll post if/when I get this patch to a testing/review stage,
> >>>> rick
> >>>>
> >>>
> >>> Cool.
> >>>
> >>> I've readjusted the precision of the dtrace script a bit, and I
> >>> can now see
> >>> the following three functions taking most of the time:
> >>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache()
> >>>
> >>> This was recorded during an Oracle benchmark run called SLOB, which
> >>> caused 99% cpu load on the NFS server.
> >>>
> >> Even with the drc2.patch and a large value for
> >> vfs.nfsd.tcphighwater?
> >> (Assuming the mounts are TCP ones.)
> >>
> >> Have fun with it, rick
> >>
> >
> > I had upped it, but probably not enough. I'm now running with
> > vfs.nfsd.tcphighwater set
> > to some ridiculous number, and NFSRVCACHE_HASHSIZE set to 500.
> > So far it looks like good improvement as those functions no longer
> > show up in the dtrace script output.
> > I'll run some more benchmarks and testing today.
> >
> > Thanks!
> >
> >>>
> >>>>> I hope someone can find this information useful.
> >>>>>
> >>>>> _______________________________________________
> >>>>> freebsd-hackers@freebsd.org mailing list
> >>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> >>>>> To unsubscribe, send any mail to
> >>>>> "freebsd-hackers-unsubscribe@freebsd.org"
> >
> 
> I haven't had the opportunity today to run more DB tests over NFS as
> the DBA was busy with something else,
> however I tested a bit the large file transfers.
> And I'm seeing something strange probably not only NFS but also ZFS
> and ARC related.
> 
> When I first tested the drc2.patch I reported a huge bandwidth
> improvement,
> but now I think that was probably because the machine had been freshly
> rebooted, and not just because of the patch.
> The patch surely improved things, especially CPU utilization combined
> with the increased cache.
> But today I again have a file completely cached in ZFS's ARC,
> which when transferred over NFS
> reaches only about 300MB/s, whereas in some tests it reached 900+MB/s (as
> reported in my first email).
> The file locally can be read at about 3GB/s as reported by dd.
> 
> Local:
> [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null
> bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
> 
> Over NFS:
> [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
> 
> The machines are almost idle during this transfer and I can't see a
> reason why it can't reach the full bandwidth when it's
> just reading it from RAM.
> 
Well, one thing is that you must fill the network pipe with bits. That
means that "rsize * readahead" must be >= "data rate * transit delay"
(both in same units, such as bytes/sec).
For example (given a 10Gb/s network):
65536 * 16 = 1Mbyte
- a 10Gbps rate is about 1200Mbytes/sec
--> this should fill the network "pipe" so long as the RPC transit
  delay is less than 1/1200th of a sec (or somewhat less than 1msec)

If you were using rsize=8192 and the default readahead=2:
8192 * 2 = 16Kbytes
- a 10Gbps rate is about 1200000Kbytes/sec
--> the network pipe would be full if the RPC transit delay is
  less than 16/1200000 (or a little over 10usec, unlikely for any
  real network, I think?)

I don't know what your RPC transit delay is (half of a NULL RPC RTT, but
I don't know that, either;-).
That was why I suggested "readahead=16". I'd also use "rsize=65536", but if
that worked very poorly, I'd try decreasing rsize by half (32768, 16384..)
to see what was optimal.

rick
ps: I'm assuming your dd runs are from a FreeBSD client and not the Oracle one.
pss: Did I actually get the above calculations correct without a calculator;-)

> I've tried again tracing with DTrace to see what's happening with this
> script :
> 
> fbt:kernel:nfs*:entry
> {
> self->ts = timestamp;
> @counts[probefunc] = count();
> }
> 
> fbt:kernel:nfs*:return
> / self->ts > 0 /
> {
> @time[probefunc] = avg(timestamp - self->ts);
> self->ts = 0;
> }
> 
> END
> {
> trunc(@counts, 10);
> trunc(@time, 10);
> printf("Top 10 called functions\n\n");
> printa(@counts);
> printf("\n\nTop 10 slowest functions\n\n");
> printa(@time);
> }
> 
> And here's the result (several seconds during the dd test):
> 
> Top 10 called functions
> nfsrc_freecache 88849
> nfsrc_wanted 88849
> nfsrv_fillattr 88849
> nfsrv_postopattr 88849
> nfsrvd_read 88849
> nfsrvd_rephead 88849
> nfsrvd_updatecache 88849
> nfsvno_testexp 88849
> nfsrc_trimcache 177697
> nfsvno_getattr 177698
> 
> Top 10 slowest functions
> nfsd_excred 5673
> nfsrc_freecache 5674
> nfsrv_postopattr 5970
> nfsrv_servertimer 6327
> nfssvc_nfscommon 6596
> nfsd_fhtovp 8000
> nfsrvd_read 8380
> nfssvc_program 92752
> nfsvno_read 124979
> nfsvno_fhtovp 1789523
> 
> I might try now to trace what nfsvno_fhtovp() is doing and where it
> is spending its time.
> 
> Any other ideas are welcome :)
> 
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe@freebsd.org"


