Date:      Thu, 11 Oct 2012 17:33:08 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Nikolay Denev <ndenev@gmail.com>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <314705086.2114438.1349991188290.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <E211199E-026C-41C8-AFA4-0DB5A787391D@gmail.com>

Nikolay Denev wrote:
> On Oct 11, 2012, at 7:20 PM, Nikolay Denev <ndenev@gmail.com> wrote:
> 
> > On Oct 11, 2012, at 8:46 AM, Nikolay Denev <ndenev@gmail.com> wrote:
> >
> >>
> >> On Oct 11, 2012, at 1:09 AM, Rick Macklem <rmacklem@uoguelph.ca>
> >> wrote:
> >>
> >>> Nikolay Denev wrote:
> >>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
> >>>> wrote:
> >>>>
> >>>>> Nikolay Denev wrote:
> >>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem
> >>>>>> <rmacklem@uoguelph.ca>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Garrett Wollman wrote:
> >>>>>>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
> >>>>>>>> <rmacklem@uoguelph.ca> said:
> >>>>>>>>
> >>>>>>>>>> Simple: just use a separate mutex for each list that a
> >>>>>>>>>> cache entry is on, rather than a global lock for
> >>>>>>>>>> everything. This would reduce the mutex contention, but
> >>>>>>>>>> I'm not sure how significantly, since I don't have the
> >>>>>>>>>> means to measure it yet.
> >>>>>>>>>>
> >>>>>>>>> Well, since the cache trimming is removing entries from
> >>>>>>>>> the lists, I don't see how that can be done with a global
> >>>>>>>>> lock for list updates?
> >>>>>>>>
> >>>>>>>> Well, the global lock is what we have now, but the cache
> >>>>>>>> trimming process only looks at one list at a time, so not
> >>>>>>>> locking the list that isn't being iterated over probably
> >>>>>>>> wouldn't hurt, unless there's some mechanism (that I didn't
> >>>>>>>> see) for entries to move from one list to another. Note
> >>>>>>>> that I'm considering each hash bucket a separate "list".
> >>>>>>>> (One issue to worry about in that case would be cache-line
> >>>>>>>> contention in the array of hash buckets; perhaps
> >>>>>>>> NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
> >>>>>>>>
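(For reference, NFSRVCACHE_HASHSIZE is a compile-time constant, I
believe in sys/fs/nfs/nfsrvcache.h, so raising it is a one-line
change along the lines of:

   #define NFSRVCACHE_HASHSIZE 500   /* just an example value */

followed by a kernel/module rebuild.)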
> >>>>>>> Yea, a separate mutex for each hash list might help. There
> >>>>>>> is also the LRU list that all entries end up on, that gets
> >>>>>>> used by the trimming code. (I think? I wrote this stuff
> >>>>>>> about 8 years ago, so I haven't looked at it in a while.)
> >>>>>>>
> >>>>>>> Also, increasing the hash table size is probably a good
> >>>>>>> idea, especially if you reduce how aggressively the cache is
> >>>>>>> trimmed.
> >>>>>>>
> >>>>>>>>> Only doing it once/sec would result in a very large cache
> >>>>>>>>> when bursts of traffic arrive.
> >>>>>>>>
> >>>>>>>> My servers have 96 GB of memory so that's not a big deal
> >>>>>>>> for me.
> >>>>>>>>
> >>>>>>> This code was originally "production tested" on a server
> >>>>>>> with 1Gbyte, so times have changed a bit;-)
> >>>>>>>
> >>>>>>>>> I'm not sure I see why doing it as a separate thread will
> >>>>>>>>> improve things. There are N nfsd threads already (N can
> >>>>>>>>> be bumped up to 256 if you wish) and having a bunch more
> >>>>>>>>> "cache trimming threads" would just increase contention,
> >>>>>>>>> wouldn't it?
> >>>>>>>>
> >>>>>>>> Only one cache-trimming thread. The cache trim holds the
> >>>>>>>> (global) mutex for much longer than any individual nfsd
> >>>>>>>> service thread has any need to, and having N threads doing
> >>>>>>>> that in parallel is why it's so heavily contended. If
> >>>>>>>> there's only one thread doing the trim, then the nfsd
> >>>>>>>> service threads aren't spending time contending on the
> >>>>>>>> mutex (it will be held less frequently and for shorter
> >>>>>>>> periods).
> >>>>>>>>
> >>>>>>> I think the little drc2.patch, which will keep the nfsd
> >>>>>>> threads from acquiring the mutex and doing the trimming most
> >>>>>>> of the time, might be sufficient. I still don't see why a
> >>>>>>> separate trimming thread will be an advantage. I'd also be
> >>>>>>> worried that the one cache-trimming thread won't get the job
> >>>>>>> done soon enough.
> >>>>>>>
> >>>>>>> When I did production testing on a 1Gbyte server that saw a
> >>>>>>> peak load of about 100 RPCs/sec, it was necessary to trim
> >>>>>>> aggressively. (Although I'd be tempted to say that a server
> >>>>>>> with 1Gbyte is no longer relevant, I recall someone recently
> >>>>>>> trying to run FreeBSD on an i486, although I doubt they
> >>>>>>> wanted to run the nfsd on it.)
> >>>>>>>
> >>>>>>>>> The only negative effect I can think of w.r.t. having the
> >>>>>>>>> nfsd threads doing it would be a (I believe negligible)
> >>>>>>>>> increase in RPC response times (the time the nfsd thread
> >>>>>>>>> spends trimming the cache). As noted, I think this time
> >>>>>>>>> would be negligible compared to disk I/O and network
> >>>>>>>>> transit times in the total RPC response time?
> >>>>>>>>
> >>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache,
> >>>>>>>> and 10G network connectivity, spinning on a contended mutex
> >>>>>>>> takes a significant amount of CPU time. (For the current
> >>>>>>>> design of the NFS server, it may actually be a win to turn
> >>>>>>>> off adaptive mutexes -- I should give that a try once I'm
> >>>>>>>> able to do more testing.)
> >>>>>>>>
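(If you do try disabling adaptive mutexes, I believe that's a kernel
config option, i.e. something like

   options NO_ADAPTIVE_MUTEXES

in the kernel config file, followed by a rebuild.)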
> >>>>>>> Have fun with it. Let me know when you have what you think
> >>>>>>> is a good patch.
> >>>>>>>
> >>>>>>> rick
> >>>>>>>
> >>>>>>>> -GAWollman
> >>>>>>
> >>>>>> My quest for IOPS over NFS continues :)
> >>>>>> So far I'm not able to achieve more than about 3000 8K read
> >>>>>> requests over NFS, while the server locally gives much more.
> >>>>>> And this is all from a file that is completely in the ARC
> >>>>>> cache; no disk IO involved.
> >>>>>>
> >>>>> Just out of curiosity, why do you use 8K reads instead of 64K
> >>>>> reads? Since the RPC overhead (including the DRC functions) is
> >>>>> per RPC, doing fewer, larger RPCs should usually work better.
> >>>>> (Sometimes large rsize/wsize values generate too large a burst
> >>>>> of traffic for a network interface to handle, and then the
> >>>>> rsize/wsize has to be decreased to avoid this issue.)
> >>>>>
> >>>>> And, although this experiment seems useful for testing patches
> >>>>> that try to reduce DRC CPU overheads, most "real" NFS servers
> >>>>> will be doing disk I/O.
> >>>>>
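(To put rough numbers on the per-RPC overhead: if that 3000 figure is
reads/sec, 8K reads move only about 24Mbytes/sec, while the same RPC
rate with 64K reads would move about 192Mbytes/sec, which is why
fewer, larger RPCs usually win.)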
> >>>>
> >>>> This is the default block size Oracle and probably most
> >>>> databases use. It also uses larger blocks, but for small random
> >>>> reads in OLTP applications this is what is used.
> >>>>
> >>> If the client is doing 8K reads, you could increase the
> >>> read-ahead ("readahead=N", N up to 16) to try to increase the
> >>> bandwidth. (But if the CPU is 99% busy, then I don't think it
> >>> will matter.)
> >>
> >> I'll check whether this can be set, as we are testing not only
> >> with the Linux NFS client but also with Oracle's so-called
> >> DirectNFS client that is built into the app.
> >>
> >>>
> >>>>
> >>>>>> I've snatched a sample DTrace script from the net:
> >>>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
> >>>>>>
> >>>>>> And modified it for our new NFS server:
> >>>>>>
> >>>>>> #!/usr/sbin/dtrace -qs
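> >>>>>> /*
> >>>>>>  * Count every nfsrvd_* call and record per-function latency
> >>>>>>  * distributions; calls taking >= 100 ms (this->delta > 100)
> >>>>>>  * get a separate, finer-grained histogram.
> >>>>>>  */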
> >>>>>>
> >>>>>> fbt:kernel:nfsrvd_*:entry
> >>>>>> {
> >>>>>> self->ts = timestamp;
> >>>>>> @counts[probefunc] = count();
> >>>>>> }
> >>>>>>
> >>>>>> fbt:kernel:nfsrvd_*:return
> >>>>>> / self->ts > 0 /
> >>>>>> {
> >>>>>> this->delta = (timestamp-self->ts)/1000000;
> >>>>>> }
> >>>>>>
> >>>>>> fbt:kernel:nfsrvd_*:return
> >>>>>> / self->ts > 0 && this->delta > 100 /
> >>>>>> {
> >>>>>> @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> >>>>>> }
> >>>>>>
> >>>>>> fbt:kernel:nfsrvd_*:return
> >>>>>> / self->ts > 0 /
> >>>>>> {
> >>>>>> @dist[probefunc, "ms"] = quantize(this->delta);
> >>>>>> self->ts = 0;
> >>>>>> }
> >>>>>>
> >>>>>> END
> >>>>>> {
> >>>>>> printf("\n");
> >>>>>> printa("function %-20s %@10d\n", @counts);
> >>>>>> printf("\n");
> >>>>>> printa("function %s(), time in %s:%@d\n", @dist);
> >>>>>> printf("\n");
> >>>>>> printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> >>>>>> }
> >>>>>>
> >>>>>> And here's a sample output from one or two minutes during a
> >>>>>> run of Oracle's ORION benchmark tool from a Linux machine, on
> >>>>>> a 32G file on an NFS mount over 10G Ethernet:
> >>>>>>
> >>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> >>>>>> ^C
> >>>>>>
> >>>>>> function nfsrvd_access 4
> >>>>>> function nfsrvd_statfs 10
> >>>>>> function nfsrvd_getattr 14
> >>>>>> function nfsrvd_commit 76
> >>>>>> function nfsrvd_sentcache 110048
> >>>>>> function nfsrvd_write 110048
> >>>>>> function nfsrvd_read 283648
> >>>>>> function nfsrvd_dorpc 393800
> >>>>>> function nfsrvd_getcache 393800
> >>>>>> function nfsrvd_rephead 393800
> >>>>>> function nfsrvd_updatecache 393800
> >>>>>>
> >>>>>> function nfsrvd_access(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_statfs(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_getattr(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_sentcache(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_rephead(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_updatecache(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >>>>>> 1 | 0
> >>>>>>
> >>>>>> function nfsrvd_getcache(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> >>>>>> 1 | 1
> >>>>>> 2 | 0
> >>>>>> 4 | 1
> >>>>>> 8 | 0
> >>>>>>
> >>>>>> function nfsrvd_write(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> >>>>>> 1 | 5
> >>>>>> 2 | 4
> >>>>>> 4 | 0
> >>>>>>
> >>>>>> function nfsrvd_read(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> >>>>>> 1 | 19
> >>>>>> 2 | 3
> >>>>>> 4 | 2
> >>>>>> 8 | 0
> >>>>>> 16 | 1
> >>>>>> 32 | 0
> >>>>>> 64 | 0
> >>>>>> 128 | 0
> >>>>>> 256 | 1
> >>>>>> 512 | 0
> >>>>>>
> >>>>>> function nfsrvd_commit(), time in ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> -1 | 0
> >>>>>> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
> >>>>>> 1 |@@@@@@@ 14
> >>>>>> 2 | 0
> >>>>>> 4 |@ 1
> >>>>>> 8 |@ 1
> >>>>>> 16 | 0
> >>>>>> 32 |@@@@@@@ 14
> >>>>>> 64 |@ 2
> >>>>>> 128 | 0
> >>>>>>
> >>>>>>
> >>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> < 100 | 0
> >>>>>> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>>> 150 | 0
> >>>>>>
> >>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
> >>>>>> value ------------- Distribution ------------- count
> >>>>>> 250 | 0
> >>>>>> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >>>>>> 350 | 0
> >>>>>>
> >>>>>>
> >>>>>> Looks like the NFS server cache functions are quite fast,
> >>>>>> but called extremely frequently.
> >>>>>>
> >>>>> Yep, they are called for every RPC.
> >>>>>
> >>>>> I may try coding up a patch that replaces the single mutex with
> >>>>> one for each hash bucket, for TCP.
> >>>>>
> >>>>> I'll post if/when I get this patch to a testing/review stage,
> >>>>> rick
> >>>>>
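The rough shape of what I have in mind (just a sketch, untested; the
struct layout, field names and lookup function are placeholders
rather than the actual nfsrvcache code):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define NFSRVCACHE_HASHSIZE	500	/* example value */

/* Placeholder entry; the real one is struct nfsrvcache. */
struct nfsrvcache {
	LIST_ENTRY(nfsrvcache)	rc_hash;	/* hash chain */
	uint32_t		rc_xid;		/* RPC transaction ID */
	/* ... */
};

struct nfsrc_hashbucket {
	struct mtx		mtx;	/* protects only this bucket's chain */
	LIST_HEAD(, nfsrvcache)	list;
};

static struct nfsrc_hashbucket nfsrc_hashtbl[NFSRVCACHE_HASHSIZE];

/*
 * Lookup locks a single bucket, so nfsd threads that hash to
 * different buckets no longer contend on one global mutex. (A real
 * version would hold the lock, or take a reference, for as long as
 * the entry is in use.)
 */
static struct nfsrvcache *
nfsrc_lookup(uint32_t xid)
{
	struct nfsrc_hashbucket *hbp;
	struct nfsrvcache *rp;

	hbp = &nfsrc_hashtbl[xid % NFSRVCACHE_HASHSIZE];
	mtx_lock(&hbp->mtx);
	LIST_FOREACH(rp, &hbp->list, rc_hash) {
		if (rp->rc_xid == xid)
			break;
	}
	mtx_unlock(&hbp->mtx);
	return (rp);
}

The awkward part is the LRU list the trimming code walks, since that
list spans all of the hash buckets and would still need its own lock.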
> >>>>
> >>>> Cool.
> >>>>
> >>>> I've readjusted the precision of the dtrace script a bit, and
> >>>> I can now see the following three functions taking most of the
> >>>> time: nfsrvd_getcache(), nfsrc_trimcache() and
> >>>> nfsrvd_updatecache().
> >>>>
> >>>> This was recorded during an Oracle benchmark run called SLOB,
> >>>> which caused 99% CPU load on the NFS server.
> >>>>
> >>> Even with the drc2.patch and a large value for
> >>> vfs.nfsd.tcphighwater?
> >>> (Assuming the mounts are TCP ones.)
> >>>
> >>> Have fun with it, rick
> >>>
> >>
> >> I had upped it, but probably not enough. I'm now running with
> >> vfs.nfsd.tcphighwater set to some ridiculous number, and
> >> NFSRVCACHE_HASHSIZE set to 500. So far it looks like a good
> >> improvement, as those functions no longer show up in the dtrace
> >> script output. I'll run some more benchmarks and testing today.
> >>
> >> Thanks!
> >>
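(Just a note for the archives: vfs.nfsd.tcphighwater should be
settable at runtime like any other sysctl, e.g.

   sysctl vfs.nfsd.tcphighwater=100000

with the value above being only an example, whereas
NFSRVCACHE_HASHSIZE is a compile-time constant, so changing it means
rebuilding the kernel or the NFS modules.)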
> >>>>
> >>>>>> I hope someone can find this information useful.
> >>>>>>
> >>
> >
> > I haven't had the opportunity today to run more DB tests over
> > NFS, as the DBA was busy with something else; however, I tested
> > large file transfers a bit. And I'm seeing something strange that
> > is probably related not only to NFS but also to ZFS and the ARC.
> >
> > When I first tested the drc2.patch I reported a huge bandwidth
> > improvement, but now I think that was probably because the
> > machine was freshly rebooted, not just because of the patch. The
> > patch surely improved things, especially CPU utilization combined
> > with the increased cache. But today I again have a file
> > completely cached in ZFS's ARC which, when transferred over NFS,
> > reaches about 300MB/s, whereas in some tests it reached 900+MB/s
> > (as reported in my first email). The file can be read locally at
> > about 3GB/s, as reported by dd.
> >
> > Local:
> > [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> > 30720+1 records in
> > 30720+1 records out
> > 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
> >
> > Over NFS:
> > [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> > 30720+1 records in
> > 30720+1 records out
> > 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
> >
> > The machines are almost idle during this transfer, and I can't
> > see a reason why it can't reach the full bandwidth when it's just
> > reading from RAM.
> >
> > I've tried tracing again with DTrace, using this script, to see
> > what's happening:
> >
> > fbt:kernel:nfs*:entry
> > {
> > 	/* Use self-> (thread-local), not this-> (clause-local);
> > 	   clause-local variables don't survive from the entry
> > 	   probe to the return probe. */
> > 	self->ts = timestamp;
> > 	@counts[probefunc] = count();
> > }
> >
> > fbt:kernel:nfs*:return
> > / self->ts > 0 /
> > {
> > 	@time[probefunc] = avg(timestamp - self->ts);
> > 	self->ts = 0;
> > }
> >
> > END
> > {
> > 	trunc(@counts, 10);
> > 	trunc(@time, 10);
> > 	printf("Top 10 called functions\n\n");
> > 	printa(@counts);
> > 	printf("\n\nTop 10 slowest functions\n\n");
> > 	printa(@time);
> > }
> >
> > And here's the result (several seconds during the dd test):
> >
> > Top 10 called functions
> >  nfsrc_freecache 88849
> >  nfsrc_wanted 88849
> >  nfsrv_fillattr 88849
> >  nfsrv_postopattr 88849
> >  nfsrvd_read 88849
> >  nfsrvd_rephead 88849
> >  nfsrvd_updatecache 88849
> >  nfsvno_testexp 88849
> >  nfsrc_trimcache 177697
> >  nfsvno_getattr 177698
> >
> > Top 10 slowest functions
> >  nfsd_excred 5673
> >  nfsrc_freecache 5674
> >  nfsrv_postopattr 5970
> >  nfsrv_servertimer 6327
> >  nfssvc_nfscommon 6596
> >  nfsd_fhtovp 8000
> >  nfsrvd_read 8380
> >  nfssvc_program 92752
> >  nfsvno_read 124979
> >  nfsvno_fhtovp 1789523
> >
> > I might now try to trace what nfsvno_fhtovp() is doing and where
> > it is spending its time.
> >
> > Any other ideas are welcome :)
> >
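One more note: avg() reports nanoseconds (DTrace's timestamp is in
ns), so nfsvno_fhtovp() is averaging roughly 1.8msec per call, which
does look like the place to dig. A sketch of how I'd drill into it
(the profile rate and the execname filter are just guesses at
reasonable choices):

#!/usr/sbin/dtrace -qs

fbt:kernel:nfsvno_fhtovp:entry
{
	self->ts = timestamp;
}

fbt:kernel:nfsvno_fhtovp:return
/ self->ts /
{
	/* full latency distribution, not just the average */
	@lat["nfsvno_fhtovp latency (ns)"] = quantize(timestamp - self->ts);
	self->ts = 0;
}

/* sample kernel stacks of the nfsd threads to see where time goes */
profile-997
/ execname == "nfsd" /
{
	@stacks[stack()] = count();
}

END
{
	trunc(@stacks, 5);
	printa(@lat);
	printa(@stacks);
}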
> 
> To take the network out of the equation I redid the test by mounting
> the same filesystem over NFS on the server:
> 
> [18:23]root@goliath:~# mount -t nfs -o
> rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576
Just fyi, the maximum rsize,wsize is MAXBSIZE, which is 65536
for FreeBSD currently. As I noted in the other email, I'd
suggest "rsize=65536,wsize=65536,readahead=16,...".
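For example, the mount above would become something like:

   mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=65536,wsize=65536,readahead=16 \
       localhost:/tank/spa_db/undo /mnt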

> localhost:/tank/spa_db/undo /mnt
> [18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
> [18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)
> 
> During the first run I saw several nfsd threads in top, along with
> dd, and again zero disk I/O. There was an increase in memory usage
> because of the double buffering ARC->buffercache. The second run
> had all of the nfsd threads totally idle, reading directly from the
> buffercache.


