Date:      Tue, 9 Oct 2012 20:18:00 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Nikolay Denev <ndenev@gmail.com>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <1492364164.1964483.1349828280211.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>

Nikolay Denev wrote:
> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> 
> > Garrett Wollman wrote:
> >> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
> >> <rmacklem@uoguelph.ca> said:
> >>
> >>>> Simple: just use a separate mutex for each list that a cache
> >>>> entry
> >>>> is on, rather than a global lock for everything. This would
> >>>> reduce
> >>>> the mutex contention, but I'm not sure how significantly since I
> >>>> don't have the means to measure it yet.
> >>>>
> >>> Well, since the cache trimming is removing entries from the lists,
> >>> I
> >>> don't
> >>> see how that can be done with a global lock for list updates?
> >>
> >> Well, the global lock is what we have now, but the cache trimming
> >> process only looks at one list at a time, so not locking the list
> >> that
> >> isn't being iterated over probably wouldn't hurt, unless there's
> >> some
> >> mechanism (that I didn't see) for entries to move from one list to
> >> another. Note that I'm considering each hash bucket a separate
> >> "list". (One issue to worry about in that case would be cache-line
> >> contention in the array of hash buckets; perhaps
> >> NFSRVCACHE_HASHSIZE
> >> ought to be increased to reduce that.)
> >>
> > Yea, a separate mutex for each hash list might help. There is also
> > the
> > LRU list that all entries end up on, that gets used by the trimming
> > code.
> > (I think? I wrote this stuff about 8 years ago, so I haven't looked
> > at
> > it in a while.)
> >
> > Also, increasing the hash table size is probably a good idea,
> > especially
> > if you reduce how aggressively the cache is trimmed.
> >
> >>> Only doing it once/sec would result in a very large cache when
> >>> bursts of traffic arrive.
> >>
> >> My servers have 96 GB of memory so that's not a big deal for me.
> >>
> > This code was originally "production tested" on a server with
> > 1Gbyte,
> > so times have changed a bit;-)
> >
> >>> I'm not sure I see why doing it as a separate thread will improve
> >>> things.
> >>> There are N nfsd threads already (N can be bumped up to 256 if you
> >>> wish)
> >>> and having a bunch more "cache trimming threads" would just
> >>> increase
> >>> contention, wouldn't it?
> >>
> >> Only one cache-trimming thread. The cache trim holds the (global)
> >> mutex for much longer than any individual nfsd service thread has
> >> any
> >> need to, and having N threads doing that in parallel is why it's so
> >> heavily contended. If there's only one thread doing the trim, then
> >> the nfsd service threads aren't spending time contending on the
> >> mutex (it will be held less frequently and for shorter periods).
> >>
> > I think the little drc2.patch, which will keep the nfsd threads from
> > acquiring the mutex and doing the trimming most of the time, might be
> > sufficient. I still don't see why a separate trimming thread will be
> > an advantage. I'd also be worried that the one cache trimming thread
> > won't get the job done soon enough.
> >
> > When I did production testing on a 1Gbyte server that saw a peak
> > load of about 100 RPCs/sec, it was necessary to trim aggressively.
> > (Although I'd be tempted to say that a server with 1Gbyte is no
> > longer relevant, I recall someone recently trying to run FreeBSD
> > on an i486, although I doubt they wanted to run the nfsd on it.)
> >
> >>> The only negative effect I can think of w.r.t. having the nfsd
> >>> threads doing it would be a (I believe negligible) increase in RPC
> >>> response times (the time the nfsd thread spends trimming the
> >>> cache).
> >>> As noted, I think this time would be negligible compared to disk
> >>> I/O
> >>> and network transit times in the total RPC response time?
> >>
> >> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
> >> network connectivity, spinning on a contended mutex takes a
> >> significant amount of CPU time. (For the current design of the NFS
> >> server, it may actually be a win to turn off adaptive mutexes -- I
> >> should give that a try once I'm able to do more testing.)
> >>
> > Have fun with it. Let me know when you have what you think is a good
> > patch.
> >
> > rick
> >
> >> -GAWollman
> 
> My quest for IOPS over NFS continues :)
> So far I'm not able to achieve more than about 3000 8K read requests/sec
> over NFS, while the same server gives much more locally.
> And this is all from a file that is completely in the ARC, so no disk
> I/O is involved.
> 
Just out of curiosity, why do you use 8K reads instead of 64K reads?
Since the RPC overhead (including the DRC functions) is per RPC, doing
fewer, larger RPCs should usually work better. (Sometimes large rsize/wsize
values generate too large a burst of traffic for a network interface to
handle, and then the rsize/wsize has to be decreased to avoid this issue.)
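
For example, mounting on the Linux client side with something like
(just illustrative; the exact option names are in the client's nfs(5)
man page):

  mount -t nfs -o tcp,rsize=65536,wsize=65536 server:/export /mnt

lets each READ RPC move up to 64K instead of 8K, i.e. roughly 1/8 as
many RPCs (and DRC lookups) for the same amount of sequentially read
data.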

And, although this experiment seems useful for testing patches that try
to reduce DRC CPU overheads, most "real" NFS servers will be doing disk
I/O.

> I've snatched a sample DTrace script from the net: [
> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
> ]
> 
> And modified it for our new NFS server:
> 
> #!/usr/sbin/dtrace -qs
> 
> /* On entry to any nfsrvd_* function: note the time and count the call. */
> fbt:kernel:nfsrvd_*:entry
> {
>         self->ts = timestamp;
>         @counts[probefunc] = count();
> }
> 
> /* On return: elapsed time in milliseconds (timestamp is in ns). */
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         this->delta = (timestamp-self->ts)/1000000;
> }
> 
> /* Calls over 100 ms also go into a linear histogram (50 ms buckets). */
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 && this->delta > 100 /
> {
>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> }
> 
> /* Every call goes into a power-of-two histogram; then clear the mark. */
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         @dist[probefunc, "ms"] = quantize(this->delta);
>         self->ts = 0;
> }
> 
> /* At exit (^C), print the call counts and the latency histograms. */
> END
> {
>         printf("\n");
>         printa("function %-20s %@10d\n", @counts);
>         printf("\n");
>         printa("function %s(), time in %s:%@d\n", @dist);
>         printf("\n");
>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> }
> 
> And here's a sample of the output from one or two minutes during a run
> of Oracle's ORION benchmark tool from a Linux machine, against a 32G
> file on an NFS mount over 10G Ethernet:
> 
> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> ^C
> 
> function nfsrvd_access 4
> function nfsrvd_statfs 10
> function nfsrvd_getattr 14
> function nfsrvd_commit 76
> function nfsrvd_sentcache 110048
> function nfsrvd_write 110048
> function nfsrvd_read 283648
> function nfsrvd_dorpc 393800
> function nfsrvd_getcache 393800
> function nfsrvd_rephead 393800
> function nfsrvd_updatecache 393800
> 
> function nfsrvd_access(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> 1 | 0
> 
> function nfsrvd_statfs(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> 1 | 0
> 
> function nfsrvd_getattr(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> 1 | 0
> 
> function nfsrvd_sentcache(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> 1 | 0
> 
> function nfsrvd_rephead(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> 1 | 0
> 
> function nfsrvd_updatecache(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> 1 | 0
> 
> function nfsrvd_getcache(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> 1 | 1
> 2 | 0
> 4 | 1
> 8 | 0
> 
> function nfsrvd_write(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> 1 | 5
> 2 | 4
> 4 | 0
> 
> function nfsrvd_read(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> 1 | 19
> 2 | 3
> 4 | 2
> 8 | 0
> 16 | 1
> 32 | 0
> 64 | 0
> 128 | 0
> 256 | 1
> 512 | 0
> 
> function nfsrvd_commit(), time in ms:
> value ------------- Distribution ------------- count
> -1 | 0
> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
> 1 |@@@@@@@ 14
> 2 | 0
> 4 |@ 1
> 8 |@ 1
> 16 | 0
> 32 |@@@@@@@ 14
> 64 |@ 2
> 128 | 0
> 
> 
> function nfsrvd_commit(), time in ms for >= 100 ms:
> value ------------- Distribution ------------- count
> < 100 | 0
> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> 150 | 0
> 
> function nfsrvd_read(), time in ms for >= 100 ms:
> value ------------- Distribution ------------- count
> 250 | 0
> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> 350 | 0
> 
> 
> Looks like the NFS server cache functions are quite fast, but
> called extremely frequently.
> 
Yep, they are called for every RPC.
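(For scale: 393800 nfsrvd_getcache() calls over one to two minutes works
out to very roughly 3000-6500 RPCs/sec, so even a small per-call cost in
the DRC gets multiplied a lot at this load.)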

I may try coding up a patch that replaces the single mutex with
one for each hash bucket, for TCP.
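
Roughly what I have in mind, as an untested sketch (the names below are
illustrative placeholders, not necessarily what is in the actual
sys/fs/nfs sources; struct nfsrvcache and its hash list/xid fields come
from the existing DRC code):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define NFSRC_HASHSIZE  500     /* i.e. whatever NFSRVCACHE_HASHSIZE becomes */

struct nfsrc_bucket {
        struct mtx      b_mtx;          /* protects only this bucket's list */
        LIST_HEAD(, nfsrvcache) b_head; /* entries that hash to this bucket */
};

static struct nfsrc_bucket nfsrc_hashtbl[NFSRC_HASHSIZE];

static void
nfsrc_initbuckets(void)
{
        int i;

        for (i = 0; i < NFSRC_HASHSIZE; i++) {
                mtx_init(&nfsrc_hashtbl[i].b_mtx, "nfsrchash", NULL, MTX_DEF);
                LIST_INIT(&nfsrc_hashtbl[i].b_head);
        }
}

/*
 * A TCP cache lookup would then only contend on the one bucket the xid
 * hashes to, instead of on the single global mutex.
 */
static struct nfsrvcache *
nfsrc_lookup(uint32_t xid)
{
        struct nfsrc_bucket *bp;
        struct nfsrvcache *rp;

        bp = &nfsrc_hashtbl[xid % NFSRC_HASHSIZE];
        mtx_lock(&bp->b_mtx);
        LIST_FOREACH(rp, &bp->b_head, rc_hash) {
                if (rp->rc_xid == xid)
                        break;
        }
        mtx_unlock(&bp->b_mtx);
        return (rp);
}

The LRU list that the trimming code walks would still need its own lock
(or to become per-bucket as well), so this only removes the global mutex
from the hash lookups themselves.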

I'll post if/when I get this patch to a testing/review stage, rick

> I hope someone can find this information useful.
> 


