Date:      Wed, 10 Oct 2012 18:09:07 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Nikolay Denev <ndenev@gmail.com>
Cc:        rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>, freebsd-hackers@freebsd.org
Subject:   Re: NFS server bottlenecks
Message-ID:  <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <B2CD757D-25D8-4353-8487-B3583EEC57FC@gmail.com>

Nikolay Denev wrote:
> On Oct 10, 2012, at 3:18 AM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
> 
> > Nikolay Denev wrote:
> >> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
> >> wrote:
> >>
> >>> Garrett Wollman wrote:
> >>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
> >>>> <rmacklem@uoguelph.ca> said:
> >>>>
> >>>>>> Simple: just use a separate mutex for each list that a cache
> >>>>>> entry
> >>>>>> is on, rather than a global lock for everything. This would
> >>>>>> reduce
> >>>>>> the mutex contention, but I'm not sure how significantly since
> >>>>>> I
> >>>>>> don't have the means to measure it yet.
> >>>>>>
> >>>>> Well, since the cache trimming is removing entries from the
> >>>>> lists,
> >>>>> I
> >>>>> don't
> >>>>> see how that can be done with a global lock for list updates?
> >>>>
> >>>> Well, the global lock is what we have now, but the cache trimming
> >>>> process only looks at one list at a time, so not locking the list
> >>>> that
> >>>> isn't being iterated over probably wouldn't hurt, unless there's
> >>>> some
> >>>> mechanism (that I didn't see) for entries to move from one list
> >>>> to
> >>>> another. Note that I'm considering each hash bucket a separate
> >>>> "list". (One issue to worry about in that case would be
> >>>> cache-line
> >>>> contention in the array of hash buckets; perhaps
> >>>> NFSRVCACHE_HASHSIZE
> >>>> ought to be increased to reduce that.)
> >>>>
> >>> Yea, a separate mutex for each hash list might help. There is also
> >>> the LRU list that all entries end up on, which gets used by the
> >>> trimming code.
> >>> (I think? I wrote this stuff about 8 years ago, so I haven't looked
> >>> at it in a while.)
> >>>
> >>> Also, increasing the hash table size is probably a good idea,
> >>> especially
> >>> if you reduce how aggressively the cache is trimmed.
> >>>
> >>>>> Only doing it once/sec would result in a very large cache when
> >>>>> bursts of traffic arrive.
> >>>>
> >>>> My servers have 96 GB of memory so that's not a big deal for me.
> >>>>
> >>> This code was originally "production tested" on a server with
> >>> 1Gbyte,
> >>> so times have changed a bit;-)
> >>>
> >>>>> I'm not sure I see why doing it as a separate thread will
> >>>>> improve
> >>>>> things.
> >>>>> There are N nfsd threads already (N can be bumped up to 256 if
> >>>>> you
> >>>>> wish)
> >>>>> and having a bunch more "cache trimming threads" would just
> >>>>> increase
> >>>>> contention, wouldn't it?
> >>>>
> >>>> Only one cache-trimming thread. The cache trim holds the (global)
> >>>> mutex for much longer than any individual nfsd service thread has
> >>>> any
> >>>> need to, and having N threads doing that in parallel is why it's
> >>>> so
> >>>> heavily contended. If there's only one thread doing the trim, then
> >>>> the nfsd service threads aren't spending time contending on the
> >>>> mutex (it will be held less frequently and for shorter periods).
> >>>>
> >>> I think the little drc2.patch, which will keep the nfsd threads from
> >>> acquiring the mutex and doing the trimming most of the time, might be
> >>> sufficient. I still don't see why a separate trimming thread will
> >>> be
> >>> an advantage. I'd also be worried that the one cache trimming
> >>> thread
> >>> won't get the job done soon enough.
> >>>
> >>> When I did production testing on a 1Gbyte server that saw a peak
> >>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> >>> (Although I'd be tempted to say that a server with 1Gbyte is no
> >>> longer relevant, I recall someone recently trying to run FreeBSD
> >>> on an i486, although I doubt they wanted to run the nfsd on it.)
> >>>
> >>>>> The only negative effect I can think of w.r.t. having the nfsd
> >>>>> threads doing it would be a (I believe negligible) increase in
> >>>>> RPC
> >>>>> response times (the time the nfsd thread spends trimming the
> >>>>> cache).
> >>>>> As noted, I think this time would be negligible compared to disk
> >>>>> I/O
> >>>>> and network transit times in the total RPC response time?
> >>>>
> >>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
> >>>> 10G
> >>>> network connectivity, spinning on a contended mutex takes a
> >>>> significant amount of CPU time. (For the current design of the
> >>>> NFS
> >>>> server, it may actually be a win to turn off adaptive mutexes --
> >>>> I
> >>>> should give that a try once I'm able to do more testing.)
> >>>>
> >>> Have fun with it. Let me know when you have what you think is a
> >>> good
> >>> patch.
> >>>
> >>> rick
> >>>
> >>>> -GAWollman
> >>>> _______________________________________________
> >>>> freebsd-hackers@freebsd.org mailing list
> >>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> >>>> To unsubscribe, send any mail to
> >>>> "freebsd-hackers-unsubscribe@freebsd.org"
> >>> _______________________________________________
> >>> freebsd-fs@freebsd.org mailing list
> >>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >>> To unsubscribe, send any mail to
> >>> "freebsd-fs-unsubscribe@freebsd.org"
> >>
> >> My quest for IOPS over NFS continues :)
> >> So far I'm not able to achieve more than about 3000 8K read requests
> >> over NFS, while the server gives much more locally.
> >> And this is all from a file that is completely in the ARC cache, with
> >> no disk I/O involved.
> >>
> > Just out of curiosity, why do you use 8K reads instead of 64K
> > reads?
> > Since the RPC overhead (including the DRC functions) is per RPC,
> > doing
> > fewer larger RPCs should usually work better. (Sometimes large
> > rsize/wsize
> > values generate too large a burst of traffic for a network interface
> > to
> > handle and then the rsize/wsize has to be decreased to avoid this
> > issue.)
> >
> > And, although this experiment seems useful for testing patches that
> > try
> > and reduce DRC CPU overheads, most "real" NFS servers will be doing
> > disk
> > I/O.
> >
> 
> This is the default block size that Oracle and probably most databases
> use.
> It also uses larger blocks, but for small random reads in OLTP
> applications this is what gets used.
> 
If the client is doing 8K reads, you could increase the read-ahead
("readahead=N", with N up to 16) to try to increase the bandwidth.
(But if the CPU is 99% busy, then I don't think it will matter.)

> 
> >> I've snatched a sample DTrace script from the net: [
> >> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
> >> ]
> >>
> >> And modified it for our new NFS server:
> >>
> >> #!/usr/sbin/dtrace -qs
> >>
> >> fbt:kernel:nfsrvd_*:entry
> >> {
> >>         self->ts = timestamp;
> >>         @counts[probefunc] = count();
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 /
> >> {
> >>         this->delta = (timestamp - self->ts) / 1000000;
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 && this->delta > 100 /
> >> {
> >>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> >> }
> >>
> >> fbt:kernel:nfsrvd_*:return
> >> / self->ts > 0 /
> >> {
> >>         @dist[probefunc, "ms"] = quantize(this->delta);
> >>         self->ts = 0;
> >> }
> >>
> >> END
> >> {
> >>         printf("\n");
> >>         printa("function %-20s %@10d\n", @counts);
> >>         printf("\n");
> >>         printa("function %s(), time in %s:%@d\n", @dist);
> >>         printf("\n");
> >>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> >> }
> >>
> >> And here's a sample output from one or two minutes during a run of
> >> Oracle's ORION benchmark tool from a Linux machine, on a 32G file
> >> on an NFS mount over 10G Ethernet:
> >>
> >> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> >> ^C
> >>
> >> function nfsrvd_access 4
> >> function nfsrvd_statfs 10
> >> function nfsrvd_getattr 14
> >> function nfsrvd_commit 76
> >> function nfsrvd_sentcache 110048
> >> function nfsrvd_write 110048
> >> function nfsrvd_read 283648
> >> function nfsrvd_dorpc 393800
> >> function nfsrvd_getcache 393800
> >> function nfsrvd_rephead 393800
> >> function nfsrvd_updatecache 393800
> >>
> >> function nfsrvd_access(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
> >> 1 | 0
> >>
> >> function nfsrvd_statfs(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
> >> 1 | 0
> >>
> >> function nfsrvd_getattr(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
> >> 1 | 0
> >>
> >> function nfsrvd_sentcache(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
> >> 1 | 0
> >>
> >> function nfsrvd_rephead(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >> 1 | 0
> >>
> >> function nfsrvd_updatecache(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
> >> 1 | 0
> >>
> >> function nfsrvd_getcache(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
> >> 1 | 1
> >> 2 | 0
> >> 4 | 1
> >> 8 | 0
> >>
> >> function nfsrvd_write(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
> >> 1 | 5
> >> 2 | 4
> >> 4 | 0
> >>
> >> function nfsrvd_read(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
> >> 1 | 19
> >> 2 | 3
> >> 4 | 2
> >> 8 | 0
> >> 16 | 1
> >> 32 | 0
> >> 64 | 0
> >> 128 | 0
> >> 256 | 1
> >> 512 | 0
> >>
> >> function nfsrvd_commit(), time in ms:
> >> value ------------- Distribution ------------- count
> >> -1 | 0
> >> 0 |@@@@@@@@@@@@@@@@@@@@@@@ 44
> >> 1 |@@@@@@@ 14
> >> 2 | 0
> >> 4 |@ 1
> >> 8 |@ 1
> >> 16 | 0
> >> 32 |@@@@@@@ 14
> >> 64 |@ 2
> >> 128 | 0
> >>
> >>
> >> function nfsrvd_commit(), time in ms for >= 100 ms:
> >> value ------------- Distribution ------------- count
> >> < 100 | 0
> >> 100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >> 150 | 0
> >>
> >> function nfsrvd_read(), time in ms for >= 100 ms:
> >> value ------------- Distribution ------------- count
> >> 250 | 0
> >> 300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
> >> 350 | 0
> >>
> >>
> >> Looks like the NFS server cache functions are quite fast, but
> >> called extremely frequently.
> >>
> > Yep, they are called for every RPC.
> >
> > I may try coding up a patch that replaces the single mutex with
> > one for each hash bucket, for TCP.
> >
> > I'll post if/when I get this patch to a testing/review stage, rick
> >
> 
> Cool.
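
To give you a rough idea of what I have in mind there, it would look
something like the following (just an untested sketch, not the actual
patch; the struct/field names are only illustrative, and the real cache
entry matches on more than just the xid):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

/*
 * One mutex per hash bucket instead of the single global DRC mutex.
 * Each bucket's mutex protects only that bucket's list of entries.
 * (struct nfsrvcache is the existing cache entry; rc_hash and rc_xid
 * stand in for whatever fields it really uses.)
 */
struct nfsrc_hashbucket {
        struct mtx              hb_mtx;
        LIST_HEAD(, nfsrvcache) hb_list;
};

static struct nfsrc_hashbucket nfsrc_tcphash[NFSRVCACHE_HASHSIZE];

static void
nfsrc_inithash(void)
{
        int i;

        for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
                mtx_init(&nfsrc_tcphash[i].hb_mtx, "nfsrchash", NULL,
                    MTX_DEF);
                LIST_INIT(&nfsrc_tcphash[i].hb_list);
        }
}

/*
 * A lookup (or insert) then only holds the mutex for its own bucket,
 * so two nfsd threads only contend when their RPCs happen to hash to
 * the same bucket.
 */
static struct nfsrvcache *
nfsrc_tcplookup(struct nfsrc_hashbucket *hbp, uint32_t xid)
{
        struct nfsrvcache *rp;

        mtx_assert(&hbp->hb_mtx, MA_OWNED);
        LIST_FOREACH(rp, &hbp->hb_list, rc_hash) {
                if (rp->rc_xid == xid)
                        return (rp);
        }
        return (NULL);
}

The awkward part is the LRU list used by the trimming code, since
entries from all the buckets end up on it, so it would either need its
own mutex or have to be split up per-bucket as well.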
> 
> I've readjusted the precision of the dtrace script a bit, and I can now
> see the following three functions taking most of the time:
> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>
> This was recorded during an Oracle benchmark run called SLOB, which
> caused 99% CPU load on the NFS server.
> 
Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
(Assuming the mounts are TCP ones.)
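
Just so it's clear what that patch is supposed to buy you: the idea is
basically a cheap, unlocked check so the nfsd threads skip the trimming
(and the mutex acquisition that goes with it) unless the cache has
actually grown past the highwater mark. Roughly like the following
sketch; it is not the patch itself, and the variable and macro names
are only meant to be suggestive of what the real code uses:

static void
nfsrc_trimcache_maybe(void)
{

        /*
         * Unlocked test; a slightly stale count doesn't matter here, it
         * just means the trim starts a little early or a little late.
         */
        if (nfsrc_tcpsavedreplies <= nfsrc_tcphighwater)
                return;

        /* Only now take the mutex and do the actual trim. */
        NFSLOCKCACHE();
        /* ... walk the LRU list and free the stale entries ... */
        NFSUNLOCKCACHE();
}

With a large vfs.nfsd.tcphighwater, most RPCs should take the early
return and never go near the mutex for trimming at all.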

Have fun with it, rick

> 
> >> I hope someone can find this information useful.
> >>
> >> _______________________________________________
> >> freebsd-hackers@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> >> To unsubscribe, send any mail to
> >> "freebsd-hackers-unsubscribe@freebsd.org"


