Date:      Sat, 6 Oct 2012 18:32:56 -0400 (EDT)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Nikolay Denev <ndenev@gmail.com>
Cc:        freebsd-fs@freebsd.org, rmacklem@freebsd.org, hackers@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject:   Re: NFS server bottlenecks
Message-ID:  <895825217.1831774.1349562776418.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <3E7BCFB4-6EE6-48F5-ACA7-A615F3CE5BAC@gmail.com>

Nikolay Denev wrote:
> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca>
> wrote:
>
> > Garrett Wollman wrote:
> >> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
> >> <rmacklem@uoguelph.ca> said:
> >>
> >>>> Simple: just use a separate mutex for each list that a cache
> >>>> entry is on, rather than a global lock for everything. This
> >>>> would reduce the mutex contention, but I'm not sure how
> >>>> significantly since I don't have the means to measure it yet.
> >>>>
> >>> Well, since the cache trimming is removing entries from the
> >>> lists, I don't see how that can be done with a global lock for
> >>> list updates?
> >>
> >> Well, the global lock is what we have now, but the cache trimming
> >> process only looks at one list at a time, so not locking the list
> >> that isn't being iterated over probably wouldn't hurt, unless
> >> there's some mechanism (that I didn't see) for entries to move
> >> from one list to another. Note that I'm considering each hash
> >> bucket a separate "list". (One issue to worry about in that case
> >> would be cache-line contention in the array of hash buckets;
> >> perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
> >>
> > Yea, a separate mutex for each hash list might help. There is also
> > the LRU list that all entries end up on, that gets used by the
> > trimming code. (I think? I wrote this stuff about 8 years ago, so
> > I haven't looked at it in a while.)
> >
> > Also, increasing the hash table size is probably a good idea,
> > especially if you reduce how aggressively the cache is trimmed.
> >
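
To make that concrete, the layout I have in mind for per-bucket locking
would look something like this. It is only a sketch; the names and the
table size are made up for illustration and are not the actual ones in
the kernel RPC cache code:

    /*
     * One mutex per hash bucket instead of one global mutex, so nfsd
     * threads that hash to different buckets never contend.  Padding
     * each bucket out to a cache line addresses the cache-line
     * contention Garrett mentions above.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    #define NFSRC_HASHSIZE  500     /* made up; larger than the stock table */

    struct nfsrvcache;              /* a cache entry, as in the real code */

    struct nfsrc_bucket {
            struct mtx      lck;    /* protects only this bucket's list */
            LIST_HEAD(, nfsrvcache) head;
    } __aligned(CACHE_LINE_SIZE);   /* avoid false sharing between buckets */

    static struct nfsrc_bucket nfsrc_table[NFSRC_HASHSIZE];

    static void
    nfsrc_initcache(void)
    {
            int i;

            for (i = 0; i < NFSRC_HASHSIZE; i++) {
                    mtx_init(&nfsrc_table[i].lck, "nfsrcb", NULL, MTX_DEF);
                    LIST_INIT(&nfsrc_table[i].head);
            }
    }

A lookup would then only take the mutex for the bucket the request
hashes to, and the trim code could walk the buckets one at a time
instead of stalling every nfsd thread at once.
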
> >>> Only doing it once/sec would result in a very large cache when
> >>> bursts of traffic arrive.
> >>
> >> My servers have 96 GB of memory so that's not a big deal for me.
> >>
> > This code was originally "production tested" on a server with
> > 1Gbyte, so times have changed a bit;-)
> >
> > I'm not sure I see why doing it as a separate thread will improve
> > things. There are N nfsd threads already (N can be bumped up to
> > 256 if you wish) and having a bunch more "cache trimming threads"
> > would just increase contention, wouldn't it?
> >>
> >> Only one cache-trimming thread. The cache trim holds the (global)
> >> mutex for much longer than any individual nfsd service thread has
> >> any need to, and having N threads doing that in parallel is why
> >> it's so heavily contended. If there's only one thread doing the
> >> trim, then the nfsd service threads aren't spending time
> >> contending on the mutex (it will be held less frequently and for
> >> shorter periods).
> >>
> > I think the little drc2.patch, which will keep the nfsd threads
> > from acquiring the mutex and doing the trimming most of the time,
> > might be sufficient. I still don't see why a separate trimming
> > thread will be an advantage. I'd also be worried that the one
> > cache trimming thread won't get the job done soon enough.
> >
> > When I did production testing on a 1Gbyte server that saw a peak
> > load of about 100 RPCs/sec, it was necessary to trim aggressively.
> > (Although I'd be tempted to say that a server with 1Gbyte is no
> > longer relevant, I recall someone recently trying to run FreeBSD
> > on an i486, although I doubt they wanted to run the nfsd on it.)
> >
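
For anyone who hasn't seen drc2.patch, the basic idea is roughly the
following (a sketch with made-up names and thresholds, not the literal
patch):

    /*
     * Each nfsd thread peeks at the cache size without holding the
     * mutex and only locks and trims when the count is above a
     * highwater mark.  A stale read here is harmless; it just means
     * the trim happens one RPC later.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx nfsrc_mutex;          /* the existing global DRC mutex */
    static volatile int nfsrc_count;        /* cached entries; updated under the mutex */
    static int nfsrc_highwater = 4096;      /* made-up trim trigger */
    static int nfsrc_lowwater = 3072;       /* made-up post-trim target */

    void nfsrc_evict_lru(void);             /* frees one LRU entry, drops nfsrc_count */

    static void
    nfsrc_maybe_trim(void)
    {

            /* Unlocked peek: in the common case no lock is taken. */
            if (nfsrc_count <= nfsrc_highwater)
                    return;
            mtx_lock(&nfsrc_mutex);
            while (nfsrc_count > nfsrc_lowwater)
                    nfsrc_evict_lru();
            mtx_unlock(&nfsrc_mutex);
    }

That keeps the mutex out of the fast path for most RPCs while still
letting any nfsd thread do the trim when the cache actually grows.
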
> >>> The only negative effect I can think of w.r.t. having the nfsd
> >>> threads doing it would be a (I believe negligible) increase in
> >>> RPC response times (the time the nfsd thread spends trimming
> >>> the cache). As noted, I think this time would be negligible
> >>> compared to disk I/O and network transit times in the total RPC
> >>> response time?
> >>
> >> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
> >> network connectivity, spinning on a contended mutex takes a
> >> significant amount of CPU time. (For the current design of the NFS
> >> server, it may actually be a win to turn off adaptive mutexes -- I
> >> should give that a try once I'm able to do more testing.)
> >>
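
If you do experiment with that, I believe adaptive spinning can be
turned off globally by building a kernel with:

    options NO_ADAPTIVE_MUTEXES

in the kernel config. As far as I know there is no per-mutex knob for
it, so it would affect every mutex in the kernel, not just the DRC one.
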
> > Have fun with it. Let me know when you have what you think is a good
> > patch.
> >
> > rick
> >
> >> -GAWollman
>
> I was doing some NFS testing with a RELENG_9 machine and
> a Linux RHEL machine over a 10G network, and noticed the same nfsd
> threads issue.
>
> Previously I would read a 32G file locally on the FreeBSD ZFS/NFS
> server with "dd if=/tank/32G.bin of=/dev/null bs=1M" to cache it
> completely in ARC (the machine has 196G RAM);
> then if I do this again locally I get close to 4GB/sec read -
> completely from the cache...
>
> But if I try to read the file over NFS from the Linux machine I
> would only get about 100MB/sec, sometimes a bit more, and all of
> the nfsd threads are clearly visible in top. pmcstat also showed
> the same mutex contention as in the original post.
>=20
> I've now applied the drc2 patch, and rerunning the same test yields
> about 960MB/s transfer over NFS… quite an improvement!
>
Sounds good. Hopefully Garrett can test it too, and then it sounds like
it can be committed.

Someday I'll look at using separate mutexes for each of the hash buckets,
which should reduce contention for the mutex for TCP. For UDP, there is
one LRU list that all entries are on, so UDP is probably stuck using one
mutex for now. Since this would be a more involved and risky patch, I
think committing drc2.patch first and then doing this later would make
sense.
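
In rough terms, the trimming side of that would look something like
this (again only a sketch with made-up names, building on the
per-bucket table sketched earlier in this message):

    /*
     * TCP entries trim bucket by bucket, each under its own mutex,
     * while all the UDP entries stay on one LRU list under a single
     * mutex.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    #define NFSRC_HASHSIZE  500             /* as in the earlier sketch */

    struct nfsrc_bucket {
            struct mtx      lck;            /* per-bucket lock */
            /* ...this bucket's entry list, as sketched earlier... */
    };

    extern struct nfsrc_bucket nfsrc_table[NFSRC_HASHSIZE];
    extern struct mtx nfsrc_udpmtx;         /* protects the single UDP LRU list */

    void nfsrc_trim_bucket(struct nfsrc_bucket *); /* evict old entries in one bucket */
    void nfsrc_trim_udplru(void);           /* evict old entries off the UDP LRU */

    static void
    nfsrc_trimcache(void)
    {
            int i;

            /*
             * TCP: walk the buckets one lock at a time, so only the
             * nfsd threads hashing to the same bucket ever wait.
             */
            for (i = 0; i < NFSRC_HASHSIZE; i++) {
                    mtx_lock(&nfsrc_table[i].lck);
                    nfsrc_trim_bucket(&nfsrc_table[i]);
                    mtx_unlock(&nfsrc_table[i].lck);
            }

            /* UDP: one LRU list, so one mutex, for now. */
            mtx_lock(&nfsrc_udpmtx);
            nfsrc_trim_udplru();
            mtx_unlock(&nfsrc_udpmtx);
    }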

Thanks for testing it, rick



