Date:      Sat, 6 Oct 2012 14:20:11 +0300
From:      Nikolay Denev <ndenev@gmail.com>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        freebsd-fs@freebsd.org, rmacklem@freebsd.org, hackers@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject:   Re: NFS server bottlenecks
Message-ID:  <3E7BCFB4-6EE6-48F5-ACA7-A615F3CE5BAC@gmail.com>
In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>
References:  <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>

On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Garrett Wollman wrote:
>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>> <rmacklem@uoguelph.ca> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't
>>> see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
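
[For illustration only - a minimal userland sketch of the per-bucket locking
scheme discussed above, using pthreads and sys/queue.h rather than the kernel
primitives. The names (drc_bucket, drc_lookup, the DRC_HASHSIZE value) are
invented for the example and this is not the actual sys/fs/nfsserver code.
Each bucket has its own mutex, the global LRU walked by the trimmer has a
separate lock, and the buckets are padded to a cache line to address the
hash-array contention mentioned above.]

#include <pthread.h>
#include <stdint.h>
#include <sys/queue.h>

#define DRC_HASHSIZE    500     /* cf. NFSRVCACHE_HASHSIZE; bigger spreads contention */

struct drc_entry {
        LIST_ENTRY(drc_entry)  de_hash;   /* per-bucket chain */
        TAILQ_ENTRY(drc_entry) de_lru;    /* global LRU, used by the trimmer */
        uint32_t               de_xid;
        /* ... cached reply, timestamps, etc. ... */
};

struct drc_bucket {
        pthread_mutex_t        db_lock;   /* protects db_head only */
        LIST_HEAD(, drc_entry) db_head;
} __attribute__((aligned(64)));           /* one cache line per bucket */

static struct drc_bucket drc_hash[DRC_HASHSIZE];
static pthread_mutex_t drc_lru_lock = PTHREAD_MUTEX_INITIALIZER;
static TAILQ_HEAD(, drc_entry) drc_lru = TAILQ_HEAD_INITIALIZER(drc_lru);

/* Called once at startup. */
void
drc_init(void)
{
        for (int i = 0; i < DRC_HASHSIZE; i++) {
                pthread_mutex_init(&drc_hash[i].db_lock, NULL);
                LIST_INIT(&drc_hash[i].db_head);
        }
}

/* Look up an entry by xid, touching one bucket lock plus the LRU lock. */
struct drc_entry *
drc_lookup(uint32_t xid)
{
        struct drc_bucket *b = &drc_hash[xid % DRC_HASHSIZE];
        struct drc_entry *e;

        pthread_mutex_lock(&b->db_lock);
        LIST_FOREACH(e, &b->db_head, de_hash) {
                if (e->de_xid == xid) {
                        /* Move to the LRU tail under its own lock. */
                        pthread_mutex_lock(&drc_lru_lock);
                        TAILQ_REMOVE(&drc_lru, e, de_lru);
                        TAILQ_INSERT_TAIL(&drc_lru, e, de_lru);
                        pthread_mutex_unlock(&drc_lru_lock);
                        break;
                }
        }
        pthread_mutex_unlock(&b->db_lock);
        return (e);
}

The lock order here is bucket lock before LRU lock; a trimmer walking the LRU
would have to drop the LRU lock (or use a try-lock) before taking a bucket
lock, to avoid deadlocking against lookups.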
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of
>>> traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things.
>>> There are N nfsd threads already (N can be bumped up to 256 if you
>>> wish)
>>> and having a bunch more "cache trimming threads" would just increase
>>> contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
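
[For illustration only - a rough userland sketch of the single trimming
thread proposed above, with invented names and pthreads instead of kernel
primitives: the nfsd threads just poke a condition variable when the cache
looks too big, and one dedicated thread does all the list walking, so the
service threads never hold the cache lock for a long trim pass.]

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t  trim_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   trim_cv = PTHREAD_COND_INITIALIZER;
static bool             trim_wanted;

static void drc_trim(void);             /* the actual LRU/bucket trimming */

/* Cheap call from the nfsd RPC path when the cache exceeds its limit. */
void
drc_kick_trimmer(void)
{
        pthread_mutex_lock(&trim_lock);
        trim_wanted = true;
        pthread_cond_signal(&trim_cv);
        pthread_mutex_unlock(&trim_lock);
}

/* The one trimming thread: sleeps until kicked, then trims. */
void *
drc_trimmer(void *arg)
{
        (void)arg;
        for (;;) {
                pthread_mutex_lock(&trim_lock);
                while (!trim_wanted)
                        pthread_cond_wait(&trim_cv, &trim_lock);
                trim_wanted = false;
                pthread_mutex_unlock(&trim_lock);
                drc_trim();
        }
}

static void
drc_trim(void)
{
        /* Walk the LRU and hash buckets, freeing stale entries (omitted). */
}

drc_kick_trimmer() only holds trim_lock long enough to set a flag, so the
service threads never block behind a full trim pass.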
> I think the little drc2.patch which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread will be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
>
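
[For illustration only - a rough sketch of the idea described for drc2.patch
(this is not the actual patch; the names and the highwater value are
invented): the nfsd path tests an unlocked counter first and only takes the
lock and trims when the cache has actually grown past the limit.]

#include <pthread.h>
#include <stdatomic.h>

#define DRC_HIGHWATER   50000           /* tunable: max cached replies before trimming */

static _Atomic int      drc_count;      /* bumped/dropped as entries are added/freed */
static pthread_mutex_t  drc_lock = PTHREAD_MUTEX_INITIALIZER;

static void drc_trim_locked(void);

/*
 * Called from the nfsd RPC path.  The unlocked read means most calls return
 * without touching drc_lock at all; a stale read just delays trimming by one
 * RPC, which is harmless.
 */
void
drc_trim_maybe(void)
{
        if (atomic_load_explicit(&drc_count, memory_order_relaxed) <
            DRC_HIGHWATER)
                return;

        pthread_mutex_lock(&drc_lock);
        if (drc_count >= DRC_HIGHWATER)  /* re-check under the lock */
                drc_trim_locked();
        pthread_mutex_unlock(&drc_lock);
}

static void
drc_trim_locked(void)
{
        /* Walk the LRU and free old entries down to a lowwater mark (omitted). */
}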
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recall someone recently trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL machine
over a 10G network, and noticed the same issue with the nfsd threads.

Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server with
"dd if=/tank/32G.bin of=/dev/null bs=1M" to cache it completely in ARC (the
machine has 196G RAM); if I then read it again locally I would get close to
4GB/sec - completely from the cache...

But if I try to read the file over NFS from the Linux machine I would only get
about 100MB/sec, sometimes a bit more, and all of the nfsd threads are clearly
visible in top. pmcstat also showed the same mutex contention as in the
original post.

I've now applied the drc2 patch, and rerunning the same test yields about
960MB/s transfer over NFS… quite an improvement!
