Date: Sat, 6 Oct 2012 14:20:11 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: freebsd-fs@freebsd.org, rmacklem@freebsd.org, hackers@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: NFS server bottlenecks
Message-ID: <3E7BCFB4-6EE6-48F5-ACA7-A615F3CE5BAC@gmail.com>
In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>
References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>

On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Garrett Wollman wrote:
>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>> <rmacklem@uoguelph.ca> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think? I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1Gbyte,
> so times have changed a bit;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things. There are N nfsd threads already (N can be bumped up to 256
>>> if you wish) and having a bunch more "cache trimming threads" would
>>> just increase contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
> I think the little drc2.patch, which will keep the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread will be
> an advantage. I'd also be worried that the one cache trimming thread
> won't get the job done soon enough.
>
> When I did production testing on a 1Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1Gbyte is no
> longer relevant, I recently recall someone trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman
>> _______________________________________________
>> freebsd-hackers@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> To unsubscribe, send any mail to
>> "freebsd-hackers-unsubscribe@freebsd.org"
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

I was doing some NFS testing with a RELENG_9 machine and a Linux RHEL
machine over a 10G network, and noticed the same nfsd threads issue.

Previously I would read a 32G file locally on the FreeBSD ZFS/NFS server
with "dd if=/tank/32G.bin of=/dev/null bs=1M" to cache it completely in
ARC (the machine has 196G RAM); if I then do this again locally I get
close to 4GB/sec read, completely from the cache.

But if I try to read the file over NFS from the Linux machine, I only
get about 100MB/sec, sometimes a bit more, and all of the nfsd threads
are clearly visible in top. pmcstat also showed the same mutex
contention as in the original post.

I've now applied the drc2 patch, and rerunning the same test yields
about 960MB/s transfer over NFS… quite an improvement!
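
In case it helps anyone following along, here is a rough userspace sketch
of how I understand the idea behind the patch: the service thread does a
cheap unlocked check of the cache size and only takes the mutex and trims
once a high-water mark is exceeded. The names and numbers below are made
up for illustration; this is not the actual drc2.patch or the
sys/fs/nfsserver code.

/*
 * Illustrative sketch only: service threads take the cache mutex and
 * trim only when an unlocked read of the entry count says the cache
 * has grown past a high-water mark.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define DRC_HIGHWATER 4096      /* trim only when the cache exceeds this */

static pthread_mutex_t drc_mtx = PTHREAD_MUTEX_INITIALIZER;
static atomic_int drc_count;    /* approximate count of cached replies */

/* Drop old entries until we are back under the high-water mark. */
static void
drc_trim(void)
{
	while (atomic_load(&drc_count) > DRC_HIGHWATER) {
		/* ... unlink the LRU entry and free it ... */
		atomic_fetch_sub(&drc_count, 1);
	}
}

/* Called by a service thread after it has cached a reply. */
static void
drc_maybe_trim(void)
{
	/*
	 * Cheap unlocked check first: most of the time the cache is
	 * below the limit and we never touch the mutex at all.
	 */
	if (atomic_load(&drc_count) <= DRC_HIGHWATER)
		return;

	pthread_mutex_lock(&drc_mtx);
	drc_trim();
	pthread_mutex_unlock(&drc_mtx);
}

int
main(void)
{
	/* Pretend a burst of RPCs filled the cache. */
	atomic_store(&drc_count, 5000);
	drc_maybe_trim();
	printf("entries after trim: %d\n", atomic_load(&drc_count));
	return (0);
}

The point is simply that in the common case the threads never touch the
global mutex at all, which is where pmcstat showed the time going.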
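
And for completeness, the per-hash-bucket locking that Garrett described
earlier in the thread might look roughly like the sketch below -- again
just illustrative userspace code with made-up names and sizes, not a
patch against the real cache.

/*
 * Illustrative sketch of per-hash-bucket locking: each bucket has its
 * own mutex, so two threads touching different buckets never contend.
 */
#include <pthread.h>
#include <stdint.h>
#include <sys/queue.h>

#define CACHE_HASHSIZE 129      /* a bigger table also spreads contention */

struct cache_entry {
	LIST_ENTRY(cache_entry) ce_hash;
	uint32_t ce_xid;
	/* ... cached reply ... */
};

struct cache_bucket {
	pthread_mutex_t cb_mtx;
	LIST_HEAD(, cache_entry) cb_head;
};

static struct cache_bucket cache_table[CACHE_HASHSIZE];

static struct cache_bucket *
cache_bucket(uint32_t xid)
{
	return (&cache_table[xid % CACHE_HASHSIZE]);
}

static void
cache_init(void)
{
	for (int i = 0; i < CACHE_HASHSIZE; i++) {
		pthread_mutex_init(&cache_table[i].cb_mtx, NULL);
		LIST_INIT(&cache_table[i].cb_head);
	}
}

/* Insert an entry, holding only its bucket's mutex. */
static void
cache_insert(struct cache_entry *ce)
{
	struct cache_bucket *b = cache_bucket(ce->ce_xid);

	pthread_mutex_lock(&b->cb_mtx);
	LIST_INSERT_HEAD(&b->cb_head, ce, ce_hash);
	pthread_mutex_unlock(&b->cb_mtx);
}

/* Look up an entry by xid, again holding only that bucket's mutex. */
static struct cache_entry *
cache_lookup(uint32_t xid)
{
	struct cache_bucket *b = cache_bucket(xid);
	struct cache_entry *ce, *found = NULL;

	pthread_mutex_lock(&b->cb_mtx);
	LIST_FOREACH(ce, &b->cb_head, ce_hash) {
		if (ce->ce_xid == xid) {
			found = ce;
			break;
		}
	}
	pthread_mutex_unlock(&b->cb_mtx);
	return (found);
}

int
main(void)
{
	static struct cache_entry e = { .ce_xid = 42 };

	cache_init();
	cache_insert(&e);
	return (cache_lookup(42) == &e ? 0 : 1);
}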