From: Nikolay Denev <ndenev@gmail.com>
Subject: Re: NFS server bottlenecks
Date: Tue, 9 Oct 2012 17:12:43 +0300
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman
In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>

On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:

> Garrett Wollman wrote:
>> <> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think?
> I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1 Gbyte,
> so times have changed a bit ;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things. There are N nfsd threads already (N can be bumped up to 256
>>> if you wish), and having a bunch more "cache trimming threads" would
>>> just increase contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
> I think the little drc2.patch, which keeps the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread would be
> an advantage. I'd also be worried that the one cache-trimming thread
> won't get the job done soon enough.
>
> When I did production testing on a 1 Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1 Gbyte is no
> longer relevant, I recall someone recently trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman

My quest for IOPS over NFS continues :)

So far I'm not able to achieve more than about 3000 8K read requests over
NFS, while the server locally gives much more. And this is all from a file
that is completely in ARC cache, no disk I/O involved.
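Before digging further, I want to see where the nfsd threads actually burn
their CPU during such a run. Something along these lines should do for
sampling their kernel stacks (an untested sketch only; the @stacks
aggregation name is just mine):

#!/usr/sbin/dtrace -s

/* Sample the kernel stacks of the nfsd process ~997 times per second. */
profile-997
/execname == "nfsd"/
{
        @stacks[stack()] = count();
}

/* On exit (Ctrl-C), keep only the ten hottest stacks and print them. */
END
{
        trunc(@stacks, 10);
        printa(@stacks);
}

If the DRC mutex contention discussed above is really the culprit, I'd
expect the mutex spin/sleep paths to dominate the hottest stacks.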
I've snatched a sample DTrace script from the net:
[ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]

And modified it for our new NFS server:

#!/usr/sbin/dtrace -qs

/* Record the entry time and count calls for every nfsrvd_* function. */
fbt:kernel:nfsrvd_*:entry
{
        self->ts = timestamp;
        @counts[probefunc] = count();
}

/* Compute the call duration in milliseconds. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        this->delta = (timestamp - self->ts) / 1000000;
}

/* Separate aggregation for calls that took 100 ms or longer. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
        @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}

/* Overall latency distribution per function. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        @dist[probefunc, "ms"] = quantize(this->delta);
        self->ts = 0;
}

END
{
        printf("\n");
        printa("function %-20s %@10d\n", @counts);
        printf("\n");
        printa("function %s(), time in %s:%@d\n", @dist);
        printf("\n");
        printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}

And here's sample output from a one-or-two-minute window during a run of
Oracle's ORION benchmark tool from a Linux machine, on a 32G file on an NFS
mount over 10G Ethernet:

[16:01]root@goliath:/home/ndenev# ./nfsrvd.d
^C

function nfsrvd_access                      4
function nfsrvd_statfs                     10
function nfsrvd_getattr                    14
function nfsrvd_commit                     76
function nfsrvd_sentcache              110048
function nfsrvd_write                  110048
function nfsrvd_read                   283648
function nfsrvd_dorpc                  393800
function nfsrvd_getcache               393800
function nfsrvd_rephead                393800
function nfsrvd_updatecache            393800

function nfsrvd_access(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
               1 |                                         0

function nfsrvd_statfs(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
               1 |                                         0

function nfsrvd_getattr(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
               1 |                                         0

function nfsrvd_sentcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
               1 |                                         0

function nfsrvd_rephead(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0

function nfsrvd_updatecache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0

function nfsrvd_getcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
               1 |                                         1
               2 |                                         0
               4 |                                         1
               8 |                                         0

function nfsrvd_write(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
               1 |                                         5
               2 |                                         4
               4 |                                         0

function nfsrvd_read(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
               1 |                                         19
               2 |                                         3
               4 |                                         2
               8 |                                         0
              16 |                                         1
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         1
             512 |                                         0

function nfsrvd_commit(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
               1 |@@@@@@@                                  14
               2 |                                         0
               4 |@                                        1
               8 |@                                        1
              16 |                                         0
              32 |@@@@@@@                                  14
              64 |@                                        2
             128 |                                         0

function nfsrvd_commit(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
           < 100 |                                         0
             100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             150 |                                         0

function nfsrvd_read(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
             250 |                                         0
             300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             350 |                                         0

Looks like the NFS server cache functions are quite fast, but extremely
frequently called.

I hope someone can find this information useful.
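Since nfsrvd_getcache()/nfsrvd_updatecache() are, as far as I understand,
the calls that take the global DRC mutex, the next thing I plan to watch is
their per-second call rate, i.e. roughly how often that mutex is acquired.
A trivial, untested sketch in the same style as the script above (the @rate
aggregation name is just mine):

#!/usr/sbin/dtrace -qs

/* Count the DRC lookup/update calls. */
fbt:kernel:nfsrvd_getcache:entry,
fbt:kernel:nfsrvd_updatecache:entry
{
        @rate[probefunc] = count();
}

/* Print and reset the per-second call rates. */
tick-1sec
{
        printa("%-24s %@8d calls/sec\n", @rate);
        clear(@rate);
}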