From: Nikolay Denev <ndenev@gmail.com>
Subject: Re: NFS server bottlenecks
Date: Tue, 9 Oct 2012 17:12:43 +0300
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman
In-Reply-To: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca>

On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:

> Garrett Wollman wrote:
>> <> said:
>>
>>>> Simple: just use a separate mutex for each list that a cache entry
>>>> is on, rather than a global lock for everything. This would reduce
>>>> the mutex contention, but I'm not sure how significantly since I
>>>> don't have the means to measure it yet.
>>>>
>>> Well, since the cache trimming is removing entries from the lists, I
>>> don't see how that can be done with a global lock for list updates?
>>
>> Well, the global lock is what we have now, but the cache trimming
>> process only looks at one list at a time, so not locking the list that
>> isn't being iterated over probably wouldn't hurt, unless there's some
>> mechanism (that I didn't see) for entries to move from one list to
>> another. Note that I'm considering each hash bucket a separate
>> "list". (One issue to worry about in that case would be cache-line
>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>> ought to be increased to reduce that.)
>>
> Yea, a separate mutex for each hash list might help. There is also the
> LRU list that all entries end up on, that gets used by the trimming code.
> (I think?
> I wrote this stuff about 8 years ago, so I haven't looked at
> it in a while.)
>
> Also, increasing the hash table size is probably a good idea, especially
> if you reduce how aggressively the cache is trimmed.
>
>>> Only doing it once/sec would result in a very large cache when
>>> bursts of traffic arrive.
>>
>> My servers have 96 GB of memory so that's not a big deal for me.
>>
> This code was originally "production tested" on a server with 1 Gbyte,
> so times have changed a bit ;-)
>
>>> I'm not sure I see why doing it as a separate thread will improve
>>> things. There are N nfsd threads already (N can be bumped up to 256
>>> if you wish), and having a bunch more "cache trimming threads" would
>>> just increase contention, wouldn't it?
>>
>> Only one cache-trimming thread. The cache trim holds the (global)
>> mutex for much longer than any individual nfsd service thread has any
>> need to, and having N threads doing that in parallel is why it's so
>> heavily contended. If there's only one thread doing the trim, then
>> the nfsd service threads aren't spending time contending on the
>> mutex (it will be held less frequently and for shorter periods).
>>
> I think the little drc2.patch, which keeps the nfsd threads from
> acquiring the mutex and doing the trimming most of the time, might be
> sufficient. I still don't see why a separate trimming thread would be
> an advantage. I'd also be worried that the one cache-trimming thread
> won't get the job done soon enough.
>
> When I did production testing on a 1 Gbyte server that saw a peak
> load of about 100 RPCs/sec, it was necessary to trim aggressively.
> (Although I'd be tempted to say that a server with 1 Gbyte is no
> longer relevant, I recall someone recently trying to run FreeBSD
> on an i486, although I doubt they wanted to run the nfsd on it.)
>
>>> The only negative effect I can think of w.r.t. having the nfsd
>>> threads doing it would be a (I believe negligible) increase in RPC
>>> response times (the time the nfsd thread spends trimming the cache).
>>> As noted, I think this time would be negligible compared to disk I/O
>>> and network transit times in the total RPC response time?
>>
>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>> network connectivity, spinning on a contended mutex takes a
>> significant amount of CPU time. (For the current design of the NFS
>> server, it may actually be a win to turn off adaptive mutexes -- I
>> should give that a try once I'm able to do more testing.)
>>
> Have fun with it. Let me know when you have what you think is a good patch.
>
> rick
>
>> -GAWollman

My quest for IOPS over NFS continues :)

So far I'm not able to achieve more than about 3000 8K read requests over
NFS, while the server locally gives much more. And this is all from a file
that is completely in ARC cache, no disk I/O involved.
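Before digging further, I want to see where the nfsd threads actually burn
their CPU during such a run. Something along these lines should do for
sampling their kernel stacks (an untested sketch only; the @stacks
aggregation name is just mine):

#!/usr/sbin/dtrace -s

/* Sample the kernel stacks of the nfsd process ~997 times per second. */
profile-997
/execname == "nfsd"/
{
        @stacks[stack()] = count();
}

/* On exit (Ctrl-C), keep only the ten hottest stacks and print them. */
END
{
        trunc(@stacks, 10);
        printa(@stacks);
}

If the DRC mutex contention discussed above is really the culprit, I'd
expect the mutex spin/sleep paths to dominate the hottest stacks.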
I've snatched a sample DTrace script from the net:
[ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]

And modified it for our new NFS server:

#!/usr/sbin/dtrace -qs

/* Record the entry time and count calls for every nfsrvd_* function. */
fbt:kernel:nfsrvd_*:entry
{
        self->ts = timestamp;
        @counts[probefunc] = count();
}

/* Compute the call duration in milliseconds. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        this->delta = (timestamp - self->ts) / 1000000;
}

/* Separate aggregation for calls that took 100 ms or longer. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 && this->delta > 100 /
{
        @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
}

/* Overall latency distribution per function. */
fbt:kernel:nfsrvd_*:return
/ self->ts > 0 /
{
        @dist[probefunc, "ms"] = quantize(this->delta);
        self->ts = 0;
}

END
{
        printf("\n");
        printa("function %-20s %@10d\n", @counts);
        printf("\n");
        printa("function %s(), time in %s:%@d\n", @dist);
        printf("\n");
        printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
}

And here's sample output from a one-or-two-minute window during a run of
Oracle's ORION benchmark tool from a Linux machine, on a 32G file on an NFS
mount over 10G Ethernet:

[16:01]root@goliath:/home/ndenev# ./nfsrvd.d
^C

function nfsrvd_access                      4
function nfsrvd_statfs                     10
function nfsrvd_getattr                    14
function nfsrvd_commit                     76
function nfsrvd_sentcache              110048
function nfsrvd_write                  110048
function nfsrvd_read                   283648
function nfsrvd_dorpc                  393800
function nfsrvd_getcache               393800
function nfsrvd_rephead                393800
function nfsrvd_updatecache            393800

function nfsrvd_access(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
               1 |                                         0

function nfsrvd_statfs(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
               1 |                                         0

function nfsrvd_getattr(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
               1 |                                         0

function nfsrvd_sentcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
               1 |                                         0

function nfsrvd_rephead(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0

function nfsrvd_updatecache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
               1 |                                         0

function nfsrvd_getcache(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
               1 |                                         1
               2 |                                         0
               4 |                                         1
               8 |                                         0

function nfsrvd_write(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
               1 |                                         5
               2 |                                         4
               4 |                                         0

function nfsrvd_read(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
               1 |                                         19
               2 |                                         3
               4 |                                         2
               8 |                                         0
              16 |                                         1
              32 |                                         0
              64 |                                         0
             128 |                                         0
             256 |                                         1
             512 |                                         0

function nfsrvd_commit(), time in ms:
           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
               1 |@@@@@@@                                  14
               2 |                                         0
               4 |@                                        1
               8 |@                                        1
              16 |                                         0
              32 |@@@@@@@                                  14
              64 |@                                        2
             128 |                                         0

function nfsrvd_commit(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
           < 100 |                                         0
             100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             150 |                                         0

function nfsrvd_read(), time in ms for >= 100 ms:
           value  ------------- Distribution ------------- count
             250 |                                         0
             300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
             350 |                                         0

Looks like the NFS server cache functions are quite fast, but extremely
frequently called.

I hope someone can find this information useful.
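Since nfsrvd_getcache()/nfsrvd_updatecache() are, as far as I understand,
the calls that take the global DRC mutex, the next thing I plan to watch is
their per-second call rate, i.e. roughly how often that mutex is acquired.
A trivial, untested sketch in the same style as the script above (the @rate
aggregation name is just mine):

#!/usr/sbin/dtrace -qs

/* Count the DRC lookup/update calls. */
fbt:kernel:nfsrvd_getcache:entry,
fbt:kernel:nfsrvd_updatecache:entry
{
        @rate[probefunc] = count();
}

/* Print and reset the per-second call rates. */
tick-1sec
{
        printa("%-24s %@8d calls/sec\n", @rate);
        clear(@rate);
}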