Date: Tue, 9 Oct 2012 18:23:37 +0300
From: Nikolay Denev <ndenev@gmail.com>
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman <wollman@freebsd.org>
Subject: Re: NFS server bottlenecks
Message-ID: <59B593A0-DA96-4395-B6B9-196F73A1415C@gmail.com>
In-Reply-To: <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>
References: <1666343702.1682678.1349300219198.JavaMail.root@erie.cs.uoguelph.ca> <F865A337-7A68-4DC5-8B9E-627C9E6F3518@gmail.com>
On Oct 9, 2012, at 5:12 PM, Nikolay Denev <ndenev@gmail.com> wrote:

> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmacklem@uoguelph.ca> wrote:
>
>> Garrett Wollman wrote:
>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem
>>> <rmacklem@uoguelph.ca> said:
>>>
>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>> is on, rather than a global lock for everything.  This would reduce
>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>> don't have the means to measure it yet.
>>>>>
>>>> Well, since the cache trimming is removing entries from the lists, I
>>>> don't see how that can be done with a global lock for list updates?
>>>
>>> Well, the global lock is what we have now, but the cache trimming
>>> process only looks at one list at a time, so not locking the list that
>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>> mechanism (that I didn't see) for entries to move from one list to
>>> another.  Note that I'm considering each hash bucket a separate
>>> "list".  (One issue to worry about in that case would be cache-line
>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>> ought to be increased to reduce that.)
>>>
>> Yea, a separate mutex for each hash list might help.  There is also the
>> LRU list that all entries end up on, that gets used by the trimming code.
>> (I think?  I wrote this stuff about 8 years ago, so I haven't looked at
>> it in a while.)
>>
>> Also, increasing the hash table size is probably a good idea, especially
>> if you reduce how aggressively the cache is trimmed.
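To make the per-list locking concrete, a minimal sketch of a DRC-like cache with one mutex per hash chain and a separate mutex for the LRU list used by the trimmer might look like the fragment below. All names are made up and entry lifetime, refcounting and lock initialization are elided; this illustrates the idea only and is not the actual sys/fs/nfsserver code.

/*
 * Hypothetical sketch of per-bucket locking for a DRC-like cache.
 * Each hash chain has its own mutex; the LRU list used for trimming
 * has a separate one, so a lookup only contends with other RPCs that
 * hash to the same bucket (or with LRU maintenance).
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/queue.h>

#define	DRC_HASHSIZE	500		/* e.g. a larger NFSRVCACHE_HASHSIZE */

struct drc_entry {
	LIST_ENTRY(drc_entry)	de_hash;	/* per-bucket hash chain */
	TAILQ_ENTRY(drc_entry)	de_lru;		/* global LRU, for trimming */
	uint32_t		de_xid;
	/* ... cached reply, timestamps, etc. ... */
};

struct drc_bucket {
	struct mtx		db_lock;	/* protects db_head only */
	LIST_HEAD(, drc_entry)	db_head;
} __aligned(CACHE_LINE_SIZE);			/* limit false sharing of buckets */

static struct drc_bucket	drc_table[DRC_HASHSIZE];
static struct mtx		drc_lru_lock;	/* protects drc_lru only */
static TAILQ_HEAD(, drc_entry)	drc_lru = TAILQ_HEAD_INITIALIZER(drc_lru);

static struct drc_entry *
drc_lookup(uint32_t xid)
{
	struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
	struct drc_entry *e;

	mtx_lock(&b->db_lock);			/* one bucket, not the whole cache */
	LIST_FOREACH(e, &b->db_head, de_hash)
		if (e->de_xid == xid)
			break;
	mtx_unlock(&b->db_lock);

	if (e != NULL) {
		/*
		 * LRU maintenance under its own lock.  (A real version
		 * would need a reference to keep the entry from being
		 * freed once the bucket lock is dropped.)
		 */
		mtx_lock(&drc_lru_lock);
		TAILQ_REMOVE(&drc_lru, e, de_lru);
		TAILQ_INSERT_TAIL(&drc_lru, e, de_lru);
		mtx_unlock(&drc_lru_lock);
	}
	return (e);
}

The trimmer would then walk drc_lru under drc_lru_lock and take only the victim's bucket lock to unhash each entry, so an nfsd service thread never waits on a cache-wide lock.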
>>>> Only doing it once/sec would result in a very large cache when bursts
>>>> of traffic arrive.
>>>
>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>
>> This code was originally "production tested" on a server with 1Gbyte,
>> so times have changed a bit ;-)
>>
>>>> I'm not sure I see why doing it as a separate thread will improve things.
>>>> There are N nfsd threads already (N can be bumped up to 256 if you wish)
>>>> and having a bunch more "cache trimming threads" would just increase
>>>> contention, wouldn't it?
>>>
>>> Only one cache-trimming thread.  The cache trim holds the (global)
>>> mutex for much longer than any individual nfsd service thread has any
>>> need to, and having N threads doing that in parallel is why it's so
>>> heavily contended.  If there's only one thread doing the trim, then
>>> the nfsd service threads aren't spending time contending on the
>>> mutex (it will be held less frequently and for shorter periods).
>>>
>> I think the little drc2.patch, which will keep the nfsd threads from
>> acquiring the mutex and doing the trimming most of the time, might be
>> sufficient.  I still don't see why a separate trimming thread will be
>> an advantage.  I'd also be worried that the one cache trimming thread
>> won't get the job done soon enough.
>>
>> When I did production testing on a 1Gbyte server that saw a peak
>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>> (Although I'd be tempted to say that a server with 1Gbyte is no
>> longer relevant, I recently recall someone trying to run FreeBSD
>> on an i486, although I doubt they wanted to run the nfsd on it.)
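For readers who don't have drc2.patch handy, the general shape of "keep the nfsd threads from acquiring the mutex and doing the trimming most of the time" could look roughly like the sketch below. This is a guess at the approach, not the actual patch; the names and numbers are invented.

/*
 * Hypothetical illustration only (not drc2.patch).  Every RPC still
 * calls drc_trim(), but the cache mutex is taken only when a cheap
 * unlocked check says the cache has grown past a highwater mark.
 */
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx	drc_lock;		/* the existing global cache mutex */
static volatile u_int	drc_cachesize;		/* entry count, updated under drc_lock */
static u_int		drc_highwater = 4096;	/* would be a sysctl tunable */

static void
drc_trim(void)
{
	/*
	 * Unlocked read: it may be slightly stale, but the worst case is
	 * trimming one RPC earlier or later than ideal.  The common case
	 * avoids the mutex entirely.
	 */
	if (drc_cachesize <= drc_highwater)
		return;

	mtx_lock(&drc_lock);
	if (drc_cachesize > drc_highwater) {
		/*
		 * Re-checked under the lock: evict entries from the head
		 * of the LRU until the count drops to a lower watermark,
		 * decrementing drc_cachesize as each entry is freed.
		 * (The eviction loop itself is elided in this sketch.)
		 */
	}
	mtx_unlock(&drc_lock);
}

The unlocked comparison is racy, but that is harmless here because trimming is only a heuristic; the point is that the common per-RPC path no longer touches the mutex at all.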
>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>> response times (the time the nfsd thread spends trimming the cache).
>>>> As noted, I think this time would be negligible compared to disk I/O
>>>> and network transit times in the total RPC response time?
>>>
>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>> network connectivity, spinning on a contended mutex takes a
>>> significant amount of CPU time.  (For the current design of the NFS
>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>> should give that a try once I'm able to do more testing.)
>>>
>> Have fun with it.  Let me know when you have what you think is a good patch.
>>
>> rick
>>
>>> -GAWollman
>
> My quest for IOPS over NFS continues :)
> So far I'm not able to achieve more than about 3000 8K read requests over NFS,
> while the server locally gives much more.
> And this is all from a file that is completely in ARC cache, no disk IO involved.
>
> I've snatched some sample DTrace script from the net :
> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>
> And modified it for our new NFS server :
>
> #!/usr/sbin/dtrace -qs
>
> fbt:kernel:nfsrvd_*:entry
> {
>         self->ts = timestamp;
>         @counts[probefunc] = count();
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         this->delta = (timestamp - self->ts) / 1000000;
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 && this->delta > 100 /
> {
>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> }
>
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         @dist[probefunc, "ms"] = quantize(this->delta);
>         self->ts = 0;
> }
>
> END
> {
>         printf("\n");
>         printa("function %-20s %@10d\n", @counts);
>         printf("\n");
>         printa("function %s(), time in %s:%@d\n", @dist);
>         printf("\n");
>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> }
>
> And here's a sample output from one or two minutes during the run of Oracle's ORION benchmark
> tool from a Linux machine, on a 32G file on NFS mount over 10G ethernet:
>
> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> ^C
>
> function nfsrvd_access                4
> function nfsrvd_statfs               10
> function nfsrvd_getattr              14
> function nfsrvd_commit               76
> function nfsrvd_sentcache        110048
> function nfsrvd_write            110048
> function nfsrvd_read             283648
> function nfsrvd_dorpc            393800
> function nfsrvd_getcache         393800
> function nfsrvd_rephead          393800
> function nfsrvd_updatecache      393800
>
> function nfsrvd_access(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>                1 |                                         0
>
> function nfsrvd_statfs(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>                1 |                                         0
>
> function nfsrvd_getattr(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>                1 |                                         0
>
> function nfsrvd_sentcache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>                1 |                                         0
>
> function nfsrvd_rephead(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>                1 |                                         0
>
> function nfsrvd_updatecache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>                1 |                                         0
>
> function nfsrvd_getcache(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>                1 |                                         1
>                2 |                                         0
>                4 |                                         1
>                8 |                                         0
>
> function nfsrvd_write(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>                1 |                                         5
>                2 |                                         4
>                4 |                                         0
>
> function nfsrvd_read(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>                1 |                                         19
>                2 |                                         3
>                4 |                                         2
>                8 |                                         0
>               16 |                                         1
>               32 |                                         0
>               64 |                                         0
>              128 |                                         0
>              256 |                                         1
>              512 |                                         0
>
> function nfsrvd_commit(), time in ms:
>            value  ------------- Distribution ------------- count
>               -1 |                                         0
>                0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>                1 |@@@@@@@                                  14
>                2 |                                         0
>                4 |@                                        1
>                8 |@                                        1
>               16 |                                         0
>               32 |@@@@@@@                                  14
>               64 |@                                        2
>              128 |                                         0
>
> function nfsrvd_commit(), time in ms for >= 100 ms:
>            value  ------------- Distribution ------------- count
>            < 100 |                                         0
>              100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>              150 |                                         0
>
> function nfsrvd_read(), time in ms for >= 100 ms:
>            value  ------------- Distribution ------------- count
>              250 |                                         0
>              300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>              350 |                                         0
>
> Looks like the nfs server cache functions are quite fast, but extremely frequently called.
>
> I hope someone can find this information useful.

Here's another quick one :

#!/usr/sbin/dtrace -qs

#pragma D option quiet

fbt:kernel:nfsrvd_*:entry
{
        self->trace = 1;
}

fbt:kernel:nfsrvd_*:return
/ self->trace /
{
        @calls[probefunc] = count();
}

tick-1sec
{
        printf("%40s | %s\n", "function", "calls per second");
        printa("%40s %10@d\n", @calls);
        clear(@calls);
        printf("\n");
}

Showing the number of calls per second to the nfsrvd_* functions.
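Since the fbt numbers above only show that the cache functions are hot, not how much the global lock itself costs, a small userland toy (nothing FreeBSD-specific, all names invented) can give a rough feel for the contention Garrett describes: several threads hammer a hash table protected either by one global pthread mutex or by one mutex per bucket, and the two run times are compared.

/*
 * Userland toy, not kernel code.  NTHREADS threads each do NOPS random
 * "lookups" against NBUCKETS counters, taking either one global mutex
 * or the bucket's own mutex.  Compare the wall-clock time of the two
 * modes to see what the single lock costs under parallel load.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	NBUCKETS	500		/* think NFSRVCACHE_HASHSIZE */
#define	NTHREADS	8		/* think nfsd service threads */
#define	NOPS		1000000		/* lookups per thread */

static pthread_mutex_t	global_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t	bucket_lock[NBUCKETS];
static unsigned long	counter[NBUCKETS];
static int		use_global;

static void *
worker(void *arg)
{
	unsigned int seed = (unsigned int)(uintptr_t)arg;
	int i, b;

	for (i = 0; i < NOPS; i++) {
		b = rand_r(&seed) % NBUCKETS;
		pthread_mutex_t *l = use_global ? &global_lock : &bucket_lock[b];

		pthread_mutex_lock(l);
		counter[b]++;		/* stand-in for a cache hit/update */
		pthread_mutex_unlock(l);
	}
	return (NULL);
}

int
main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	int i;

	use_global = (argc > 1 && strcmp(argv[1], "global") == 0);
	for (i = 0; i < NBUCKETS; i++)
		pthread_mutex_init(&bucket_lock[i], NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)(uintptr_t)(i + 1));
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	printf("%s locking: done\n", use_global ? "global" : "per-bucket");
	return (0);
}

Build with something like "cc -O2 -o lockbench lockbench.c -pthread" and compare "time ./lockbench global" against "time ./lockbench".  A userland pthread mutex is not a kernel adaptive mutex, so treat the difference as a hint, not a measurement.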