From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 16:21:05 2012
Subject: Re: NFS server bottlenecks
From: Nikolay Denev <ndenev@gmail.com>
Date: Thu, 11 Oct 2012 19:20:53 +0300
To: "freebsd-hackers@freebsd.org"
Cc: rmacklem@freebsd.org, Garrett Wollman
In-Reply-To: <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>
Message-Id: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com>

On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote:

> 
> On Oct 11, 2012, at 1:09 AM, Rick Macklem wrote:
> 
>> Nikolay Denev wrote:
>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem wrote:
>>> 
>>>> Nikolay Denev wrote:
>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:
>>>>> 
>>>>>> Garrett Wollman wrote:
>>>>>>> <
>>>>>> said:
>>>>>>> 
>>>>>>>>> Simple: just use a separate mutex for each list that a cache entry
>>>>>>>>> is on, rather than a global lock for everything. This would reduce
>>>>>>>>> the mutex contention, but I'm not sure how significantly since I
>>>>>>>>> don't have the means to measure it yet.
>>>>>>>>> 
>>>>>>>> Well, since the cache trimming is removing entries from the lists, I
>>>>>>>> don't see how that can be done with a global lock for list updates?
>>>>>>> 
>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>> process only looks at one list at a time, so not locking the list that
>>>>>>> isn't being iterated over probably wouldn't hurt, unless there's some
>>>>>>> mechanism (that I didn't see) for entries to move from one list to
>>>>>>> another. Note that I'm considering each hash bucket a separate
>>>>>>> "list". (One issue to worry about in that case would be cache-line
>>>>>>> contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE
>>>>>>> ought to be increased to reduce that.)
>>>>>>> 
>>>>>> Yea, a separate mutex for each hash list might help. There is also the
>>>>>> LRU list that all entries end up on, that gets used by the trimming
>>>>>> code. (I think? I wrote this stuff about 8 years ago, so I haven't
>>>>>> looked at it in a while.)
>>>>>> 
>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>> 
>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>> bursts of traffic arrive.
>>>>>>> 
>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>> 
>>>>>> This code was originally "production tested" on a server with 1Gbyte,
>>>>>> so times have changed a bit;-)
>>>>>> 
>>>>>>>> I'm not sure I see why doing it as a separate thread will improve
>>>>>>>> things. There are N nfsd threads already (N can be bumped up to 256
>>>>>>>> if you wish) and having a bunch more "cache trimming threads" would
>>>>>>>> just increase contention, wouldn't it?
>>>>>>> 
>>>>>>> Only one cache-trimming thread. The cache trim holds the (global)
>>>>>>> mutex for much longer than any individual nfsd service thread has any
>>>>>>> need to, and having N threads doing that in parallel is why it's so
>>>>>>> heavily contended. If there's only one thread doing the trim, then
>>>>>>> the nfsd service threads aren't spending time contending on the mutex
>>>>>>> (it will be held less frequently and for shorter periods).
>>>>>>> 
>>>>>> I think the little drc2.patch, which will keep the nfsd threads from
>>>>>> acquiring the mutex and doing the trimming most of the time, might be
>>>>>> sufficient. I still don't see why a separate trimming thread will be
>>>>>> an advantage. I'd also be worried that the one cache trimming thread
>>>>>> won't get the job done soon enough.
>>>>>> 
>>>>>> When I did production testing on a 1Gbyte server that saw a peak load
>>>>>> of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no longer
>>>>>> relevant, I recently recall someone trying to run FreeBSD on an i486,
>>>>>> although I doubt they wanted to run the nfsd on it.)
>>>>>> 
>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>>> threads doing it would be a (I believe negligible) increase in RPC
>>>>>>>> response times (the time the nfsd thread spends trimming the cache).
>>>>>>>> As noted, I think this time would be negligible compared to disk I/O
>>>>>>>> and network transit times in the total RPC response time?
>>>>>>> 
>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G
>>>>>>> network connectivity, spinning on a contended mutex takes a
>>>>>>> significant amount of CPU time. (For the current design of the NFS
>>>>>>> server, it may actually be a win to turn off adaptive mutexes -- I
>>>>>>> should give that a try once I'm able to do more testing.)
>>>>>>> 
>>>>>> Have fun with it. Let me know when you have what you think is a good
>>>>>> patch.
>>>>>> 
>>>>>> rick
>>>>>> 
>>>>>>> -GAWollman
>>>>>>> _______________________________________________
>>>>>>> freebsd-hackers@freebsd.org mailing list
>>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>>>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
>>>>>> _______________________________________________
>>>>>> freebsd-fs@freebsd.org mailing list
>>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>>>>> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>>>>> 
>>>>> My quest for IOPS over NFS continues :)
>>>>> So far I'm not able to achieve more than about 3000 8K read requests
>>>>> over NFS, while the server locally gives much more.
>>>>> And this is all from a file that is completely in the ARC cache, no
>>>>> disk IO involved.
>>>>> 
>>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>>> Since the RPC overhead (including the DRC functions) is per RPC, doing
>>>> fewer larger RPCs should usually work better. (Sometimes large
>>>> rsize/wsize values generate too large a burst of traffic for a network
>>>> interface to handle and then the rsize/wsize has to be decreased to
>>>> avoid this issue.)
>>>> 
>>>> And, although this experiment seems useful for testing patches that try
>>>> and reduce DRC CPU overheads, most "real" NFS servers will be doing
>>>> disk I/O.
>>>> 
>>> This is the default blocksize that Oracle and probably most databases
>>> use. It also uses larger blocks, but for small random reads in OLTP
>>> applications this is what is used.
>>> 
>> If the client is doing 8K reads, you could increase the read ahead
>> "readahead=N" (N up to 16), to try and increase the bandwidth.
>> (But if the CPU is 99% busy, then I don't think it will matter.)
> 
> I'll try to check whether this can be set, as we are testing not only with
> the Linux NFS client, but also with Oracle's built-in, so-called DirectNFS
> client that is built into the app.
> 
>> 
>>> 
>>>>> I've snatched some sample DTrace script from the net:
>>>>> [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
>>>>> 
>>>>> And modified it for our new NFS server:
>>>>> 
>>>>> #!/usr/sbin/dtrace -qs
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:entry
>>>>> {
>>>>>     self->ts = timestamp;
>>>>>     @counts[probefunc] = count();
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>>     this->delta = (timestamp-self->ts)/1000000;
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 && this->delta > 100 /
>>>>> {
>>>>>     @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>> }
>>>>> 
>>>>> fbt:kernel:nfsrvd_*:return
>>>>> / self->ts > 0 /
>>>>> {
>>>>>     @dist[probefunc, "ms"] = quantize(this->delta);
>>>>>     self->ts = 0;
>>>>> }
>>>>> 
>>>>> END
>>>>> {
>>>>>     printf("\n");
>>>>>     printa("function %-20s %@10d\n", @counts);
>>>>>     printf("\n");
>>>>>     printa("function %s(), time in %s:%@d\n", @dist);
>>>>>     printf("\n");
>>>>>     printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>> }
>>>>> 
>>>>> And here's a sample output from one or two minutes during the run of
>>>>> Oracle's ORION benchmark tool from a Linux machine, on a 32G file on
>>>>> NFS mount over 10G ethernet:
>>>>> 
>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>>> ^C
>>>>> 
>>>>> function nfsrvd_access              4
>>>>> function nfsrvd_statfs             10
>>>>> function nfsrvd_getattr            14
>>>>> function nfsrvd_commit             76
>>>>> function nfsrvd_sentcache      110048
>>>>> function nfsrvd_write          110048
>>>>> function nfsrvd_read           283648
>>>>> function nfsrvd_dorpc          393800
>>>>> function nfsrvd_getcache       393800
>>>>> function nfsrvd_rephead        393800
>>>>> function nfsrvd_updatecache    393800
>>>>> 
>>>>> function nfsrvd_access(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_statfs(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_getattr(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_sentcache(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_rephead(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_updatecache(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>       1 |                                         0
>>>>> 
>>>>> function nfsrvd_getcache(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>>       1 |                                         1
>>>>>       2 |                                         0
>>>>>       4 |                                         1
>>>>>       8 |                                         0
>>>>> 
>>>>> function nfsrvd_write(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>>       1 |                                         5
>>>>>       2 |                                         4
>>>>>       4 |                                         0
>>>>> 
>>>>> function nfsrvd_read(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>>       1 |                                         19
>>>>>       2 |                                         3
>>>>>       4 |                                         2
>>>>>       8 |                                         0
>>>>>      16 |                                         1
>>>>>      32 |                                         0
>>>>>      64 |                                         0
>>>>>     128 |                                         0
>>>>>     256 |                                         1
>>>>>     512 |                                         0
>>>>> 
>>>>> function nfsrvd_commit(), time in ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>      -1 |                                         0
>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>>>>>       1 |@@@@@@@                                  14
>>>>>       2 |                                         0
>>>>>       4 |@                                        1
>>>>>       8 |@                                        1
>>>>>      16 |                                         0
>>>>>      32 |@@@@@@@                                  14
>>>>>      64 |@                                        2
>>>>>     128 |                                         0
>>>>> 
>>>>> 
>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>   < 100 |                                         0
>>>>>     100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>     150 |                                         0
>>>>> 
>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>>   value  ------------- Distribution ------------- count
>>>>>     250 |                                         0
>>>>>     300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>     350 |                                         0
>>>>> 
>>>>> 
>>>>> Looks like the nfs server cache functions are quite fast, but
>>>>> extremely frequently called.
>>>>> 
>>>> Yep, they are called for every RPC.
>>>> 
>>>> I may try coding up a patch that replaces the single mutex with
>>>> one for each hash bucket, for TCP.
>>>> 
>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>> 
>>> 
>>> Cool.
>>> 
>>> I've readjusted the precision of the dtrace script a bit, and I can now
>>> see the following three functions as taking most of the time:
>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache()
>>> 
>>> This was recorded during an Oracle benchmark run called SLOB, which
>>> caused 99% cpu load on the NFS server.
>>> 
>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
>> (Assuming the mounts are TCP ones.)
>> 
>> Have fun with it, rick
>> 
> 
> I had upped it, but probably not enough. I'm now running with
> vfs.nfsd.tcphighwater set to some ridiculous number, and
> NFSRVCACHE_HASHSIZE set to 500.
> So far it looks like a good improvement, as those functions no longer
> show up in the dtrace script output.
> I'll run some more benchmarks and testing today.
> 
> Thanks!
> 
>>> 
>>>>> I hope someone can find this information useful.
>>>>> 
>>>>> _______________________________________________
>>>>> freebsd-hackers@freebsd.org mailing list
>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>>>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
> 

I haven't had the opportunity today to run more DB tests over NFS, as the
DBA was busy with something else; however, I did test large file transfers
a bit.

And I'm seeing something strange that is probably related not only to NFS
but also to ZFS and the ARC.
When I first tested the drc2.patch I reported a huge bandwidth improvement,
but now I think that was probably because the machine had been freshly
rebooted, and not just because of the patch.
The patch surely improved things, especially CPU utilization combined with
the increased cache.
But today I again have a file completely cached in ZFS's ARC which, when
transferred over NFS, reaches about 300MB/s, whereas in some tests it
reached 900+MB/s (as reported in my first email).
Locally the file can be read at about 3GB/s, as reported by dd.
Local:

[17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)

Over NFS:

[17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s

The machines are almost idle during this transfer, and I can't see a reason
why it can't reach the full bandwidth when it's just reading from RAM.

I've tried tracing again with DTrace to see what's happening, using this
script:

fbt:kernel:nfs*:entry
{
    self->ts = timestamp;
    @counts[probefunc] = count();
}

fbt:kernel:nfs*:return
/ self->ts > 0 /
{
    @time[probefunc] = avg(timestamp - self->ts);
    self->ts = 0;
}

END
{
    trunc(@counts, 10);
    trunc(@time, 10);
    printf("Top 10 called functions\n\n");
    printa(@counts);
    printf("\n\nTop 10 slowest functions\n\n");
    printa(@time);
}

And here's the result (several seconds during the dd test):

Top 10 called functions

  nfsrc_freecache            88849
  nfsrc_wanted               88849
  nfsrv_fillattr             88849
  nfsrv_postopattr           88849
  nfsrvd_read                88849
  nfsrvd_rephead             88849
  nfsrvd_updatecache         88849
  nfsvno_testexp             88849
  nfsrc_trimcache           177697
  nfsvno_getattr            177698

Top 10 slowest functions (average time in ns)

  nfsd_excred                 5673
  nfsrc_freecache             5674
  nfsrv_postopattr            5970
  nfsrv_servertimer           6327
  nfssvc_nfscommon            6596
  nfsd_fhtovp                 8000
  nfsrvd_read                 8380
  nfssvc_program             92752
  nfsvno_read               124979
  nfsvno_fhtovp            1789523

I might now try to trace what nfsvno_fhtovp() is doing and where it is
spending its time.
Any other ideas are welcome :)
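
In case it helps, here is a rough, untested sketch of how that drill-down
could look: it flags the thread while it is inside nfsvno_fhtovp() and sums
the time spent in every function called underneath it. The fbt:::entry and
fbt:::return wildcards enable every fbt probe (including kernel modules such
as zfs.ko), which is heavy, so it should only run for a few seconds; the
variable and aggregation names below are mine, only nfsvno_fhtovp() itself
comes from the output above.

#!/usr/sbin/dtrace -qs

/* Remember when this thread entered nfsvno_fhtovp(). */
fbt:kernel:nfsvno_fhtovp:entry
{
    self->in_fhtovp = timestamp;
}

/* Start a per-function timer for calls made while inside nfsvno_fhtovp(). */
fbt:::entry
/ self->in_fhtovp /
{
    self->called[probefunc] = timestamp;
}

/* Accumulate the time spent in each callee. */
fbt:::return
/ self->in_fhtovp && self->called[probefunc] /
{
    @nested[probefunc] = sum(timestamp - self->called[probefunc]);
    self->called[probefunc] = 0;
}

fbt:kernel:nfsvno_fhtovp:return
/ self->in_fhtovp /
{
    @total["nfsvno_fhtovp() total"] = sum(timestamp - self->in_fhtovp);
    self->in_fhtovp = 0;
}

END
{
    trunc(@nested, 20);
    printf("Time (ns) in functions called from nfsvno_fhtovp():\n");
    printa(@nested);
    printa(@total);
}

If most of the time shows up under the ZFS or VFS lookup path rather than in
the NFS code itself, that would at least say where to look next.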