From owner-freebsd-hackers@FreeBSD.ORG Thu Oct 11 16:47:17 2012
Subject: Re: NFS server bottlenecks
From: Nikolay Denev <ndenev@gmail.com>
Date: Thu, 11 Oct 2012 19:47:12 +0300
In-Reply-To: <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
References: <1071150615.2039567.1349906947942.JavaMail.root@erie.cs.uoguelph.ca> <19724137-ABB0-43AF-BCB9-EBE8ACD6E3BD@gmail.com> <0A8CDBF9-28C3-46D2-BB58-0559D00BD545@gmail.com>
To: freebsd-hackers@freebsd.org
Cc: rmacklem@freebsd.org, Garrett Wollman
List-Id: Technical Discussions relating to FreeBSD

On Oct 11, 2012, at 7:20 PM, Nikolay Denev wrote:

> On Oct 11, 2012, at 8:46 AM, Nikolay Denev wrote:
>
>> On Oct 11, 2012, at 1:09 AM, Rick Macklem wrote:
>>
>>> Nikolay Denev wrote:
>>>> On Oct 10, 2012, at 3:18 AM, Rick Macklem wrote:
>>>>
>>>>> Nikolay Denev wrote:
>>>>>> On Oct 4, 2012, at 12:36 AM, Rick Macklem wrote:
>>>>>>
>>>>>>> Garrett Wollman wrote:
>>>>>>>> <> said:
>>>>>>>>
>>>>>>>>>> Simple: just use a separate mutex for each list that a cache
>>>>>>>>>> entry is on, rather than a global lock for everything. This
>>>>>>>>>> would reduce the mutex contention, but I'm not sure how
>>>>>>>>>> significantly, since I don't have the means to measure it yet.
>>>>>>>>>>
>>>>>>>>> Well, since the cache trimming is removing entries from the
>>>>>>>>> lists, I don't see how that can be done with a global lock for
>>>>>>>>> list updates?
>>>>>>>>
>>>>>>>> Well, the global lock is what we have now, but the cache trimming
>>>>>>>> process only looks at one list at a time, so not locking the list
>>>>>>>> that isn't being iterated over probably wouldn't hurt, unless
>>>>>>>> there's some mechanism (that I didn't see) for entries to move
>>>>>>>> from one list to another. Note that I'm considering each hash
>>>>>>>> bucket a separate "list". (One issue to worry about in that case
>>>>>>>> would be cache-line contention in the array of hash buckets;
>>>>>>>> perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
>>>>>>>>
>>>>>>> Yea, a separate mutex for each hash list might help. There is also
>>>>>>> the LRU list that all entries end up on, that gets used by the
>>>>>>> trimming code. (I think? I wrote this stuff about 8 years ago, so
>>>>>>> I haven't looked at it in a while.)
>>>>>>>
>>>>>>> Also, increasing the hash table size is probably a good idea,
>>>>>>> especially if you reduce how aggressively the cache is trimmed.
>>>>>>>
>>>>>>>>> Only doing it once/sec would result in a very large cache when
>>>>>>>>> bursts of traffic arrive.
>>>>>>>>
>>>>>>>> My servers have 96 GB of memory so that's not a big deal for me.
>>>>>>>>
>>>>>>> This code was originally "production tested" on a server with
>>>>>>> 1Gbyte, so times have changed a bit;-)
>>>>>>>
>>>>>>>>> I'm not sure I see why doing it as a separate thread will
>>>>>>>>> improve things. There are N nfsd threads already (N can be
>>>>>>>>> bumped up to 256 if you wish) and having a bunch more "cache
>>>>>>>>> trimming threads" would just increase contention, wouldn't it?
>>>>>>>>
>>>>>>>> Only one cache-trimming thread.
>>>>>>>> The cache trim holds the (global) mutex for much longer than any
>>>>>>>> individual nfsd service thread has any need to, and having N
>>>>>>>> threads doing that in parallel is why it's so heavily contended.
>>>>>>>> If there's only one thread doing the trim, then the nfsd service
>>>>>>>> threads aren't spending time contending on the mutex (it will be
>>>>>>>> held less frequently and for shorter periods).
>>>>>>>>
>>>>>>> I think the little drc2.patch, which will keep the nfsd threads
>>>>>>> from acquiring the mutex and doing the trimming most of the time,
>>>>>>> might be sufficient. I still don't see why a separate trimming
>>>>>>> thread would be an advantage. I'd also be worried that the one
>>>>>>> cache-trimming thread won't get the job done soon enough.
>>>>>>>
>>>>>>> When I did production testing on a 1Gbyte server that saw a peak
>>>>>>> load of about 100 RPCs/sec, it was necessary to trim aggressively.
>>>>>>> (Although I'd be tempted to say that a server with 1Gbyte is no
>>>>>>> longer relevant, I recall someone recently trying to run FreeBSD
>>>>>>> on an i486, although I doubt they wanted to run the nfsd on it.)
>>>>>>>
>>>>>>>>> The only negative effect I can think of w.r.t. having the nfsd
>>>>>>>>> threads doing it would be a (I believe negligible) increase in
>>>>>>>>> RPC response times (the time the nfsd thread spends trimming the
>>>>>>>>> cache). As noted, I think this time would be negligible compared
>>>>>>>>> to disk I/O and network transit times in the total RPC response
>>>>>>>>> time?
>>>>>>>>
>>>>>>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and
>>>>>>>> 10G network connectivity, spinning on a contended mutex takes a
>>>>>>>> significant amount of CPU time. (For the current design of the
>>>>>>>> NFS server, it may actually be a win to turn off adaptive
>>>>>>>> mutexes -- I should give that a try once I'm able to do more
>>>>>>>> testing.)
>>>>>>>>
>>>>>>> Have fun with it. Let me know when you have what you think is a
>>>>>>> good patch.
>>>>>>>
>>>>>>> rick
>>>>>>>
>>>>>>>> -GAWollman
>>>>>>
>>>>>> My quest for IOPS over NFS continues :)
>>>>>> So far I'm not able to achieve more than about 3000 8K read
>>>>>> requests per second over NFS, while the server locally gives much
>>>>>> more.
>>>>>> And this is all from a file that is completely in the ARC cache,
>>>>>> no disk IO involved.
>>>>>>
>>>>> Just out of curiosity, why do you use 8K reads instead of 64K reads?
>>>>> Since the RPC overhead (including the DRC functions) is per RPC,
>>>>> doing fewer larger RPCs should usually work better. (Sometimes large
>>>>> rsize/wsize values generate too large a burst of traffic for a
>>>>> network interface to handle, and then the rsize/wsize has to be
>>>>> decreased to avoid this issue.)
>>>>>
>>>>> And, although this experiment seems useful for testing patches that
>>>>> try to reduce DRC CPU overheads, most "real" NFS servers will be
>>>>> doing disk I/O.
>>>>>
>>>> This is the default block size that Oracle, and probably most
>>>> databases, use.
>>>> It also uses larger blocks, but for small random reads in OLTP
>>>> applications this is what is used.
>>>>
>>> If the client is doing 8K reads, you could increase the read ahead
>>> "readahead=N" (N up to 16), to try and increase the bandwidth.
>>> (But if the CPU is 99% busy, then I don't think it will matter.)
>>
>> I'll try to check whether this can be set, as we are testing not only
>> with the Linux NFS client, but also with Oracle's built-in Direct NFS
>> client inside the application.
>>
>>>
>>>>
>>>>>> I've snatched some sample DTrace script from the net: [
>>>>>> http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes
>>>>>> ]
>>>>>>
>>>>>> And modified it for our new NFS server:
>>>>>>
>>>>>> #!/usr/sbin/dtrace -qs
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:entry
>>>>>> {
>>>>>>         self->ts = timestamp;
>>>>>>         @counts[probefunc] = count();
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>>         this->delta = (timestamp - self->ts) / 1000000;
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 && this->delta > 100 /
>>>>>> {
>>>>>>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
>>>>>> }
>>>>>>
>>>>>> fbt:kernel:nfsrvd_*:return
>>>>>> / self->ts > 0 /
>>>>>> {
>>>>>>         @dist[probefunc, "ms"] = quantize(this->delta);
>>>>>>         self->ts = 0;
>>>>>> }
>>>>>>
>>>>>> END
>>>>>> {
>>>>>>         printf("\n");
>>>>>>         printa("function %-20s %@10d\n", @counts);
>>>>>>         printf("\n");
>>>>>>         printa("function %s(), time in %s:%@d\n", @dist);
>>>>>>         printf("\n");
>>>>>>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
>>>>>> }
>>>>>>
>>>>>> And here's a sample output from one or two minutes during the run
>>>>>> of Oracle's ORION benchmark tool from a Linux machine, on a 32G
>>>>>> file on an NFS mount over 10G ethernet:
>>>>>>
>>>>>> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
>>>>>> ^C
>>>>>>
>>>>>> function nfsrvd_access              4
>>>>>> function nfsrvd_statfs             10
>>>>>> function nfsrvd_getattr            14
>>>>>> function nfsrvd_commit             76
>>>>>> function nfsrvd_sentcache      110048
>>>>>> function nfsrvd_write          110048
>>>>>> function nfsrvd_read           283648
>>>>>> function nfsrvd_dorpc          393800
>>>>>> function nfsrvd_getcache       393800
>>>>>> function nfsrvd_rephead        393800
>>>>>> function nfsrvd_updatecache    393800
>>>>>>
>>>>>> function nfsrvd_access(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_statfs(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_getattr(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_sentcache(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_rephead(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_updatecache(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>>>>>>       1 |                                         0
>>>>>>
>>>>>> function nfsrvd_getcache(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>>>>>>       1 |                                         1
>>>>>>       2 |                                         0
>>>>>>       4 |                                         1
>>>>>>       8 |                                         0
>>>>>>
>>>>>> function nfsrvd_write(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>>>>>>       1 |                                         5
>>>>>>       2 |                                         4
>>>>>>       4 |                                         0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>>>>>>       1 |                                         19
>>>>>>       2 |                                         3
>>>>>>       4 |                                         2
>>>>>>       8 |                                         0
>>>>>>      16 |                                         1
>>>>>>      32 |                                         0
>>>>>>      64 |                                         0
>>>>>>     128 |                                         0
>>>>>>     256 |                                         1
>>>>>>     512 |                                         0
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>      -1 |                                         0
>>>>>>       0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>>>>>>       1 |@@@@@@@                                  14
>>>>>>       2 |                                         0
>>>>>>       4 |@                                        1
>>>>>>       8 |@                                        1
>>>>>>      16 |                                         0
>>>>>>      32 |@@@@@@@                                  14
>>>>>>      64 |@                                        2
>>>>>>     128 |                                         0
>>>>>>
>>>>>> function nfsrvd_commit(), time in ms for >= 100 ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>   < 100 |                                         0
>>>>>>     100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>>     150 |                                         0
>>>>>>
>>>>>> function nfsrvd_read(), time in ms for >= 100 ms:
>>>>>>   value  ------------- Distribution ------------- count
>>>>>>     250 |                                         0
>>>>>>     300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>>>>>>     350 |                                         0
>>>>>>
>>>>>> Looks like the NFS server cache functions are quite fast, but
>>>>>> extremely frequently called.
>>>>>>
>>>>> Yep, they are called for every RPC.
>>>>>
>>>>> I may try coding up a patch that replaces the single mutex with
>>>>> one for each hash bucket, for TCP.
>>>>>
>>>>> I'll post if/when I get this patch to a testing/review stage, rick
>>>>>
>>>> Cool.
>>>>
>>>> I've readjusted the precision of the dtrace script a bit, and I can
>>>> now see the following three functions as taking most of the time:
>>>> nfsrvd_getcache(), nfsrc_trimcache() and nfsrvd_updatecache().
>>>>
>>>> This was recorded during an Oracle benchmark run called SLOB, which
>>>> caused 99% CPU load on the NFS server.
>>>>
>>> Even with the drc2.patch and a large value for vfs.nfsd.tcphighwater?
>>> (Assuming the mounts are TCP ones.)
>>>
>>> Have fun with it, rick
>>>
>> I had upped it, but probably not enough. I'm now running with
>> vfs.nfsd.tcphighwater set to some ridiculous number, and
>> NFSRVCACHE_HASHSIZE set to 500.
>> So far it looks like a good improvement, as those functions no longer
>> show up in the dtrace script output.
>> I'll run some more benchmarks and testing today.
>>
>> Thanks!
>>
>>>>
>>>>>> I hope someone can find this information useful.
>>>>>>
>>
>
> I haven't had the opportunity today to run more DB tests over NFS, as
> the DBA was busy with something else; however, I tested the large file
> transfers a bit.
> And I'm seeing something strange, probably not only NFS- but also ZFS-
> and ARC-related.
>
> When I first tested the drc2.patch I reported a huge bandwidth
> improvement, but now I think that was probably because the machine was
> freshly rebooted, rather than just because of the patch.
> The patch surely improved things, especially CPU utilization combined
> with the increased cache.
> But today I again have a file completely cached in ZFS's ARC, which,
> when transferred over NFS, reaches about 300 MB/s, whereas in some
> tests it reached 900+ MB/s (as reported in my first email).
> The file can be read locally at about 3 GB/s, as reported by dd.
>
> Local:
> [17:51]root@goliath:/tank/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes transferred in 10.548485 secs (3053733573 bytes/sec)
>
> Over NFS:
> [17:48]root@spa:/mnt/spa_db/undo# dd if=data.dbf of=/dev/null bs=1M
> 30720+1 records in
> 30720+1 records out
> 32212262912 bytes (32 GB) copied, 88.0663 seconds, 366 MB/s
>
> The machines are almost idle during this transfer, and I can't see a
> reason why it can't reach the full bandwidth when it's just reading
> from RAM.
>
> I've tried tracing again with DTrace to see what's happening, using
> this script:
>
> fbt:kernel:nfs*:entry
> {
>         self->ts = timestamp;
>         @counts[probefunc] = count();
> }
>
> fbt:kernel:nfs*:return
> / self->ts > 0 /
> {
>         @time[probefunc] = avg(timestamp - self->ts);
>         self->ts = 0;
> }
>
> END
> {
>         trunc(@counts, 10);
>         trunc(@time, 10);
>         printf("Top 10 called functions\n\n");
>         printa(@counts);
>         printf("\n\nTop 10 slowest functions\n\n");
>         printa(@time);
> }
>
> And here's the result (several seconds during the dd test):
>
> Top 10 called functions
>
> nfsrc_freecache        88849
> nfsrc_wanted           88849
> nfsrv_fillattr         88849
> nfsrv_postopattr       88849
> nfsrvd_read            88849
> nfsrvd_rephead         88849
> nfsrvd_updatecache     88849
> nfsvno_testexp         88849
> nfsrc_trimcache       177697
> nfsvno_getattr        177698
>
> Top 10 slowest functions
>
> nfsd_excred             5673
> nfsrc_freecache         5674
> nfsrv_postopattr        5970
> nfsrv_servertimer       6327
> nfssvc_nfscommon        6596
> nfsd_fhtovp             8000
> nfsrvd_read             8380
> nfssvc_program         92752
> nfsvno_read           124979
> nfsvno_fhtovp        1789523
>
> I might now try to trace what nfsvno_fhtovp() is doing and where it is
> spending its time.
>
> Any other ideas are welcome :)
>

To take the network out of the equation, I redid the test by mounting the
same filesystem over NFS on the server itself:

[18:23]root@goliath:~# mount -t nfs -o rw,hard,intr,tcp,nfsv3,rsize=1048576,wsize=1048576 localhost:/tank/spa_db/undo /mnt
[18:24]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 79.793343 secs (403696120 bytes/sec)
[18:25]root@goliath:~# dd if=/mnt/data.dbf of=/dev/null bs=1M
30720+1 records in
30720+1 records out
32212262912 bytes transferred in 12.033420 secs (2676900110 bytes/sec)

During the first run I saw several nfsd threads in top, along with dd,
and again zero disk I/O. There was an increase in memory usage because of
the double buffering ARC -> buffer cache.
The second run, with all of the nfsd threads totally idle, read directly
from the buffer cache.
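For reference, the per-hash-bucket locking Rick proposes earlier in the
thread can be sketched roughly as follows. This is a hypothetical
userspace illustration using pthreads, not the actual DRC code: the real
kernel patch would use FreeBSD's mtx(9) primitives, and the struct and
function names here (bucket, cache_insert, trim_bucket) are made up for
the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define HASHSIZE 500    /* cf. NFSRVCACHE_HASHSIZE bumped to 500 above */

struct entry {
	uint32_t	 xid;	/* hash key, e.g. the RPC transaction id */
	struct entry	*next;
};

/*
 * One mutex per bucket: nfsd threads (and the trimmer) working in
 * different buckets never contend, unlike with a single global mutex.
 */
struct bucket {
	pthread_mutex_t	 lock;
	struct entry	*head;
};

static struct bucket table[HASHSIZE];

static void
cache_init(void)
{
	int i;

	for (i = 0; i < HASHSIZE; i++)
		pthread_mutex_init(&table[i].lock, NULL);
}

static struct bucket *
bucket_for(uint32_t xid)
{
	return (&table[xid % HASHSIZE]);
}

/* Insert under the bucket lock only -- no global DRC mutex is taken. */
static void
cache_insert(struct entry *e)
{
	struct bucket *b = bucket_for(e->xid);

	pthread_mutex_lock(&b->lock);
	e->next = b->head;
	b->head = e;
	pthread_mutex_unlock(&b->lock);
}

/*
 * Trim one bucket at a time: the trimmer holds only that bucket's
 * mutex, so it blocks at most the threads hashing into the same bucket.
 * Returns the number of entries unlinked.
 */
static int
trim_bucket(struct bucket *b, int (*expired)(struct entry *))
{
	struct entry **pp, *e;
	int freed = 0;

	pthread_mutex_lock(&b->lock);
	for (pp = &b->head; (e = *pp) != NULL; ) {
		if (expired(e)) {
			*pp = e->next;
			freed++;
		} else
			pp = &e->next;
	}
	pthread_mutex_unlock(&b->lock);
	return (freed);
}
```

Padding each bucket to a cache-line boundary (or increasing HASHSIZE, as
discussed above) would also address the cache-line contention Garrett
mentions for adjacent buckets in the array.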