From owner-freebsd-fs@FreeBSD.ORG Mon Oct 15 20:34:59 2012
Date: Mon, 15 Oct 2012 16:34:56 -0400 (EDT)
From: Rick Macklem
To: Nikolay Denev
Cc: "freebsd-fs@freebsd.org"
Subject: Re: Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]
Message-ID: <1632051502.2285525.1350333296994.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <9BE97E36-8995-4968-B8ED-1B17D308ED19@gmail.com>

Nikolay Denev wrote:
> On Oct 15, 2012, at 5:06 PM, Rick Macklem wrote:
>
> > Nikolay Denev wrote:
> >> On Oct 13, 2012, at 6:22 PM, Nikolay Denev wrote:
> >>
> >>>
> >>> On Oct 13, 2012, at 5:05 AM, Rick Macklem wrote:
> >>>
> >>>> I wrote:
> >>>>> Oops, I didn't get the "readahead" option description
> >>>>> quite right in the last post. The default read ahead
> >>>>> is 1, which does result in "rsize * 2", since there is
> >>>>> the read + 1 readahead.
> >>>>>
> >>>>> "rsize * 16" would actually be for the option "readahead=15",
> >>>>> and for "readahead=16" the calculation would be "rsize * 17".
> >>>>>
> >>>>> However, the example was otherwise ok, I think? rick
> >>>>
> >>>> I've attached the patch drc3.patch (it assumes drc2.patch has
> >>>> already been applied) that replaces the single mutex with one
> >>>> for each hash list for tcp. It also increases the size of
> >>>> NFSRVCACHE_HASHSIZE to 200.
> >>>>
> >>>> These patches are also at:
> >>>> http://people.freebsd.org/~rmacklem/drc2.patch
> >>>> http://people.freebsd.org/~rmacklem/drc3.patch
> >>>> in case the attachments don't get through.
> >>>>
> >>>> rick
> >>>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
> >>>
> >>> drc3.patch applied and built cleanly, and it shows a nice improvement!
> >>>
> >>> I've done a quick benchmark using iozone over the NFS mount from
> >>> the Linux host.
> >>>
> >>> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
> >>>
> >>> TEST WITH 8K
> >>> -------------------------------------------------------------------------------------------------
> >>> Auto Mode
> >>> Using Minimum Record Size 8 KB
> >>> Using Maximum Record Size 8 KB
> >>> Using minimum file size of 2097152 kilobytes.
> >>> Using maximum file size of 2097152 kilobytes.
> >>> O_DIRECT feature enabled
> >>> SYNC Mode.
> >>> OPS Mode. Output is in operations per second.
> >>> Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>> Time Resolution = 0.000001 seconds.
> >>> Processor cache size set to 1024 Kbytes.
> >>> Processor cache line size set to 32 bytes.
> >>> File stride size set to 17 * record size.
> >>>                                                         random   random     bkwd   record   stride
> >>>       KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
> >>>  2097152       8     1919     1914     2356     2321      2335     1706
> >>>
> >>> TEST WITH 1M
> >>> -------------------------------------------------------------------------------------------------
> >>> Auto Mode
> >>> Using Minimum Record Size 1024 KB
> >>> Using Maximum Record Size 1024 KB
> >>> Using minimum file size of 2097152 kilobytes.
> >>> Using maximum file size of 2097152 kilobytes.
> >>> O_DIRECT feature enabled
> >>> SYNC Mode.
> >>> OPS Mode. Output is in operations per second.
> >>> Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>> Time Resolution = 0.000001 seconds.
> >>> Processor cache size set to 1024 Kbytes.
> >>> Processor cache line size set to 32 bytes.
> >>> File stride size set to 17 * record size.
> >>>                                                         random   random     bkwd   record   stride
> >>>       KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
> >>>  2097152    1024       73       64      477      486       496       61
> >>>
> >>>
> >>> drc3.patch
> >>>
> >>> TEST WITH 8K
> >>> -------------------------------------------------------------------------------------------------
> >>> Auto Mode
> >>> Using Minimum Record Size 8 KB
> >>> Using Maximum Record Size 8 KB
> >>> Using minimum file size of 2097152 kilobytes.
> >>> Using maximum file size of 2097152 kilobytes.
> >>> O_DIRECT feature enabled
> >>> SYNC Mode.
> >>> OPS Mode. Output is in operations per second.
> >>> Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>> Time Resolution = 0.000001 seconds.
> >>> Processor cache size set to 1024 Kbytes.
> >>> Processor cache line size set to 32 bytes.
> >>> File stride size set to 17 * record size.
> >>>                                                         random   random     bkwd   record   stride
> >>>       KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
> >>>  2097152       8     2108     2397     3001     3013      3010     2389
> >>>
> >>>
> >>> TEST WITH 1M
> >>> -------------------------------------------------------------------------------------------------
> >>> Auto Mode
> >>> Using Minimum Record Size 1024 KB
> >>> Using Maximum Record Size 1024 KB
> >>> Using minimum file size of 2097152 kilobytes.
> >>> Using maximum file size of 2097152 kilobytes.
> >>> O_DIRECT feature enabled
> >>> SYNC Mode.
> >>> OPS Mode. Output is in operations per second.
> >>> Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
> >>> Time Resolution = 0.000001 seconds.
> >>> Processor cache size set to 1024 Kbytes.
> >>> Processor cache line size set to 32 bytes.
> >>> File stride size set to 17 * record size.
> >>>                                                         random   random     bkwd   record   stride
> >>>       KB  reclen    write  rewrite     read   reread      read    write     read  rewrite     read   fwrite frewrite    fread  freread
> >>>  2097152    1024       80       79      521      536       528       75
> >>>
> >>>
> >>> Also, with drc3 the CPU usage on the server is noticeably lower. Most
> >>> of the time I could see only the geom {g_up}/{g_down} threads and a
> >>> few nfsd threads; before that, the nfsd threads were much more
> >>> prominent.
> >>>
> >>> I guess the performance improvement can be bigger under a heavier load.
> >>>
> >>> I'll run some more tests with heavier loads this week.
> >>>
> >>> Thanks,
> >>> Nikolay
> >>>
> >>>
> >>
> >> If anyone is interested, here's a FlameGraph [*] generated using DTrace
> >> and Brendan Gregg's tools from https://github.com/brendangregg/FlameGraph :
> >>
> >> https://home.totalterror.net/freebsd/goliath-kernel.svg
> >>
> >> It was sampled during an Oracle database restore from the Linux host
> >> over the NFS mount.
> >> Currently all I/O on the dataset that the Linux machine writes to is
> >> stuck; a simple ls in the directory hangs for maybe 10-15 minutes and
> >> then eventually completes.
> >>
> >> Looks like some weird locking issue.
> >>
> >> [*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
> >>
> >> P.S.: The machine runs with drc3.patch for the NFS server.
> >> P.S.2: The nfsd server is configured with vfs.nfsd.maxthreads=200;
> >> maybe that's too much?
> >>
> > You could try trimming the size of vfs.nfsd.tcphighwater down.
> > Remember that, with this patch, when you increase this tunable, you
> > are trading space for CPU overhead.
> >
> > If it's still "running", you could do "vmstat -m" and "vmstat -z" to
> > see where the memory is allocated. ("nfsstat -e -s" will tell you the
> > size of the cache.)
> >
> > rick
> >> _______________________________________________
> >> freebsd-fs@freebsd.org mailing list
> >> http://lists.freebsd.org/mailman/listinfo/freebsd-fs
> >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
> >
> Are you saying that the time spent in _mtx_spin_lock could be because of
> this?

No. I was thinking that memory used by the DRC isn't available to ZFS, and
that ZFS might be getting constrained because of this. As I've said before,
I'm not a ZFS guy, but you don't have to look very hard to find problems
related to ZFS running low on what I think they call the ARC cache. (I
believe it is usually a lack of kernel virtual address space, but I'm not
the guy to know if that's correct or how to tell.)

> To me it looks like there was some heavy contention in ZFS, maybe
> specific to the way it's accessed by the NFS server? Probably due to a
> high maxthreads value?
>
Using fewer nfsd threads would put a lower upper bound on the load ZFS
sees, since the number of threads sets the upper limit on the number of
concurrent VOP_xxx() calls.

> Here's the nfsstat -s -e; it seems wrong, since it's a negative number.
> Maybe it overflowed?
>
There was a bug fixed a while ago, where "nfsstat -e -z" would zero the
count out and then it would go negative as the count decreased. It will
also wrap around when it hits 2B, since it's a signed 32-bit counter.
(jwd@ suggested changing the printf() to at least show it unsigned, but I
don't think we ever got around to a patch.)
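(As a rough illustration of the checks mentioned above, this is the kind of
thing one might run on the server to see the DRC size and where the memory
is going; the grep patterns and the addition of "netstat -m" are only my
guesses at what's useful here, so adjust them as needed:

    nfsstat -e -s              # DRC size shows up as CacheSize/TCPPeak near the end
    vmstat -m | grep -i nfs    # malloc types used by the NFS code
    vmstat -z | grep -i mbuf   # mbuf and mbuf cluster zones; the cached replies live here
    netstat -m                 # overall mbuf/cluster usage and limits

)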
> Server:
> Retfailed    Faults   Clients
>         0         0         0
> OpenOwner     Opens LockOwner     Locks    Delegs
>         0         0         0         0         0
> Server Cache Stats:
>    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
>         0         0         0  83500632    -24072     16385
>
>
> Also, here are the following sysctls:
>
> vfs.nfsd.request_space_used: 0
> vfs.nfsd.request_space_used_highest: 13121808
> vfs.nfsd.request_space_high: 13107200
> vfs.nfsd.request_space_low: 8738133
> vfs.nfsd.request_space_throttled: 0
> vfs.nfsd.request_space_throttle_count: 0
>
> Are they related to the same request cache?
>
Nope. They are in the krpc (sys/rpc/svc.c) and control/limit the space
used by requests (mbuf clusters). Again, a bigger DRC will mean less
mbuf/mbuf cluster space available for the rest of the system. Reduce
vfs.nfsd.tcphighwater and you reduce the mbuf/mbuf cluster usage of the
DRC. (It caches each reply by m_copy()ing the mbuf list.)

> I have stats that show that at some point nfsd has allocated all 200
> threads and vfs.nfsd.request_space_used hits the ceiling too.

When all the threads are busy, new requests will be queued in the receive
side of the krpc code, which means more request_space_used.

As I mentioned, use "vmstat -z" to see what the mbuf/mbuf cluster usage
is, among other things.

rick
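For concreteness, the kind of tuning suggested above might look something
like the following sketch. The numbers are purely illustrative (not
recommendations), and it assumes these sysctls can be changed on a running
system; otherwise they would go in /etc/sysctl.conf:

    sysctl vfs.nfsd.tcphighwater        # see the current DRC highwater mark
    sysctl vfs.nfsd.tcphighwater=5000   # smaller DRC: fewer m_copy()'d replies held
    sysctl vfs.nfsd.maxthreads=64       # fewer nfsd threads: fewer concurrent VOP_xxx() calls into ZFS
    nfsstat -e -s                       # then watch CacheSize/TCPPeak
    vmstat -z | grep -i mbuf            # and the mbuf/mbuf cluster zones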