From owner-freebsd-fs@FreeBSD.ORG Mon Oct 15 15:08:24 2012
From: Nikolay Denev <ndenev@gmail.com>
To: Rick Macklem
Cc: "freebsd-fs@freebsd.org"
Date: Mon, 15 Oct 2012 18:08:19 +0300
Subject: Re: Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]

On Oct 15, 2012, at 5:06 PM, Rick Macklem wrote:

> Nikolay Denev wrote:
>> On Oct 13, 2012, at 6:22 PM, Nikolay Denev wrote:
>>
>>> On Oct 13, 2012, at 5:05 AM, Rick Macklem wrote:
>>>
>>>> I wrote:
>>>>> Oops, I didn't get the "readahead" option description
>>>>> quite right in the last post. The default readahead
>>>>> is 1, which does result in "rsize * 2", since there is
>>>>> the read + 1 readahead.
>>>>>
>>>>> "rsize * 16" would actually be for the option "readahead=15",
>>>>> and for "readahead=16" the calculation would be "rsize * 17".
>>>>>
>>>>> However, the example was otherwise ok, I think? rick
>>>>
>>>> I've attached the patch drc3.patch (it assumes drc2.patch has already
>>>> been applied) that replaces the single mutex with one for each hash
>>>> list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.
>>>>
>>>> These patches are also at:
>>>> http://people.freebsd.org/~rmacklem/drc2.patch
>>>> http://people.freebsd.org/~rmacklem/drc3.patch
>>>> in case the attachments don't get through.
>>>>
>>>> rick
>>>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
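For context, the change drc3.patch describes is the classic "one lock per
hash chain" pattern: instead of every nfsd thread serializing on a single
global duplicate-request-cache (DRC) mutex, only requests that hash to the
same chain contend with each other. A minimal sketch of that shape, with
made-up names (this is not the actual drc2/drc3 code):

    /*
     * Sketch only: one mutex per hash chain instead of a single global
     * mutex for the whole DRC.  All identifiers are hypothetical.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    #define	DRC_HASHSIZE	200	/* drc3 bumps NFSRVCACHE_HASHSIZE to 200 */

    struct drc_entry {
    	LIST_ENTRY(drc_entry)	de_hash;
    	uint32_t		de_xid;
    	/* ... cached reply, timestamps, client address ... */
    };

    static struct drc_bucket {
    	struct mtx		db_mtx;		/* protects db_head only */
    	LIST_HEAD(, drc_entry)	db_head;
    } drc_table[DRC_HASHSIZE];

    static void
    drc_init(void)
    {
    	int i;

    	for (i = 0; i < DRC_HASHSIZE; i++) {
    		mtx_init(&drc_table[i].db_mtx, "drcbucket", NULL, MTX_DEF);
    		LIST_INIT(&drc_table[i].db_head);
    	}
    }

    /*
     * Only requests hashing to the same chain serialize here; with one
     * global mutex, every nfsd thread would contend on every lookup.
     */
    static struct drc_entry *
    drc_lookup(uint32_t xid)
    {
    	struct drc_bucket *db;
    	struct drc_entry *de;

    	db = &drc_table[xid % DRC_HASHSIZE];
    	mtx_lock(&db->db_mtx);
    	LIST_FOREACH(de, &db->db_head, de_hash) {
    		if (de->de_xid == xid) {  /* real code matches more fields */
    			mtx_unlock(&db->db_mtx);
    			return (de);
    		}
    	}
    	mtx_unlock(&db->db_mtx);
    	return (NULL);
    }

Raising NFSRVCACHE_HASHSIZE to 200 complements the finer locking: shorter
chains mean both shorter walks and shorter lock hold times.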
>>>
>>> drc3.patch applied and built cleanly, and it shows a nice improvement!
>>>
>>> I've done a quick benchmark using iozone over the NFS mount from the
>>> Linux host.
>>>
>>> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
>>>
>>> TEST WITH 8K
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 8 KB
>>>     Using Maximum Record Size 8 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152       8      1919      1914      2356      2321      2335      1706
>>>
>>> TEST WITH 1M
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 1024 KB
>>>     Using Maximum Record Size 1024 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152    1024        73        64       477       486       496        61
>>>
>>>
>>> drc3.patch
>>>
>>> TEST WITH 8K
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 8 KB
>>>     Using Maximum Record Size 8 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152       8      2108      2397      3001      3013      3010      2389
>>>
>>>
>>> TEST WITH 1M
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 1024 KB
>>>     Using Maximum Record Size 1024 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152    1024        80        79       521       536       528        75
>>>
>>>
>>> Also, with drc3 the CPU usage on the server is noticeably lower. Most
>>> of the time I could see only the geom {g_up}/{g_down} threads and a few
>>> nfsd threads; before that, the nfsd threads were much more prominent.
>>>
>>> I guess the performance improvement would be bigger under heavier load.
>>>
>>> I'll run some more tests with heavier loads this week.
>>>
>>> Thanks,
>>> Nikolay
>>>
>>
>> If anyone is interested, here's a flame graph [*] generated using DTrace
>> and Brendan Gregg's tools from https://github.com/brendangregg/FlameGraph :
>>
>> https://home.totalterror.net/freebsd/goliath-kernel.svg
>>
>> It was sampled during an Oracle database restore from the Linux host over
>> the NFS mount.
>> Currently all I/O on the dataset that the Linux machine writes to is
>> stuck; a simple ls in the directory hangs for maybe 10-15 minutes and
>> then eventually completes.
>>
>> Looks like some weird locking issue.
>>
>> [*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
>>
>> P.S.: The machine runs with drc3.patch for the NFS server.
>> P.S.2: The nfsd server is configured with vfs.nfsd.maxthreads=200;
>> maybe that's too much?
>>
> You could try trimming vfs.nfsd.tcphighwater down. Remember that, with
> this patch, when you increase this tunable, you are trading space for
> CPU overhead.
>
> If it's still "running", you could do "vmstat -m" and "vmstat -z" to
> see where the memory is allocated. ("nfsstat -e -s" will tell you the
> size of the cache.)
>
> rick

Are you saying that the time spent in _mtx_spin_lock could be because of
this? To me it looks like there was some heavy contention in ZFS, maybe
specific to the way it's accessed by the NFS server? Probably due to the
high maxthreads value?

Here's the nfsstat -s -e output; it seems wrong, as CacheSize is a
negative number. Maybe it overflowed?

Server:
 Retfailed    Faults   Clients
         0         0         0
 OpenOwner     Opens LockOwner     Locks    Delegs
         0         0         0         0         0
Server Cache Stats:
    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
         0         0         0  83500632    -24072     16385

Also, here are the following sysctls:

vfs.nfsd.request_space_used: 0
vfs.nfsd.request_space_used_highest: 13121808
vfs.nfsd.request_space_high: 13107200
vfs.nfsd.request_space_low: 8738133
vfs.nfsd.request_space_throttled: 0
vfs.nfsd.request_space_throttle_count: 0

Are they related to the same request cache?

I have stats showing that at some point nfsd had allocated all 200
threads, and vfs.nfsd.request_space_used hit the ceiling too.
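A note on that negative CacheSize: plain 32-bit overflow seems unlikely,
since Misses is only about 83 million, nowhere near the 2^31 limit of a
signed counter. A value like -24072 is also what you would see if the
entry counter were updated by many nfsd threads without synchronization,
losing increments and decrements to races. Purely as an illustration of
that failure mode (a userland sketch with made-up names, not the actual
nfsd code):

    /* Build with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    1000000

    static volatile int cachesize;	/* shared counter, no locking: the bug */

    static void *
    worker(void *arg)
    {
    	(void)arg;
    	for (int i = 0; i < ITERS; i++) {
    		cachesize++;	/* "add an entry to the cache" */
    		cachesize--;	/* "remove it again" */
    	}
    	return (NULL);
    }

    int
    main(void)
    {
    	pthread_t tid[NTHREADS];

    	for (int i = 0; i < NTHREADS; i++)
    		pthread_create(&tid[i], NULL, worker, NULL);
    	for (int i = 0; i < NTHREADS; i++)
    		pthread_join(tid[i], NULL);

    	/*
    	 * Balanced ++/-- should leave 0; lost updates typically leave
    	 * a skewed value, positive or negative.
    	 */
    	printf("cachesize = %d\n", cachesize);
    	return (0);
    }

If something like that were happening, protecting the counter with one of
the per-bucket mutexes, or updating it with atomic_add_int()/
atomic_subtract_int(), would fix the accounting.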