From: Nikolay Denev
To: "freebsd-fs@freebsd.org"
Date: Mon, 15 Oct 2012 10:28:44 +0300
Subject: Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]

On Oct 13, 2012, at 6:22 PM, Nikolay Denev wrote:

>
> On Oct 13, 2012, at 5:05 AM, Rick Macklem wrote:
>
>> I wrote:
>>> Oops, I didn't get the "readahead" option description
>>> quite right in the last post. The default read ahead
>>> is 1, which does result in "rsize * 2", since there is
>>> the read + 1 readahead.
>>>
>>> "rsize * 16" would actually be for the option "readahead=15"
>>> and for "readahead=16" the calculation would be "rsize * 17".
>>>
>>> However, the example was otherwise ok, I think? rick
>>
>> I've attached the patch drc3.patch (it assumes drc2.patch has already been
>> applied) that replaces the single mutex with one for each hash list
>> for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.
>>
>> These patches are also at:
>> http://people.freebsd.org/~rmacklem/drc2.patch
>> http://people.freebsd.org/~rmacklem/drc3.patch
>> in case the attachments don't get through.
>>
>> rick
>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
>
> drc3.patch applied and built cleanly and shows nice improvement!
>
> I've done a quick benchmark using iozone over the NFS mount from the Linux host.
>
> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
>
> TEST WITH 8K
> -------------------------------------------------------------------------------------------------
>         Auto Mode
>         Using Minimum Record Size 8 KB
>         Using Maximum Record Size 8 KB
>         Using minimum file size of 2097152 kilobytes.
>         Using maximum file size of 2097152 kilobytes.
>         O_DIRECT feature enabled
>         SYNC Mode.
>         OPS Mode. Output is in operations per second.
>         Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>                                                   random  random    bkwd  record  stride
>       KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
>  2097152       8    1919    1914    2356    2321    2335    1706
>
> TEST WITH 1M
> -------------------------------------------------------------------------------------------------
>         Auto Mode
>         Using Minimum Record Size 1024 KB
>         Using Maximum Record Size 1024 KB
>         Using minimum file size of 2097152 kilobytes.
>         Using maximum file size of 2097152 kilobytes.
>         O_DIRECT feature enabled
>         SYNC Mode.
>         OPS Mode. Output is in operations per second.
>         Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>                                                   random  random    bkwd  record  stride
>       KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
>  2097152    1024      73      64     477     486     496      61
>
>
> drc3.patch
>
> TEST WITH 8K
> -------------------------------------------------------------------------------------------------
>         Auto Mode
>         Using Minimum Record Size 8 KB
>         Using Maximum Record Size 8 KB
>         Using minimum file size of 2097152 kilobytes.
>         Using maximum file size of 2097152 kilobytes.
>         O_DIRECT feature enabled
>         SYNC Mode.
>         OPS Mode. Output is in operations per second.
>         Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>                                                   random  random    bkwd  record  stride
>       KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
>  2097152       8    2108    2397    3001    3013    3010    2389
>
>
> TEST WITH 1M
> -------------------------------------------------------------------------------------------------
>         Auto Mode
>         Using Minimum Record Size 1024 KB
>         Using Maximum Record Size 1024 KB
>         Using minimum file size of 2097152 kilobytes.
>         Using maximum file size of 2097152 kilobytes.
>         O_DIRECT feature enabled
>         SYNC Mode.
>         OPS Mode. Output is in operations per second.
>         Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>         Time Resolution = 0.000001 seconds.
>         Processor cache size set to 1024 Kbytes.
>         Processor cache line size set to 32 bytes.
>         File stride size set to 17 * record size.
>                                                   random  random    bkwd  record  stride
>       KB  reclen   write rewrite    read  reread    read   write    read rewrite    read  fwrite frewrite   fread freread
>  2097152    1024      80      79     521     536     528      75
>
>
> Also with drc3 the CPU usage on the server is noticeably lower. Most of the time I could see only the geom{g_up}/{g_down} threads,
> and a few nfsd threads, before that nfsd's were much more prominent.
>
> I guess under bigger load the performance improvement can be bigger.
>
> I'll run some more tests with heavier loads this week.
>
> Thanks,
> Nikolay
>

If anyone is interested, here's a FlameGraph generated using DTrace and
Brendan Gregg's tools from https://github.com/brendangregg/FlameGraph :

https://home.totalterror.net/freebsd/goliath-kernel.svg

It was sampled during an Oracle database restore from the Linux host over
the NFS mount. Currently all I/O on the dataset that the Linux machine
writes to is stuck: a simple ls in the directory hangs for maybe 10-15
minutes and then eventually completes.

Looks like some weird locking issue.

[*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/

P.S.: The machine runs with drc3.patch for the NFS server.
P.S.2: The nfsd server is configured with vfs.nfsd.maxthreads=200; maybe
that's too much?
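
For a quick read of the quoted iozone numbers, here is a minimal Python sketch that computes the relative change from the drc2.patch run to the drc3.patch run. It is not part of the original test runs. The values are copied from the tables above. The column labels assume the six filled columns correspond to the tests selected with -i 0 -i 1 -i 2 (write/rewrite, read/reread, random read/write). The names COLUMNS, RESULTS, percent_change and compare are made up for this illustration. Note that the drc2 run used NFSRVCACHE_HASHSIZE=500, so the two runs may differ in more than the locking change.

```python
# iozone results (operations per second) copied from the runs quoted above.
# Column labels are an assumption based on the "-i 0 -i 1 -i 2" test selection.
COLUMNS = ["write", "rewrite", "read", "reread", "random read", "random write"]

RESULTS = {
    "8k": {
        "drc2": [1919, 1914, 2356, 2321, 2335, 1706],
        "drc3": [2108, 2397, 3001, 3013, 3010, 2389],
    },
    "1m": {
        "drc2": [73, 64, 477, 486, 496, 61],
        "drc3": [80, 79, 521, 536, 528, 75],
    },
}


def percent_change(before, after):
    """Relative change from 'before' to 'after', in percent."""
    return (after - before) / before * 100.0


def compare(results):
    """Return {record_size: {column: percent change from drc2 to drc3}}."""
    return {
        size: {
            col: percent_change(old, new)
            for col, old, new in zip(COLUMNS, runs["drc2"], runs["drc3"])
        }
        for size, runs in results.items()
    }


if __name__ == "__main__":
    for size, changes in compare(RESULTS).items():
        for col, pct in changes.items():
            print(f"{size:>3} {col:<13} {pct:+6.1f}%")
```

These are single runs, so the percentages are only a rough indication and say nothing about run-to-run variance.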