From owner-freebsd-fs@FreeBSD.ORG Mon Oct 15 15:08:24 2012
From: Nikolay Denev <ndenev@gmail.com>
To: Rick Macklem
Cc: "freebsd-fs@freebsd.org"
Date: Mon, 15 Oct 2012 18:08:19 +0300
Subject: Re: Bad ZFS - NFS interaction? [ was: NFS server bottlenecks ]

On Oct 15, 2012, at 5:06 PM, Rick Macklem wrote:

> Nikolay Denev wrote:
>> On Oct 13, 2012, at 6:22 PM, Nikolay Denev wrote:
>>
>>> On Oct 13, 2012, at 5:05 AM, Rick Macklem wrote:
>>>
>>>> I wrote:
>>>>> Oops, I didn't get the "readahead" option description
>>>>> quite right in the last post. The default readahead
>>>>> is 1, which does result in "rsize * 2", since there is
>>>>> the read + 1 readahead.
>>>>>
>>>>> "rsize * 16" would actually be for the option "readahead=15",
>>>>> and for "readahead=16" the calculation would be "rsize * 17".
>>>>>
>>>>> However, the example was otherwise ok, I think? rick
>>>>
>>>> I've attached the patch drc3.patch (it assumes drc2.patch has already
>>>> been applied) that replaces the single mutex with one for each hash
>>>> list for tcp. It also increases the size of NFSRVCACHE_HASHSIZE to 200.
>>>>
>>>> These patches are also at:
>>>> http://people.freebsd.org/~rmacklem/drc2.patch
>>>> http://people.freebsd.org/~rmacklem/drc3.patch
>>>> in case the attachments don't get through.
>>>>
>>>> rick
>>>> ps: I haven't tested drc3.patch a lot, but I think it's ok?
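For context, the change drc3.patch describes is the classic "one lock per
hash chain" pattern: instead of every nfsd thread serializing on a single
global duplicate-request-cache (DRC) mutex, only requests that hash to the
same chain contend with each other. A minimal sketch of that shape, with
made-up names (this is not the actual drc2/drc3 code):

    /*
     * Sketch only: one mutex per hash chain instead of a single global
     * mutex for the whole DRC.  All identifiers are hypothetical.
     */
    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    #define	DRC_HASHSIZE	200	/* drc3 bumps NFSRVCACHE_HASHSIZE to 200 */

    struct drc_entry {
    	LIST_ENTRY(drc_entry)	de_hash;
    	uint32_t		de_xid;
    	/* ... cached reply, timestamps, client address ... */
    };

    static struct drc_bucket {
    	struct mtx		db_mtx;		/* protects db_head only */
    	LIST_HEAD(, drc_entry)	db_head;
    } drc_table[DRC_HASHSIZE];

    static void
    drc_init(void)
    {
    	int i;

    	for (i = 0; i < DRC_HASHSIZE; i++) {
    		mtx_init(&drc_table[i].db_mtx, "drcbucket", NULL, MTX_DEF);
    		LIST_INIT(&drc_table[i].db_head);
    	}
    }

    /*
     * Only requests hashing to the same chain serialize here; with one
     * global mutex, every nfsd thread would contend on every lookup.
     */
    static struct drc_entry *
    drc_lookup(uint32_t xid)
    {
    	struct drc_bucket *db;
    	struct drc_entry *de;

    	db = &drc_table[xid % DRC_HASHSIZE];
    	mtx_lock(&db->db_mtx);
    	LIST_FOREACH(de, &db->db_head, de_hash) {
    		if (de->de_xid == xid) {  /* real code matches more fields */
    			mtx_unlock(&db->db_mtx);
    			return (de);
    		}
    	}
    	mtx_unlock(&db->db_mtx);
    	return (NULL);
    }

Raising NFSRVCACHE_HASHSIZE to 200 complements the finer locking: shorter
chains mean both shorter walks and shorter lock hold times.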
>>>
>>> drc3.patch applied and built cleanly, and it shows a nice improvement!
>>>
>>> I've done a quick benchmark using iozone over the NFS mount from the
>>> Linux host.
>>>
>>> drc2.patch (but with NFSRVCACHE_HASHSIZE=500)
>>>
>>> TEST WITH 8K
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 8 KB
>>>     Using Maximum Record Size 8 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152       8      1919      1914      2356      2321      2335      1706
>>>
>>> TEST WITH 1M
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 1024 KB
>>>     Using Maximum Record Size 1024 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152    1024        73        64       477       486       496        61
>>>
>>>
>>> drc3.patch
>>>
>>> TEST WITH 8K
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 8 KB
>>>     Using Maximum Record Size 8 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 8k -q 8k -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152       8      2108      2397      3001      3013      3010      2389
>>>
>>>
>>> TEST WITH 1M
>>> -------------------------------------------------------------------------------------------------
>>>     Auto Mode
>>>     Using Minimum Record Size 1024 KB
>>>     Using Maximum Record Size 1024 KB
>>>     Using minimum file size of 2097152 kilobytes.
>>>     Using maximum file size of 2097152 kilobytes.
>>>     O_DIRECT feature enabled
>>>     SYNC Mode.
>>>     OPS Mode. Output is in operations per second.
>>>     Command line used: iozone -a -y 1m -q 1m -n 2g -g 2g -C -I -o -O -i 0 -i 1 -i 2
>>>     Time Resolution = 0.000001 seconds.
>>>     Processor cache size set to 1024 Kbytes.
>>>     Processor cache line size set to 32 bytes.
>>>     File stride size set to 17 * record size.
>>>                                                          random    random      bkwd    record    stride
>>>             KB  reclen     write   rewrite      read    reread      read     write      read   rewrite      read    fwrite  frewrite     fread   freread
>>>        2097152    1024        80        79       521       536       528        75
>>>
>>>
>>> Also, with drc3 the CPU usage on the server is noticeably lower. Most
>>> of the time I could see only the geom {g_up}/{g_down} threads and a few
>>> nfsd threads; before that, the nfsd threads were much more prominent.
>>>
>>> I guess the performance improvement would be bigger under heavier load.
>>>
>>> I'll run some more tests with heavier loads this week.
>>>
>>> Thanks,
>>> Nikolay
>>>
>>
>> If anyone is interested, here's a flame graph [*] generated using DTrace
>> and Brendan Gregg's tools from https://github.com/brendangregg/FlameGraph :
>>
>> https://home.totalterror.net/freebsd/goliath-kernel.svg
>>
>> It was sampled during an Oracle database restore from the Linux host over
>> the NFS mount.
>> Currently all I/O on the dataset that the Linux machine writes to is
>> stuck; a simple ls in the directory hangs for maybe 10-15 minutes and
>> then eventually completes.
>>
>> Looks like some weird locking issue.
>>
>> [*] http://dtrace.org/blogs/brendan/2011/12/16/flame-graphs/
>>
>> P.S.: The machine runs with drc3.patch for the NFS server.
>> P.S.2: The nfsd server is configured with vfs.nfsd.maxthreads=200;
>> maybe that's too much?
>>
> You could try trimming vfs.nfsd.tcphighwater down. Remember that, with
> this patch, when you increase this tunable, you are trading space for
> CPU overhead.
>
> If it's still "running", you could do "vmstat -m" and "vmstat -z" to
> see where the memory is allocated. ("nfsstat -e -s" will tell you the
> size of the cache.)
>
> rick

Are you saying that the time spent in _mtx_spin_lock could be because of
this? To me it looks like there was some heavy contention in ZFS, maybe
specific to the way it's accessed by the NFS server? Probably due to the
high maxthreads value?

Here's the nfsstat -s -e output; it seems wrong, as CacheSize is a
negative number. Maybe it overflowed?

Server:
 Retfailed    Faults   Clients
         0         0         0
 OpenOwner     Opens LockOwner     Locks    Delegs
         0         0         0         0         0
Server Cache Stats:
    Inprog      Idem  Non-idem    Misses CacheSize   TCPPeak
         0         0         0  83500632    -24072     16385

Also, here are the following sysctls:

vfs.nfsd.request_space_used: 0
vfs.nfsd.request_space_used_highest: 13121808
vfs.nfsd.request_space_high: 13107200
vfs.nfsd.request_space_low: 8738133
vfs.nfsd.request_space_throttled: 0
vfs.nfsd.request_space_throttle_count: 0

Are they related to the same request cache?

I have stats showing that at some point nfsd had allocated all 200
threads, and vfs.nfsd.request_space_used hit the ceiling too.
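A note on that negative CacheSize: plain 32-bit overflow seems unlikely,
since Misses is only about 83 million, nowhere near the 2^31 limit of a
signed counter. A value like -24072 is also what you would see if the
entry counter were updated by many nfsd threads without synchronization,
losing increments and decrements to races. Purely as an illustration of
that failure mode (a userland sketch with made-up names, not the actual
nfsd code):

    /* Build with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define ITERS    1000000

    static volatile int cachesize;	/* shared counter, no locking: the bug */

    static void *
    worker(void *arg)
    {
    	(void)arg;
    	for (int i = 0; i < ITERS; i++) {
    		cachesize++;	/* "add an entry to the cache" */
    		cachesize--;	/* "remove it again" */
    	}
    	return (NULL);
    }

    int
    main(void)
    {
    	pthread_t tid[NTHREADS];

    	for (int i = 0; i < NTHREADS; i++)
    		pthread_create(&tid[i], NULL, worker, NULL);
    	for (int i = 0; i < NTHREADS; i++)
    		pthread_join(tid[i], NULL);

    	/*
    	 * Balanced ++/-- should leave 0; lost updates typically leave
    	 * a skewed value, positive or negative.
    	 */
    	printf("cachesize = %d\n", cachesize);
    	return (0);
    }

If something like that were happening, protecting the counter with one of
the per-bucket mutexes, or updating it with atomic_add_int()/
atomic_subtract_int(), would fix the accounting.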