From owner-freebsd-fs@FreeBSD.ORG Tue Jul  9 23:57:03 2013
Date: Tue, 9 Jul 2013 19:57:02 -0400 (EDT)
From: Rick Macklem
To: Garrett Wollman
Cc: freebsd-fs
Subject: Re: Terrible NFS4 performance: FreeBSD 9.1 + ZFS + AWS EC2
Message-ID: <74469452.3886197.1373414222081.JavaMail.root@uoguelph.ca>
In-Reply-To: <20955.29796.228750.131498@hergotha.csail.mit.edu>

Garrett Wollman wrote:
> < said:
>
> > Berend de Boer wrote:
> >> >>>>> "Rick" == Rick Macklem writes:
> >>
> Rick> After you apply the patch and boot the rebuilt kernel, the
> Rick> cpu overheads should be reduced after you increase the value
> Rick> of vfs.nfsd.tcphighwater.
> >>
> >> What number would I be looking at? 100? 100,000?
> >>
> > Garrett Wollman might have more insight into this, but I would say
> > on the order of 100s to maybe 1000s.
>
> On my production servers, I'm running with the following tuning
> (after Rick's drc4.patch):
>
> ----loader.conf----
> kern.ipc.nmbclusters="1048576"
> vfs.zfs.scrub_limit="16"
> vfs.zfs.vdev.max_pending="24"
> vfs.zfs.arc_max="48G"
> #
> # Tunable per mps(4).  We had significant numbers of allocation failures
> # with the default value of 2048, so bump it up and see whether there's
> # still an issue.
> #
> hw.mps.max_chains="4096"
> #
> # Simulate the 10-CURRENT autotuning of maxusers based on available memory
> #
> kern.maxusers="8509"
> #
> # Attempt to make the message buffer big enough to retain all the crap
> # that gets spewed on the console when we boot.  64K (the default) isn't
> # enough to even list all of the disks.
> #
> kern.msgbufsize="262144"
> #
> # Tell the TCP implementation to use the specialized, faster but possibly
> # fragile implementation of soreceive.  NFS calls soreceive() a lot and
> # using this implementation, if it works, should improve performance
> # significantly.
> #
> net.inet.tcp.soreceive_stream="1"
> #
> # Six queues per interface means twelve queues total
> # on this hardware, which is a good match for the number
> # of processor cores we have.
> #
> hw.ixgbe.num_queues="6"
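A minimal sketch (not from the original message) of a post-boot sanity
check: most of the tunables above are also exported as read-only sysctls,
so their effective values can be confirmed after the reboot. Which OIDs
are actually visible via sysctl is an assumption here; loader-only knobs
such as hw.ixgbe.num_queues may not appear.

    #!/bin/sh
    # Print the effective value of a few of the loader.conf tunables above.
    # Assumes each OID is exported as a sysctl; prints "not present" otherwise.
    for oid in kern.ipc.nmbclusters vfs.zfs.arc_max kern.maxusers \
        kern.msgbufsize net.inet.tcp.soreceive_stream; do
            printf '%-35s %s\n' "$oid" \
                "$(sysctl -n "$oid" 2>/dev/null || echo 'not present')"
    done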
> ----sysctl.conf----
> # Make sure that device interrupts are not throttled (10GbE can make
> # lots and lots of interrupts).
> hw.intr_storm_threshold=12000
>
> # If the NFS replay cache isn't larger than the number of operations nfsd
> # can perform in a second, the nfsd service threads will spend all of their
> # time contending for the mutex that protects the cache data structure so
> # that they can trim them.  If the cache is big enough, it will only do this
> # once a second.
> vfs.nfsd.tcpcachetimeo=300
> vfs.nfsd.tcphighwater=150000
>
> ----modules/nfs/server/freebsd.pp----
> exec {'sysctl vfs.nfsd.minthreads':
>   command => "sysctl vfs.nfsd.minthreads=${min_threads}",
>   onlyif  => "test $(sysctl -n vfs.nfsd.minthreads) -ne ${min_threads}",
>   require => Service['nfsd'],
> }
>
> exec {'sysctl vfs.nfsd.maxthreads':
>   command => "sysctl vfs.nfsd.maxthreads=${max_threads}",
>   onlyif  => "test $(sysctl -n vfs.nfsd.maxthreads) -ne ${max_threads}",
>   require => Service['nfsd'],
> }
>
> ($min_threads and $max_threads are manually configured based on
> hardware, currently 16/64 on 8-core machines and 16/96 on 12-core
> machines.)
>
> As this is the summer, we are currently very lightly loaded.  There's
> apparently still a bug in drc4.patch, because both of my non-scratch
> production servers show a negative CacheSize in nfsstat.
>
> (I hope that all of these patches will make it into 9.2 so we don't
> have to maintain our own mutant NFS implementation.)
>
Afraid not. I was planning on getting it in, but the release schedule
came out with very little time before the code slush. Hopefully a
cleaned-up version of this will be in 10.0 and 9.3.

rick

> -GAWollman
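For hosts that are not managed with Puppet, the same idempotent thread
tuning can be expressed as a plain /bin/sh script. This is only a sketch:
the sysctl names and the -ne test are taken from the manifest above, while
the 16/64 values merely stand in for the per-host ${min_threads} and
${max_threads}.

    #!/bin/sh
    # Idempotent equivalent of the two Puppet exec resources above.
    # MIN/MAX are placeholders for the per-host thread settings
    # (the 8-core example from the message).
    MIN=16
    MAX=64
    if [ "$(sysctl -n vfs.nfsd.minthreads)" -ne "$MIN" ]; then
            sysctl vfs.nfsd.minthreads="$MIN"
    fi
    if [ "$(sysctl -n vfs.nfsd.maxthreads)" -ne "$MAX" ]; then
            sysctl vfs.nfsd.maxthreads="$MAX"
    fi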