Date: Mon, 8 Jul 2013 22:24:36 -0400
From: Garrett Wollman
To: Rick Macklem
Cc: freebsd-fs
Subject: Re: Terrible NFS4 performance: FreeBSD 9.1 + ZFS + AWS EC2
Message-ID: <20955.29796.228750.131498@hergotha.csail.mit.edu>
In-Reply-To: <27783474.3353362.1373334232356.JavaMail.root@uoguelph.ca>
References: <87ppuszgth.wl%berend@pobox.com>
        <27783474.3353362.1373334232356.JavaMail.root@uoguelph.ca>

<27783474.3353362.1373334232356.JavaMail.root@uoguelph.ca> said:

> Berend de Boer wrote:
>> >>>>> "Rick" == Rick Macklem writes:
>>
>> Rick> After you apply the patch and boot the rebuilt kernel, the
>> Rick> cpu overheads should be reduced after you increase the value
>> Rick> of vfs.nfsd.tcphighwater.
>>
>> What number would I be looking at? 100? 100,000?
>>
> Garrett Wollman might have more insight into this, but I would say on
> the order of 100s to maybe 1000s.

On my production servers, I'm running with the following tuning (after
Rick's drc4.patch):

----loader.conf----
kern.ipc.nmbclusters="1048576"
vfs.zfs.scrub_limit="16"
vfs.zfs.vdev.max_pending="24"
vfs.zfs.arc_max="48G"
#
# Tunable per mps(4). We had significant numbers of allocation failures
# with the default value of 2048, so bump it up and see whether there's
# still an issue.
#
hw.mps.max_chains="4096"
#
# Simulate the 10-CURRENT autotuning of maxusers based on available memory
#
kern.maxusers="8509"
#
# Attempt to make the message buffer big enough to retain all the crap
# that gets spewed on the console when we boot. 64K (the default) isn't
# enough to even list all of the disks.
#
kern.msgbufsize="262144"
#
# Tell the TCP implementation to use the specialized, faster but possibly
# fragile implementation of soreceive. NFS calls soreceive() a lot and
# using this implementation, if it works, should improve performance
# significantly.
#
net.inet.tcp.soreceive_stream="1"
#
# Six queues per interface means twelve queues total
# on this hardware, which is a good match for the number
# of processor cores we have.
#
hw.ixgbe.num_queues="6"

----sysctl.conf----
# Make sure that device interrupts are not throttled (10GbE can make
# lots and lots of interrupts).
hw.intr_storm_threshold=12000
# If the NFS replay cache isn't larger than the number of operations nfsd
# can perform in a second, the nfsd service threads will spend all of their
# time contending for the mutex that protects the cache data structure so
# that they can trim them. If the cache is big enough, it will only do this
# once a second.
vfs.nfsd.tcpcachetimeo=300
vfs.nfsd.tcphighwater=150000

----modules/nfs/server/freebsd.pp----
exec {'sysctl vfs.nfsd.minthreads':
  command => "sysctl vfs.nfsd.minthreads=${min_threads}",
  onlyif  => "test $(sysctl -n vfs.nfsd.minthreads) -ne ${min_threads}",
  require => Service['nfsd'],
}

exec {'sysctl vfs.nfsd.maxthreads':
  command => "sysctl vfs.nfsd.maxthreads=${max_threads}",
  onlyif  => "test $(sysctl -n vfs.nfsd.maxthreads) -ne ${max_threads}",
  require => Service['nfsd'],
}

($min_threads and $max_threads are manually configured based on
hardware, currently 16/64 on 8-core machines and 16/96 on 12-core
machines.)

As this is the summer, we are currently very lightly loaded. There's
apparently still a bug in drc4.patch, because both of my non-scratch
production servers show a negative CacheSize in nfsstat. (I hope that
all of these patches will make it into 9.2 so we don't have to
maintain our own mutant NFS implementation.)

-GAWollman
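
A quick sketch (not part of the original tuning files) of reading back
the boot-time tunables from the loader.conf listing above after a
reboot; it assumes net.inet.tcp.soreceive_stream is exported as a
read-only sysctl, as it is on stock FreeBSD:

# Read back the values set in loader.conf to confirm they took effect.
sysctl kern.ipc.nmbclusters kern.maxusers kern.msgbufsize
sysctl net.inet.tcp.soreceive_stream
# Show mbuf/cluster usage and denied allocations, to judge whether
# kern.ipc.nmbclusters is sized sensibly for the 10GbE load.
netstat -m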
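
For the DRC settings discussed above, a minimal sketch of adjusting and
observing them at runtime; the sysctl values are copied from the
sysctl.conf listing, and the column names are those printed by a stock
9.x nfsstat, which may differ once drc4.patch is applied:

# Raise the replay-cache highwater mark and cache timeout on a running
# server (same values as in the sysctl.conf above).
sysctl vfs.nfsd.tcphighwater=150000
sysctl vfs.nfsd.tcpcachetimeo=300
# The "Server Cache Stats:" section of the new-NFS server statistics
# includes the CacheSize and TCPPeak counters; the negative CacheSize
# mentioned above would show up here.
nfsstat -e -s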