From: Adam Nowacki <nowakpl@platinum.linux.pl>
Date: Tue, 20 Nov 2012 08:55:53 +0100
To: freebsd-fs@freebsd.org
Subject: Re: ZFS FAQ (Was: SSD recommendations for ZFS cache/log)

On 2012-11-20 05:59, Charles Sprickman wrote:
> Wonderful to see some work on this.
>
> One of the great remaining ZFS mysteries is all the tunables that
> live under "vfs.zfs.*". Obviously there are plenty of read-only
> items there, but only conflicting information gathered from random
> forum posts and commit messages exists about what exactly one can do
> regarding tuning beyond ARC sizing.
>
> If you have any opportunity to work with the people who have ported
> and are now maintaining ZFS, it would be really wonderful to get
> some feedback from them on what knobs are safe to twiddle and why.
> I suspect many of the tunable items don't really have meaningful
> equivalents in Sun's implementation, since the way ZFS falls under
> the VFS layer in FreeBSD is so different.
>
> Thanks,
>
> Charles

I'll share my experience from tuning my home NAS:

vfs.zfs.write_limit_* is a mess. Six sysctls work together to produce a
single value: the maximum size of a txg commit. If the amount of data not
yet written to disk grows to this size, a txg commit is forced. There is
a catch, though: this size is only an estimate, and an absolute worst-case
one at that - the logical write size is multiplied by 24 (the reason for
this madness is quoted at the end). This means that writing a 1MB file
results in an estimated txg commit size of 24MB (plus metadata).
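To make that worst-case arithmetic concrete, here is a tiny C sketch of
the estimate as I described it above - the constant and function names
are mine, not the ones used in the ZFS sources:

#include <stdio.h>
#include <stdint.h>

/*
 * Worst-case inflation as quoted at the end of this mail:
 * (max parity + 1) = 4, times 3 DVAs per block pointer, times 2
 * for ditto blocks written by ddt_sync().
 */
#define WRITE_INFLATION 24

static uint64_t
estimated_txg_bytes(uint64_t logical_bytes)
{
	return (logical_bytes * WRITE_INFLATION);
}

int
main(void)
{
	/* A 1MB write is charged as 24MB against the write limit. */
	printf("%ju\n", (uintmax_t)estimated_txg_bytes(1024 * 1024));
	return (0);
}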
Back to the sysctls:

# vfs.zfs.write_limit_override - if not 0, absolutely overrides the write
  limit (all other sysctls are ignored); if 0, an internal dynamically
  computed value is used, based on:
# vfs.zfs.txg.synctime_ms - adjusts the write limit based on previous txg
  commits so that the time to write equals this many milliseconds
  (basically an estimate of the disks' write bandwidth),
# vfs.zfs.write_limit_shift - sets vfs.zfs.write_limit_max to
  RAM size / 2^write_limit_shift,
# vfs.zfs.write_limit_max - used to derive vfs.zfs.write_limit_inflated
  (multiplied by 24), but only if vfs.zfs.write_limit_shift is not 0,
# vfs.zfs.write_limit_inflated - maximum size of the dynamic write limit,
# vfs.zfs.write_limit_min - minimum size of the dynamic write limit,

and, to have the whole picture:

# vfs.zfs.txg.timeout - force a txg commit every this many seconds if one
  was not already triggered by the write limit.

For my home NAS (10x 2TB disks encrypted with geli in raidz2, a CPU with
hardware AES, 16GB RAM, 2x 1GbE for Samba and iSCSI with MCS) I ended up
with:

/boot/loader.conf:
vfs.zfs.write_limit_shift="4"   # 16GB RAM / 2^4 = 1GB limit
vfs.zfs.write_limit_min="2400M" # 100MB minimum, multiplied by the 24
                                # factor; during heavy read-write loads the
                                # dynamic write limit would enter a positive
                                # feedback loop and shrink the limit too much
vfs.zfs.txg.synctime_ms="2000"  # try to maintain a 2 second commit time
                                # during large writes
vfs.zfs.txg.timeout="120"       # 2 minutes, to reduce fragmentation and
                                # wear from small writes; worst case, 2
                                # minutes of asynchronous writes are lost,
                                # synchronous writes end up in the ZIL anyway

and for completeness:

vfs.zfs.arc_min="10000M"
vfs.zfs.arc_max="10000M"
vfs.zfs.vdev.cache.size="16M"   # the vdev cache helps a lot during scrubs
vfs.zfs.vdev.cache.bshift="14"  # grow all i/o requests to 16KiB; smaller
                                # requests showed the same latency, so we
                                # might as well get more "for free"
vfs.zfs.vdev.cache.max="16384"
vfs.zfs.vdev.write_gap_limit="0"
vfs.zfs.vdev.read_gap_limit="131072"
vfs.zfs.vdev.aggregation_limit="131072" # group smaller reads into one
                                # larger read; benchmarking showed no
                                # appreciable latency increase while again
                                # getting more bytes per request
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"    # seems to help txg commit bandwidth by
                                # reducing seeking caused by parallel reads
                                # (not fully tested)

And the reason for the factor of 24 (4 * 3 * 2, from the code):

/*
 * The worst case is single-sector max-parity RAID-Z blocks, in which
 * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
 * times the size; so just assume that.  Add to this the fact that
 * we can have up to 3 DVAs per bp, and one more factor of 2 because
 * the block may be dittoed with up to 3 DVAs by ddt_sync().
 */
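To tie the sysctls above together, the way I understand the dynamic limit
to be derived looks roughly like the C below. This is only a sketch of my
mental model, not the actual ZFS code, and every name in it is made up:

#include <stdint.h>

/*
 * Illustrative only: how vfs.zfs.write_limit_override, _shift, _max,
 * _inflated and _min appear to combine into the limit that forces a
 * txg commit (assuming write_limit_shift != 0, as in my config).
 */
static uint64_t
effective_write_limit(uint64_t physmem, uint64_t override, int shift,
    uint64_t min, uint64_t dynamic)
{
	uint64_t max, inflated;

	/* vfs.zfs.write_limit_override, when nonzero, wins outright. */
	if (override != 0)
		return (override);

	/* vfs.zfs.write_limit_max = RAM / 2^vfs.zfs.write_limit_shift. */
	max = physmem >> shift;

	/* vfs.zfs.write_limit_inflated = write_limit_max * 24. */
	inflated = max * 24;

	/*
	 * The dynamic limit, adjusted after every txg toward a commit
	 * time of vfs.zfs.txg.synctime_ms, is clamped between
	 * vfs.zfs.write_limit_min and vfs.zfs.write_limit_inflated.
	 */
	if (dynamic < min)
		dynamic = min;
	if (dynamic > inflated)
		dynamic = inflated;
	return (dynamic);
}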