From: Adam Nowacki <nowakpl@platinum.linux.pl>
Date: Tue, 20 Nov 2012 08:55:53 +0100
To: freebsd-fs@freebsd.org
Subject: Re: ZFS FAQ (Was: SSD recommendations for ZFS cache/log)

On 2012-11-20 05:59, Charles Sprickman wrote:
> Wonderful to see some work on this.
>
> One of the great remaining ZFS mysteries is all the tunables that
> live under "vfs.zfs.*". Obviously there are plenty of read-only
> items there, but only conflicting information gathered from random
> forum posts and commit messages exists about what exactly one can do
> regarding tuning beyond ARC sizing.
>
> If you have any opportunity to work with the people who have ported
> and are now maintaining ZFS, it would be really wonderful to get
> some feedback from them on what knobs are safe to twiddle and why.
> I suspect many of the tunable items don't really have meaningful
> equivalents in Sun's implementation, since the way ZFS falls under
> the VFS layer in FreeBSD is so different.
>
> Thanks,
>
> Charles

I'll share my experience from tuning my home NAS:

vfs.zfs.write_limit_* is a mess. Six sysctls work together to produce a
single value: the maximum size of a txg commit. If the amount of data not
yet written to disk grows to this size, a txg commit is forced. There is
a catch, though: this size is only an estimate, and an absolute worst-case
one at that - the logical write size is multiplied by 24 (the reason for
this madness is quoted at the end). This means that writing a 1MB file
results in an estimated txg commit size of 24MB (plus metadata).
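To make that worst-case arithmetic concrete, here is a tiny C sketch of
the estimate as I described it above - the constant and function names
are mine, not the ones used in the ZFS sources:

#include <stdio.h>
#include <stdint.h>

/*
 * Worst-case inflation as quoted at the end of this mail:
 * (max parity + 1) = 4, times 3 DVAs per block pointer, times 2
 * for ditto blocks written by ddt_sync().
 */
#define WRITE_INFLATION 24

static uint64_t
estimated_txg_bytes(uint64_t logical_bytes)
{
	return (logical_bytes * WRITE_INFLATION);
}

int
main(void)
{
	/* A 1MB write is charged as 24MB against the write limit. */
	printf("%ju\n", (uintmax_t)estimated_txg_bytes(1024 * 1024));
	return (0);
}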
Back to the sysctls:

# vfs.zfs.write_limit_override - if not 0, absolutely overrides the write
  limit (all other sysctls are ignored); if 0, an internal dynamically
  computed value is used, based on:
# vfs.zfs.txg.synctime_ms - adjusts the write limit based on previous txg
  commits so that the time to write equals this many milliseconds
  (basically an estimate of the disks' write bandwidth),
# vfs.zfs.write_limit_shift - sets vfs.zfs.write_limit_max to
  RAM size / 2^write_limit_shift,
# vfs.zfs.write_limit_max - used to derive vfs.zfs.write_limit_inflated
  (multiplied by 24), but only if vfs.zfs.write_limit_shift is not 0,
# vfs.zfs.write_limit_inflated - maximum size of the dynamic write limit,
# vfs.zfs.write_limit_min - minimum size of the dynamic write limit,

and, to have the whole picture:

# vfs.zfs.txg.timeout - force a txg commit every this many seconds if one
  was not already triggered by the write limit.

For my home NAS (10x 2TB disks encrypted with geli in raidz2, a CPU with
hardware AES, 16GB RAM, 2x 1GbE for Samba and iSCSI with MCS) I ended up
with:

/boot/loader.conf:
vfs.zfs.write_limit_shift="4"   # 16GB RAM / 2^4 = 1GB limit
vfs.zfs.write_limit_min="2400M" # 100MB minimum, multiplied by the 24
                                # factor; during heavy read-write loads the
                                # dynamic write limit would enter a positive
                                # feedback loop and shrink the limit too much
vfs.zfs.txg.synctime_ms="2000"  # try to maintain a 2 second commit time
                                # during large writes
vfs.zfs.txg.timeout="120"       # 2 minutes, to reduce fragmentation and
                                # wear from small writes; worst case, 2
                                # minutes of asynchronous writes are lost,
                                # synchronous writes end up in the ZIL anyway

and for completeness:

vfs.zfs.arc_min="10000M"
vfs.zfs.arc_max="10000M"
vfs.zfs.vdev.cache.size="16M"   # the vdev cache helps a lot during scrubs
vfs.zfs.vdev.cache.bshift="14"  # grow all i/o requests to 16KiB; smaller
                                # requests showed the same latency, so we
                                # might as well get more "for free"
vfs.zfs.vdev.cache.max="16384"
vfs.zfs.vdev.write_gap_limit="0"
vfs.zfs.vdev.read_gap_limit="131072"
vfs.zfs.vdev.aggregation_limit="131072" # group smaller reads into one
                                # larger read; benchmarking showed no
                                # appreciable latency increase while again
                                # getting more bytes per request
vfs.zfs.vdev.min_pending="1"
vfs.zfs.vdev.max_pending="1"    # seems to help txg commit bandwidth by
                                # reducing seeking caused by parallel reads
                                # (not fully tested)

And the reason for the factor of 24 (4 * 3 * 2, from the code):

/*
 * The worst case is single-sector max-parity RAID-Z blocks, in which
 * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
 * times the size; so just assume that.  Add to this the fact that
 * we can have up to 3 DVAs per bp, and one more factor of 2 because
 * the block may be dittoed with up to 3 DVAs by ddt_sync().
 */
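To tie the sysctls above together, the way I understand the dynamic limit
to be derived looks roughly like the C below. This is only a sketch of my
mental model, not the actual ZFS code, and every name in it is made up:

#include <stdint.h>

/*
 * Illustrative only: how vfs.zfs.write_limit_override, _shift, _max,
 * _inflated and _min appear to combine into the limit that forces a
 * txg commit (assuming write_limit_shift != 0, as in my config).
 */
static uint64_t
effective_write_limit(uint64_t physmem, uint64_t override, int shift,
    uint64_t min, uint64_t dynamic)
{
	uint64_t max, inflated;

	/* vfs.zfs.write_limit_override, when nonzero, wins outright. */
	if (override != 0)
		return (override);

	/* vfs.zfs.write_limit_max = RAM / 2^vfs.zfs.write_limit_shift. */
	max = physmem >> shift;

	/* vfs.zfs.write_limit_inflated = write_limit_max * 24. */
	inflated = max * 24;

	/*
	 * The dynamic limit, adjusted after every txg toward a commit
	 * time of vfs.zfs.txg.synctime_ms, is clamped between
	 * vfs.zfs.write_limit_min and vfs.zfs.write_limit_inflated.
	 */
	if (dynamic < min)
		dynamic = min;
	if (dynamic > inflated)
		dynamic = inflated;
	return (dynamic);
}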