From: Bob Friesenhahn
Date: Thu, 21 Jul 2011 17:03:14 -0500 (CDT)
To: Freddie Cash
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and large directories - caveat report

On Thu, 21 Jul 2011, Freddie Cash wrote:

> The recordsize property in ZFS is the "max" block size used.  It is not the
> only block size used for a dataset.  ZFS will use any block size from 0.5 KB
> to $recordsize KB, as determined by the size of the file to be written (it
> tries to the find the recordsize that most closely matches the file size to
> use the least number of blocks per write).

Except for tail blocks (last block in a file), the uncompressed data block size will always be the "max" block size.
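
The rule just stated (full-size blocks everywhere except the tail) can be sketched as a toy model. This is only an illustration of the description in this message, not ZFS's actual allocator: the function name `split_into_blocks` is hypothetical, the 128 KiB value is assumed as the recordsize, and sector rounding and compression are ignored.

```python
def split_into_blocks(file_size, recordsize=128 * 1024):
    """Toy model: uncompressed block sizes for a file of file_size bytes.

    Every block is recordsize bytes except possibly the last (tail) block,
    which holds whatever is left over. Hypothetical helper, not a ZFS API.
    """
    if file_size < 0 or recordsize <= 0:
        raise ValueError("file_size must be >= 0 and recordsize must be > 0")
    full_blocks, tail = divmod(file_size, recordsize)
    blocks = [recordsize] * full_blocks
    if tail:
        # Only the tail block is allowed to be smaller than recordsize.
        blocks.append(tail)
    return blocks


if __name__ == "__main__":
    # A 300 KiB file with a 128 KiB recordsize: two full blocks plus a tail.
    print(split_into_blocks(300 * 1024))
```

Note that the block sizes depend only on the data written so far, so the model never needs to know the final file size in advance.
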
When compression is enabled, that "max" block size is likely to be reduced to something smaller (due to the compression), and zfs will use a smaller block size on disk. This approach minimizes the performance impact from fragmentation, copy on write (COW), and block metadata.

It would not make sense for zfs to behave as you describe since files are written starting from scratch and so zfs has no knowledge of the final file size until it is completely written (and even then, more data could be written, or the file might be truncated). Zfs could have knowledge of a file size if the application did a seek to the ultimate length and wrote something, or used ftruncate to set the size, but the file size can still be arbitrarily changed.

When raidzN is used, the data block is split into smaller chunks which are distributed among the disks. When mirroring is used, full blocks are written to each disk.

It is important to realize that the zfs block checksum is for the uncompressed/unsplit original data block and not for some bit of data which eventually ended up on a disk. For example, when raidz is used, there is no independent checksum for the data chunks distributed across the disks. The zfs approach assures end-to-end validation and avoids having to recompute all data checksums (perhaps incorrectly) when doing 'zfs send'.

Zfs metadata sizes are not related to the zfs block size.

Bob
-- 
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
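
The point above about checksumming the unsplit block can be illustrated with a small sketch. It is an assumption-laden toy, not how raidz is implemented: `write_block` and `read_block` are hypothetical names, parity and sector alignment are left out, and SHA-256 from Python's hashlib stands in for whatever checksum the pool is configured to use. The only thing it shows is that one checksum covers the whole block, and the per-disk chunks carry none of their own.

```python
import hashlib


def write_block(data: bytes, n_data_disks: int):
    """Checksum the whole block first, then split it across the data disks."""
    if n_data_disks <= 0:
        raise ValueError("n_data_disks must be positive")
    checksum = hashlib.sha256(data).hexdigest()
    chunk_len = -(-len(data) // n_data_disks)  # ceiling division
    # The chunks are stored without any independent checksum.
    chunks = [data[i * chunk_len:(i + 1) * chunk_len]
              for i in range(n_data_disks)]
    return checksum, chunks


def read_block(checksum: str, chunks):
    """Reassemble the block and validate it against the single block checksum."""
    data = b"".join(chunks)
    if hashlib.sha256(data).hexdigest() != checksum:
        raise ValueError("block checksum mismatch")
    return data


if __name__ == "__main__":
    block = bytes(range(256)) * 4
    checksum, chunks = write_block(block, 3)
    print(read_block(checksum, chunks) == block)
```

In this model a damaged chunk is only detected once the whole block has been reassembled and checked, which is the end-to-end property described above.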