From owner-freebsd-fs@FreeBSD.ORG  Thu Jul 21 22:03:15 2011
Date: Thu, 21 Jul 2011 17:03:14 -0500 (CDT)
From: Bob Friesenhahn <bfriesen@simple.dallas.tx.us>
X-X-Sender: bfriesen@freddy.simplesystems.org
To: Freddie Cash <fjwcash@gmail.com>
In-Reply-To: <CAOjFWZ7mj+CxXrzqt-OK3XrXLp4DZHYhcGBewf7shaKRAdv63g@mail.gmail.com>
Message-ID: <alpine.GSO.2.01.1107211646360.5109@freddy.simplesystems.org>
References: <j09hk8$svj$1@dough.gmane.org> <4E286F1F.6010502@FreeBSD.org>
	<CAOjFWZ7mj+CxXrzqt-OK3XrXLp4DZHYhcGBewf7shaKRAdv63g@mail.gmail.com>
User-Agent: Alpine 2.01 (GSO 1266 2009-07-14)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and large directories - caveat report

On Thu, 21 Jul 2011, Freddie Cash wrote:
>>
> The recordsize property in ZFS is the "max" block size used.  It is not the
> only block size used for a dataset.  ZFS will use any block size from 0.5 KB
> to $recordsize KB, as determined by the size of the file to be written (it
> tries to find the recordsize that most closely matches the file size to
> use the least number of blocks per write).

Except for tail blocks (the last block in a file), the uncompressed
data block size will always be the "max" block size.  When compression
is enabled, a full-size block usually compresses to something smaller,
and zfs then uses that smaller size on disk.  This approach minimizes
the performance impact from fragmentation, copy on write (COW), and
block metadata.
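
As a rough illustration of the layout described above, here is a small
Python sketch.  It models only the description in this message, not the
real zfs allocator.  The function name logical_blocks, the 128 KB
default recordsize, and the rounding of the tail block up to a 512-byte
multiple are all assumptions made for the example.

```python
def logical_blocks(file_size, recordsize=128 * 1024, sector=512):
    """Return the uncompressed block sizes for a file of file_size bytes.

    Hypothetical model: every block is a full recordsize block except
    the tail block, which is assumed to be rounded up to a multiple of
    `sector` bytes.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full, tail = divmod(file_size, recordsize)
    sizes = [recordsize] * full
    if tail:
        # Ceiling division, then scale back up to a sector multiple.
        sizes.append(-(-tail // sector) * sector)
    return sizes


if __name__ == "__main__":
    for size in (1000, 128 * 1024, 300 * 1024):
        print(size, logical_blocks(size))
```

In this model a 300 KB file occupies two full 128 KB blocks plus one
smaller tail block; nothing in between is sized to "fit" the file.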

It would not make sense for zfs to behave as you describe since files 
are written starting from scratch and so zfs has no knowledge of the 
final file size until it is completely written (and even then, more 
data could be written, or the file might be truncated).  Zfs could 
have knowledge of a file size if the application did a seek to the 
ultimate length and wrote something, or used ftruncate to set the 
size, but the file size can still be arbitrarily changed.
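
The point about file sizes can be seen from the application side with
a short Python sketch.  It uses only the standard os and tempfile
modules; the helper name size_history and the particular sizes are
made up for the example, and it says nothing about what zfs does
internally.

```python
import os
import tempfile


def size_history():
    """Record the size of one file as an application reshapes it."""
    sizes = []
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()

        # Ordinary sequential write: the size grows as data arrives.
        os.write(fd, b"x" * 1000)
        sizes.append(os.fstat(fd).st_size)

        # Seek to the intended final length and write a single byte.
        os.lseek(fd, 1_000_000 - 1, os.SEEK_SET)
        os.write(fd, b"\0")
        sizes.append(os.fstat(fd).st_size)

        # The size can still be changed arbitrarily afterwards.
        os.ftruncate(fd, 10)
        sizes.append(os.fstat(fd).st_size)
        os.ftruncate(fd, 5_000_000)
        sizes.append(os.fstat(fd).st_size)
    return sizes


if __name__ == "__main__":
    print(size_history())
```

At no step does the filesystem have a reliable "final" size to plan
block sizes around.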

When raidzN is used, the data block is split into smaller chunks which 
are distributed among the disks.  When mirroring is used, full blocks 
are written to each disk.
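
A deliberately simplified Python sketch of that difference follows.
It ignores parity, sector alignment, and padding, so it is not the
real raidz layout; the function names raidz_split and mirror_copies
and the even split across data disks are assumptions for illustration
only.

```python
def raidz_split(block, data_disks):
    """Split one block into per-disk chunks (parity omitted)."""
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    chunk = -(-len(block) // data_disks)  # ceiling division
    return [block[i * chunk:(i + 1) * chunk] for i in range(data_disks)]


def mirror_copies(block, disks):
    """Every disk in a mirror receives the full block."""
    if disks < 1:
        raise ValueError("need at least one disk")
    return [block] * disks


if __name__ == "__main__":
    block = bytes(128 * 1024)
    print([len(c) for c in raidz_split(block, 4)])
    print([len(c) for c in mirror_copies(block, 2)])
```

With four data disks, a 128 KB block becomes four 32 KB chunks in this
model, while each side of a two-way mirror holds all 128 KB.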

It is important to realize that the zfs block checksum is for the 
uncompressed/unsplit original data block and not for some bit of data 
which eventually ended up on a disk.  For example, when raidz is used, 
there is no independent checksum for the data chunks distributed 
across the disks.  The zfs approach assures end-to-end validation and 
avoids having to recompute all data checksums (perhaps incorrectly) 
when doing 'zfs send'.
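
The raidz part of this can be sketched in Python as a single checksum
kept for the whole block, with none kept for the individual chunks.
This is only a toy model: SHA-256 from hashlib stands in for whatever
checksum the pool is configured to use, parity and reconstruction are
left out, and the names write_block and read_block are invented for
the example.

```python
import hashlib


def write_block(block, data_disks):
    """Return (checksum of the whole block, per-disk chunks)."""
    chunk = -(-len(block) // data_disks)  # ceiling division
    chunks = [block[i * chunk:(i + 1) * chunk] for i in range(data_disks)]
    # One checksum for the unsplit block; the chunks carry none.
    return hashlib.sha256(block).digest(), chunks


def read_block(checksum, chunks):
    """Reassemble the block and validate it against the one checksum."""
    block = b"".join(chunks)
    if hashlib.sha256(block).digest() != checksum:
        raise ValueError("checksum mismatch on reassembled block")
    return block


if __name__ == "__main__":
    data = b"example data " * 1000
    csum, parts = write_block(data, 4)
    print(read_block(csum, parts) == data)
```

Damage to any one chunk is only noticed once the block is reassembled
and checked as a whole.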

Zfs metadata sizes are not related to the zfs block size.

Bob
-- 
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/