Date: Thu, 21 Jul 2011 17:03:14 -0500 (CDT)
From: Bob Friesenhahn <bfriesen@simple.dallas.tx.us>
To: Freddie Cash <fjwcash@gmail.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS and large directories - caveat report
Message-ID: <alpine.GSO.2.01.1107211646360.5109@freddy.simplesystems.org>
In-Reply-To: <CAOjFWZ7mj+CxXrzqt-OK3XrXLp4DZHYhcGBewf7shaKRAdv63g@mail.gmail.com>
References: <j09hk8$svj$1@dough.gmane.org> <4E286F1F.6010502@FreeBSD.org> <CAOjFWZ7mj+CxXrzqt-OK3XrXLp4DZHYhcGBewf7shaKRAdv63g@mail.gmail.com>
On Thu, 21 Jul 2011, Freddie Cash wrote:

> The recordsize property in ZFS is the "max" block size used. It is not
> the only block size used for a dataset. ZFS will use any block size from
> 0.5 KB to $recordsize KB, as determined by the size of the file to be
> written (it tries to find the recordsize that most closely matches the
> file size to use the least number of blocks per write).

Except for tail blocks (the last block in a file), the uncompressed data block size will always be the "max" block size. When compression is enabled, the compressed data is likely to be smaller than that "max" block size, and ZFS will then use a smaller block size on disk. This approach minimizes the performance impact of fragmentation, copy on write (COW), and block metadata.

It would not make sense for ZFS to behave as you describe, since files are written starting from scratch, so ZFS has no knowledge of the final file size until the file is completely written (and even then, more data could be written, or the file might be truncated). ZFS could have advance knowledge of a file's size if the application seeked to the ultimate length and wrote something, or used ftruncate() to set the size, but even then the file size can still be arbitrarily changed.

When raidzN is used, the data block is split into smaller chunks which are distributed among the disks. When mirroring is used, full blocks are written to each disk.

It is important to realize that the ZFS block checksum is computed over the uncompressed, unsplit original data block, and not over some bit of data which eventually ended up on a disk. For example, when raidz is used, there is no independent checksum for the data chunks distributed across the disks. The ZFS approach assures end-to-end validation and avoids having to recompute all data checksums (perhaps incorrectly) when doing 'zfs send'.

ZFS metadata sizes are not related to the ZFS block size.
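As a rough illustration of the tail-block rule described above, here is a small Python sketch. It is a simplified model of the behavior Bob describes (full recordsize blocks except for the tail, which is rounded up to the 512-byte minimum), not actual ZFS code, and the 128 KiB default and helper name are assumptions for the example:

```python
RECORDSIZE = 128 * 1024  # assumed default "max" block size (128 KiB)
SECTOR = 512             # smallest ZFS block size (0.5 KB)

def uncompressed_block_sizes(file_size, recordsize=RECORDSIZE):
    """Model the logical (uncompressed) block sizes used for a file.

    Every block is the full recordsize except possibly the tail block,
    which is rounded up to a multiple of the 512-byte sector size.
    """
    if file_size == 0:
        return []
    full, tail = divmod(file_size, recordsize)
    sizes = [recordsize] * full
    if tail:
        # Tail block: rounded up to the sector size, not to recordsize.
        sizes.append(((tail + SECTOR - 1) // SECTOR) * SECTOR)
    return sizes

# A 300 KiB file: two full 128 KiB records plus a 44 KiB tail block.
print(uncompressed_block_sizes(300 * 1024))  # [131072, 131072, 45056]
# A 1000-byte file: a single block rounded up to 1024 bytes.
print(uncompressed_block_sizes(1000))        # [1024]
```

Note that compression can further shrink what lands on disk, so these are the logical sizes the checksum covers, not necessarily the on-disk allocation.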
Bob
--
Bob Friesenhahn
bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/