Date: Wed, 12 Oct 2011 21:38:02 +0300
From: Daniel Kalchev <daniel@digsys.bg>
To: Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc: freebsd-fs@freebsd.org
Subject: Re: AF (4096 byte sector) drives: Can you mix/match in a ZFS pool?
Message-ID: <C9D5BB73-37C6-42FA-88BA-78DA2A4780B9@digsys.bg>
In-Reply-To: <20111012172912.GA27013@icarus.home.lan>
References: <4E95AE08.7030105@lerctr.org> <20111012155938.GA24649@icarus.home.lan> <4E95C546.70904@digsys.bg> <20111012172912.GA27013@icarus.home.lan>
On Oct 12, 2011, at 20:29, Jeremy Chadwick wrote:

>> The gnop trick is used not because you will ask a 512-byte sector
>> drive to write 8 sectors with one I/O, but because you may ask a
>> 4096-byte sector drive to write only 512 bytes -- which for the
>> drive means it has to read 4096 bytes, modify 512 of these bytes and
>> write back 4096 bytes.
>
> If I'm reading this correctly, you're effectively stating ashift
> actually just defines (or helps in calculating) an LBA offset for the
> start of the pool-related data on that device? "ashift" seems like a
> badly-named term/variable for what this does, but oh well.

ashift defines the minimum block size of the vdev. The name is fine, I believe, as it describes how one gets a power-of-2 size (by shifting 1 left that many times) :-)

>> The proper way to handle this is to create your zpool with 4096-byte
>> alignment, that is, for the time being by using the above gnop
>> 'hack'.
>
> ...which brings into question why this is needed at all, meaning, why
> the ZFS code cannot be changed to default to an ashift value that's
> calculated as 12 (or equivalent) regardless of 512-byte or 4096-byte
> sector drives.

Currently the ZFS block size ranges from 512 bytes to 128 kilobytes; that corresponds to an ashift of 9. With an ashift of 12, the minimum block size becomes 4k while the maximum stays 128k.

> How was this addressed on Solaris/OpenSolaris?

I don't think they address it.

>> There should be no implications to having one vdev with 512 byte
>> alignment and another with 4096 byte alignment. ZFS is smart enough
>> to issue minimum of 512 byte writes to the former and 4096 bytes to
>> the latter thus not creating any bottleneck.
>
> How does ZFS determine this? I was under the impression that this
> behaviour was determined by (or "assisted by") ashift.

ZFS has a piece of data to write, say a 20-kbyte block.
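To make the shift arithmetic concrete, here is a small sketch (plain shell arithmetic, nothing ZFS-specific; the read-modify-write figure is the generic cost for an Advanced Format drive, not a measured number):

```shell
# ashift is the base-2 logarithm of the vdev's minimum block size,
# so the minimum block size is 1 shifted left by ashift bits.
echo "ashift=9  -> $(( 1 << 9 )) bytes"    # 512-byte minimum block
echo "ashift=12 -> $(( 1 << 12 )) bytes"   # 4096-byte minimum block

# A 512-byte logical write to a 4096-byte-sector drive still dirties a
# whole physical sector: the drive reads 4096 bytes, patches 512 of
# them, and writes 4096 bytes back (read-modify-write).
rmw_bytes=$(( 4096 + 4096 ))
echo "RMW cost: $rmw_bytes bytes transferred for a 512-byte write"
```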
Suppose you have four vdevs: two with ashift=9 (512-byte blocks) and two with ashift=12 (4096-byte blocks). All other issues ignored (equal-size vdevs, full to the same capacity, etc.), the minimum it can write across all four is 9KB (512+512+4096+4096). Apparently ZFS wants to fill all vdevs equally, so it will likely issue one 4k write to vdev1, one 4k write to vdev2, two 512b writes to vdev3 and two 512b writes to vdev4.

If, for example, it had 16k to write, it would issue one 4k I/O to each of the 4k vdevs and 4 x 512b I/Os (or a single 4k write, depending on the layering abstraction) to each of the 512b vdevs.

So yes, it is assisted by ashift.

But for the time being, you need to tell ZFS what ashift to use when creating the vdevs, because today's 4k drives lie about their geometry and report 512-byte sectors. As mentioned, there are patches for FreeBSD to 'discover' this behaviour. Another approach is via gnop -- but only at vdev creation time. I haven't seen anything like this for Solaris.

Daniel
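For reference, the gnop trick mentioned above goes roughly like this (an administrative sketch, not something to run blindly: it assumes a hypothetical single-disk pool on ada0, and of course needs root on FreeBSD):

```sh
# Create a transparent gnop provider that reports 4096-byte sectors.
gnop create -S 4096 /dev/ada0

# Create the pool on the .nop device; ZFS sees 4k sectors and picks
# ashift=12 for the vdev.
zpool create tank /dev/ada0.nop

# The ashift is stamped into the vdev label at creation time, so the
# gnop layer can be discarded afterwards.
zpool export tank
gnop destroy /dev/ada0.nop
zpool import tank
```

Afterwards something like `zdb -C tank | grep ashift` should show ashift: 12 for the vdev.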