Date: Wed, 6 Aug 2014 19:30:50 -0400
From: Paul Kraus <paul@kraus-haus.org>
To: Scott Bennett <bennett@sdf.org>, FreeBSD Questions !!!! <freebsd-questions@freebsd.org>
Cc: freebsd@qeng-ho.org
Subject: ZFS RAIDz space lost to parity WAS: raid5 vs. ZFS raidz
Message-ID: <D5B38FF5-DAE7-4CFE-B0A2-A2B2D46C5BE5@kraus-haus.org>
In-Reply-To: <201408060556.s765uKJA026937@sdf.org>
References: <201408020621.s726LsiA024208@sdf.org> <alpine.BSF.2.11.1408020356250.1128@wonkity.com> <53DCDBE8.8060704@qeng-ho.org> <201408060556.s765uKJA026937@sdf.org>
On Aug 6, 2014, at 1:56, Scott Bennett <bennett@sdf.org> wrote:

> Arthur Chance <freebsd@qeng-ho.org> wrote:
>> Quite right. If you have N disks in a RAIDZx configuration, the fraction
>> used for data is (N-x)/N and the fraction for parity is x/N. There's
>> always overhead for the file system bookkeeping of course, but that's
>> not specific to ZFS or RAID.

But ZFS does NOT use fixed width stripes across the devices in the RAIDz<n> vdev. The stripe size changes based on the number of devices and the size of the write operation. ZFS adds parity and padding to make the data fit across the devices.

> I wonder if what varies is the amount of space taken up by the
> checksums. If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change. Enough to matter? Maybe.

Nope, the size of the checksum does NOT vary with vdev configuration.

Going back to Matt's blog again (and I agree that his use of the term "n-sector block" is confusing):

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don't just look at the charts :-) My summary is below and may help folks to better understand Matt's text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT calculate parity per stripe across devices, but on a write-by-write basis. Matt linked to a descriptive chart:

http://blog.delphix.com/matt/files/2014/06/RAIDZ.png

The chart assumes a 5 device RAIDz1. Each color is a different write operation (remember that ZFS is copy-on-write, so every write is a new write; existing data on disk is never modified in place).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming 512B disk blocks, that is 4KB of data and 1KB of parity, i.e. a 4KB write operation.

The yellow write is a 1.5KB write (3 data blocks) plus 1 parity.

The green write is the same as the yellow, just aligned differently.

Note that NOT all columns (drives) are involved in every write (and later read) operation.

The brown write is one data block (512B) plus one parity.

The light purple write is 14 data blocks (7KB) plus 4 parity.

Quoting directly from Matt:

"A 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1 parity + 3 data (e.g. the blue block in rows 9-12). Note that if there are several blocks sharing what would traditionally be thought of as a single 'stripe', there will be multiple parity blocks in the 'stripe'."

"RAID-Z also requires that each allocation be a multiple of (p+1), so that when it is freed it does not leave a free segment which is too small to be used (i.e. too small to fit even a single sector of data plus p parity sectors, e.g. the light blue block at left in rows 8-9 with 1 parity + 2 data + 1 padding). Therefore, RAID-Z requires a bit more space for parity and overhead than RAID-4/5/6."

This leads to the spreadsheet:

https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is the filesystem block size in disk sectors (512B sectors), so it goes from 0.5KB to 128KB filesystem block size (recordsize is the maximum, which you set when you tune the zfs dataset; zfs can and will write less than full records). The row across the top is the number of devices in the RAIDz1 vdev (see the other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are writing. If you are using a database with an 8KB recordsize (16 disk sectors) and you have 6 devices per vdev, then you will lose 20% of the raw space to parity (plus additional space for checksums and metadata). The chart further down (rows 29 through 37) shows the same data, but just for the powers-of-2 increments.
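Since the per-write parity and padding arithmetic may be easier to follow as code than as a chart, here is a rough back-of-the-envelope sketch of the allocation rule as I read Matt's description. This is purely my own illustration (raidz_alloc is a made-up name, not anything in the actual ZFS code), but it reproduces the examples from the blog and the 20% figure above:

import math

def raidz_alloc(data_sectors, ndevs, nparity):
    """Sketch of RAIDZ space accounting, per Matt's blog post.

    data_sectors -- size of the (post-compression) block, in disk sectors
    ndevs        -- number of devices in the RAIDZ vdev
    nparity      -- 1, 2 or 3 for RAIDZ1/2/3

    Returns the total sectors allocated: data + parity + padding.
    """
    # Each row of the write spans at most ndevs sectors, nparity of which
    # are parity, so at most (ndevs - nparity) data sectors fit per row.
    rows = math.ceil(data_sectors / (ndevs - nparity))
    parity = rows * nparity
    total = data_sectors + parity
    # Allocations are rounded up to a multiple of (nparity + 1) so a freed
    # segment is never too small to hold one data sector plus its parity.
    remainder = total % (nparity + 1)
    padding = 0 if remainder == 0 else (nparity + 1) - remainder
    return total + padding

# Matt's 11-sector block on a 5-wide RAIDZ1: 3 rows, so 3 parity sectors,
# 14 sectors total, no padding needed.
assert raidz_alloc(11, 5, 1) == 14

# The light blue block: 2 data sectors need 1 parity sector, and the
# 3-sector allocation is padded up to 4 (a multiple of p+1 = 2).
assert raidz_alloc(2, 5, 1) == 4

# The 8KB record (16 sectors) on 6 devices from above: 20 sectors are
# allocated, 4 of them parity, hence 20% lost.
assert raidz_alloc(16, 6, 1) == 20

Note that data_sectors is the size of the block after compression; ZFS compresses before it allocates, which is part of why Matt says to enable compression.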
So, as Matt says, the more devices you add to a RAIDz vdev, the more net capacity you will have, at the expense of performance. Quoting Matt's opening:

"TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases."

and his summary at the end:

"The strongest valid recommendation based on exact fitting of blocks into stripes is the following: If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3."

Note that you would ONLY use recordsize = 4KB or 8KB if you knew that your workload was ONLY 4KB or 8KB blocks of data (a database).

and finally:

"To summarize: Use RAID-Z. Not too wide. Enable compression."
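To put some numbers on "if you need more usable space, use more disks per stripe", here is the raidz_alloc() sketch from earlier in this message run for an 8KB record (16 sectors of 512B) at a few RAIDZ1 widths. Again, this is my own arithmetic following Matt's description, not output from ZFS itself:

for ndevs in (3, 4, 5, 6, 10):
    total = raidz_alloc(16, ndevs, nparity=1)
    lost = total - 16
    print("%2d disks: %2d sectors allocated, %4.1f%% lost to parity+padding"
          % (ndevs, total, 100.0 * lost / total))

# Prints:
#  3 disks: 24 sectors allocated, 33.3% lost to parity+padding
#  4 disks: 22 sectors allocated, 27.3% lost to parity+padding
#  5 disks: 20 sectors allocated, 20.0% lost to parity+padding
#  6 disks: 20 sectors allocated, 20.0% lost to parity+padding
# 10 disks: 18 sectors allocated, 11.1% lost to parity+padding

The wider the vdev, the smaller the slice lost to parity, which is exactly the IOPS vs. usable space trade-off Matt describes.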
--
Paul Kraus
paul@kraus-haus.org