Date:      Wed, 6 Aug 2014 19:30:50 -0400
From:      Paul Kraus <paul@kraus-haus.org>
To:        Scott Bennett <bennett@sdf.org>, FreeBSD Questions !!!! <freebsd-questions@freebsd.org>
Cc:        freebsd@qeng-ho.org
Subject:   ZFS RAIDz space lost to parity WAS: raid5 vs. ZFS raidz
Message-ID:  <D5B38FF5-DAE7-4CFE-B0A2-A2B2D46C5BE5@kraus-haus.org>
In-Reply-To: <201408060556.s765uKJA026937@sdf.org>
References:  <201408020621.s726LsiA024208@sdf.org> <alpine.BSF.2.11.1408020356250.1128@wonkity.com> <53DCDBE8.8060704@qeng-ho.org> <201408060556.s765uKJA026937@sdf.org>

On Aug 6, 2014, at 1:56, Scott Bennett <bennett@sdf.org> wrote:

> Arthur Chance <freebsd@qeng-ho.org> wrote:

>> Quite right. If you have N disks in a RAIDZx configuration, the fraction
>> used for data is (N-x)/N and the fraction for parity is x/N. There's
>> always overhead for the file system bookkeeping of course, but that's
>> not specific to ZFS or RAID.

But ZFS does NOT use fixed-width stripes across the devices in the RAIDz<n> vdev. The stripe size changes based on the number of devices and the size of the write operation. ZFS adds parity and padding to make the data fit across the devices.

>     I wonder if what varies is the amount of space taken up by the
> checksums.  If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change.  Enough to matter?  Maybe.

Nope, the size of the checksum does NOT vary with vdev configuration.

Going back to Matt's blog again (and I agree that his use of the term "n-sector block" is confusing):

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don't just look at the charts :-) My summary is below and may help folks to better understand Matt's text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT calculate parity per stripe across devices, but on a write-by-write basis. Matt linked to a descriptive chart: http://blog.delphix.com/matt/files/2014/06/RAIDZ.png … The chart assumes a 5-device RAIDz1. Each color is a different write operation (remember that ZFS is copy-on-write, so every write is a new write, no modifying existing data on disk).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming 512B disk blocks, that is 4KB of data and 1KB of parity; this is a 4KB write operation.

The yellow write is a 1.5KB write (3 data blocks) and 1 parity.

The green is the same as the yellow, just aligned differently.

Note that NOT all columns (drives) are involved in every write (and later read) operation.

The brown write is one data block (512B) and one parity.

The light purple write is 14 data blocks (7KB) and 4 parity.

Quoting directly from Matt:

An 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1 parity + 3 data (e.g. the blue block in rows 9-12). Note that if there are several blocks sharing what would traditionally be thought of as a single "stripe", there will be multiple parity blocks in the "stripe".

RAID-Z also requires that each allocation be a multiple of (p+1), so that when it is freed it does not leave a free segment which is too small to be used (i.e. too small to fit even a single sector of data plus p parity sectors – e.g. the light blue block at left in rows 8-9 with 1 parity + 2 data + 1 padding). Therefore, RAID-Z requires a bit more space for parity and overhead than RAID-4/5/6.
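
Putting Matt's two rules together (parity is added per row of up to (N - p) data sectors, and the whole allocation is rounded up to a multiple of (p + 1)), the per-write overhead is easy to compute. Here is a minimal Python sketch of my own; the names are mine and it is not the actual ZFS code, but it reproduces the examples above and the colored writes in the chart:

import math

def raidz_alloc(data_sectors, ndisks, nparity):
    """Sectors a RAIDZ vdev allocates for one logical block (write).

    data_sectors -- size of the block in 512B sectors
    ndisks       -- number of devices in the RAIDz vdev
    nparity      -- 1, 2 or 3 for RAIDz1/2/3

    Sketch based on Matt's description, not the ZFS source.
    Returns (data, parity, padding) sector counts.
    """
    data_cols = ndisks - nparity
    # one set of parity sectors per row of up to (ndisks - nparity) data sectors
    parity = nparity * math.ceil(data_sectors / data_cols)
    total = data_sectors + parity
    # round the whole allocation up to a multiple of (nparity + 1)
    padding = -total % (nparity + 1)
    return data_sectors, parity, padding

# 5-wide RAIDz1, the examples above:
print(raidz_alloc(11, 5, 1))   # (11, 3, 0) -> the blue 11-sector block
print(raidz_alloc(2, 5, 1))    # (2, 1, 1)  -> light blue: 1 parity + 2 data + 1 padding
print(raidz_alloc(8, 5, 1))    # (8, 2, 0)  -> orange
print(raidz_alloc(14, 5, 1))   # (14, 4, 0) -> light purple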

This leads to the spreadsheet: https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is the filesystem block size in disk sectors (512B sectors), so it goes from a 0.5KB to a 128KB filesystem block size (recordsize is the maximum you set when you tune the ZFS dataset; ZFS can and will write less than full records).

The column across the top is the number of devices in the RAIDz1 vdev (see the other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are writing. If you are using a database with an 8KB recordsize (16 disk sectors) and you have 6 devices per vdev, then you will lose 20% of the raw space to parity (plus additional space for checksums and metadata). The chart further down (rows 29 through 37) shows the same data, but just for the power-of-2 increments.
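
For what it's worth, that 20% figure falls straight out of the same arithmetic as the sketch above (a back-of-the-envelope check, assuming 512B sectors):

import math

sectors = 8192 // 512                        # an 8KB record is 16 data sectors
data_cols = 6 - 1                            # 6-wide RAIDz1 -> 5 data columns per row
parity = 1 * math.ceil(sectors / data_cols)  # 4 parity sectors
total = sectors + parity                     # 20 sectors, already a multiple of (p+1)=2
print(parity / total)                        # 0.2 -> 20% of the raw space goes to parity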

So, as Matt says, the more devices you add to a RAIDz vdev, the more net capacity you will have, at the expense of performance. Quoting Matt's opening:

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the amount of space you are willing to devote to parity information. If you need more IOPS, use fewer disks per stripe. If you need more usable space, use more disks per stripe. Trying to optimize your RAID-Z stripe width based on exact numbers is irrelevant in nearly all cases.

and his summary at the end:

The strongest valid recommendation based on exact fitting of blocks into stripes is the following: If you are using RAID-Z with 512-byte sector devices with recordsize=4K or 8K and compression=off (but you probably want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6 disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

Note that you would ONLY use recordsize=4KB or 8KB if you knew that your workload was ONLY 4KB or 8KB blocks of data (a database).
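
One way to see where the 5-disk floor for RAIDZ1 comes from (my arithmetic, not a quote from the blog): an 8KB record is 16 sectors; on a 5-wide RAIDz1 each row holds 4 data + 1 parity, so the record takes exactly 4 rows, 4 parity sectors, no padding, and the overhead is the ideal 1/5 = 20%. On a 4-wide RAIDz1 the same record needs ceil(16/3) = 6 parity sectors, 22 sectors total, so roughly 27% overhead versus the ideal 25%. A quick check with the same sketch as above:

import math

def overhead(data_sectors, ndisks, nparity=1):
    # fraction of the allocation that is not user data (parity + padding)
    parity = nparity * math.ceil(data_sectors / (ndisks - nparity))
    total = data_sectors + parity
    total += -total % (nparity + 1)   # pad to a multiple of (p+1)
    return 1 - data_sectors / total

print(overhead(16, 5))   # ~0.20  -> exact fit, matches the ideal 1/5
print(overhead(16, 4))   # ~0.273 -> worse than the ideal 1/4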

and finally:

To summarize: Use RAID-Z. Not too wide. Enable compression.

--
Paul Kraus
paul@kraus-haus.org



