Subject: ZFS RAIDz space lost to parity WAS: raid5 vs. ZFS raidz
From: Paul Kraus <paul@kraus-haus.org>
Date: Wed, 6 Aug 2014 19:30:50 -0400
To: Scott Bennett, FreeBSD Questions <freebsd-questions@freebsd.org>
Cc: freebsd@qeng-ho.org

On Aug 6, 2014, at 1:56, Scott Bennett wrote:

> Arthur Chance wrote:
>
>> Quite right. If you have N disks in a RAIDZx configuration, the
>> fraction used for data is (N-x)/N and the fraction for parity is x/N.
>> There's always overhead for the file system bookkeeping of course, but
>> that's not specific to ZFS or RAID.

But ZFS does NOT use fixed-width stripes across the devices in the RAIDz
vdev. The stripe size changes based on the number of devices and the size
of the write operation. ZFS adds parity and padding to make the data fit
among the number of devices.

> I wonder if what varies is the amount of space taken up by the
> checksums. If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change. Enough to matter? Maybe.
Nope, the size of the checksum does NOT vary with vdev configuration.

Going back to Matt's blog again (and I agree that his use of the term
"n-sector block" is confusing):

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don't just look at the charts :-) My summary is below and
may help folks to better understand Matt's text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT
calculate parity per stripe across devices, but on a write-by-write
basis. Matt linked to a descriptive chart:

http://blog.delphix.com/matt/files/2014/06/RAIDZ.png

The chart assumes a 5-device RAIDz1. Each color is a different write
operation (remember that ZFS is copy-on-write, so every write is a new
write; existing data on disk is never modified in place).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming
512B disk blocks, that is 4KB of data and 1KB of parity, i.e. a 4KB write
operation.

The yellow write is a 1.5KB write (3 data blocks) and 1 parity.

The green is the same as the yellow, just aligned differently.

Note that NOT all columns (drives) are involved in every write (and later
read) operation.

The brown write is one data block (512B) and one parity.

The light purple write is 14 data blocks (7KB) and 4 parity.

Quoting directly from Matt:

A 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1
parity + 3 data (e.g. the blue block in rows 9-12). Note that if there
are several blocks sharing what would traditionally be thought of as a
single "stripe", there will be multiple parity blocks in the "stripe".

RAID-Z also requires that each allocation be a multiple of (p+1), so that
when it is freed it does not leave a free segment which is too small to
be used (i.e. too small to fit even a single sector of data plus p parity
sectors – e.g. the light blue block at left in rows 8-9 with 1 parity +
2 data + 1 padding). Therefore, RAID-Z requires a bit more space for
parity and overhead than RAID-4/5/6.

This leads to the spreadsheet:

https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is the filesystem block size in disk sectors
(512B sectors), so it runs from 0.5KB to 128KB filesystem block size
(recordsize is the maximum, set when you tune the ZFS dataset; ZFS can
and will write less than full records).

The row across the top is the number of devices in the RAIDz1 vdev (see
the other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are
writing. If you are using a database with an 8KB recordsize (16 disk
sectors) and you have 6 devices per vdev, then you will lose 20% of the
raw space to parity (plus additional space for checksums and metadata).
The chart further down (rows 29 through 37) shows the same data, but just
for the power-of-2 increments.

So, as Matt says, the more devices you add to a RAIDz vdev, the more net
capacity you will have, at the expense of performance. Quoting Matt's
opening:

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the
amount of space you are willing to devote to parity information. If you
need more IOPS, use fewer disks per stripe. If you need more usable
space, use more disks per stripe. Trying to optimize your RAID-Z stripe
width based on exact numbers is irrelevant in nearly all cases.
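If you want to check the spreadsheet's numbers yourself, the allocation
rule Matt describes (one set of parity sectors per run of
(devices - parity) data sectors, with the total padded up to a multiple
of parity + 1) is easy to reproduce. The short Python sketch below is my
own illustration of that arithmetic, not code from the blog or from ZFS;
the function name and the example values are mine.

import math

def raidz_alloc_sectors(data_sectors, ndevs, nparity):
    """Sectors a RAIDz vdev allocates for one block, per my reading
    of Matt's blog and spreadsheet (not code lifted from ZFS)."""
    # One set of parity sectors for each run of (ndevs - nparity)
    # data sectors in the write.
    parity = nparity * math.ceil(data_sectors / (ndevs - nparity))
    total = data_sectors + parity
    # Pad the allocation to a multiple of (nparity + 1) so a freed
    # segment is never too small to hold one data sector plus parity.
    return math.ceil(total / (nparity + 1)) * (nparity + 1)

# The chart's 5-device RAIDz1 examples (512B sectors):
for data in (8, 3, 1, 14, 11, 2):
    total = raidz_alloc_sectors(data, ndevs=5, nparity=1)
    print(f"{data:2d} data sectors -> {total:2d} allocated "
          f"({total - data} parity/padding)")

# The 8KB-record database example: 16 data sectors on a 6-wide RAIDz1.
total = raidz_alloc_sectors(16, ndevs=6, nparity=1)
print(f"8KB record, 6-wide RAIDz1: {total} sectors, "
      f"{100 * (total - 16) / total:.0f}% lost to parity")

Running that reproduces the chart: 8+2, 3+1, 1+1, 14+4, 11+3, and 2+2
(the last being the 1 parity + 2 data + 1 padding case), and the 6-wide,
8KB-record case comes out to 20 sectors, i.e. the 20% parity figure
above. Checksums and metadata are extra, as noted.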
And Matt's summary at the end:

The strongest valid recommendation based on exact fitting of blocks into
stripes is the following: If you are using RAID-Z with 512-byte sector
devices with recordsize=4K or 8K and compression=off (but you probably
want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6
disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

Note that you would ONLY use recordsize = 4KB or 8KB if you knew that
your workload was ONLY 4KB or 8KB blocks of data (a database).

And finally:

To summarize: Use RAID-Z. Not too wide. Enable compression.

--
Paul Kraus
paul@kraus-haus.org