Subject: ZFS RAIDz space lost to parity WAS: raid5 vs. ZFS raidz
From: Paul Kraus <paul@kraus-haus.org>
Date: Wed, 6 Aug 2014 19:30:50 -0400
To: Scott Bennett, FreeBSD Questions <freebsd-questions@freebsd.org>
Cc: freebsd@qeng-ho.org

On Aug 6, 2014, at 1:56, Scott Bennett wrote:

> Arthur Chance wrote:
>
>> Quite right. If you have N disks in a RAIDZx configuration, the
>> fraction used for data is (N-x)/N and the fraction for parity is x/N.
>> There's always overhead for the file system bookkeeping of course, but
>> that's not specific to ZFS or RAID.

But ZFS does NOT use fixed-width stripes across the devices in the RAIDz
vdev. The stripe size changes based on the number of devices and the size
of the write operation. ZFS adds parity and padding to make the data fit
among the number of devices.

> I wonder if what varies is the amount of space taken up by the
> checksums. If there's a checksum for each block, then the block size
> would change the fraction of the space lost to checksums, and the parity
> for the checksums would thus also change. Enough to matter? Maybe.
Nope, the size of the checksum does NOT vary with vdev configuration.

Going back to Matt's blog again (and I agree that his use of the term
"n-sector block" is confusing):

http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Read the blog, don't just look at the charts :-) My summary is below and
may help folks to better understand Matt's text.

According to the blog (and I trust Matt in this regard), RAIDz does NOT
calculate parity per stripe across devices, but on a write-by-write
basis. Matt linked to a descriptive chart:

http://blog.delphix.com/matt/files/2014/06/RAIDZ.png

The chart assumes a 5-device RAIDz1. Each color is a different write
operation (remember that ZFS is copy-on-write, so every write is a new
write; existing data on disk is never modified in place).

The orange write consists of 8 data blocks and 2 parity blocks. Assuming
512B disk blocks, that is 4KB of data and 1KB of parity, i.e. a 4KB write
operation.

The yellow write is a 1.5KB write (3 data blocks) and 1 parity.

The green is the same as the yellow, just aligned differently.

Note that NOT all columns (drives) are involved in every write (and later
read) operation.

The brown write is one data block (512B) and one parity.

The light purple write is 14 data blocks (7KB) and 4 parity.

Quoting directly from Matt:

A 11-sector block will use 1 parity + 4 data + 1 parity + 4 data + 1
parity + 3 data (e.g. the blue block in rows 9-12). Note that if there
are several blocks sharing what would traditionally be thought of as a
single "stripe", there will be multiple parity blocks in the "stripe".

RAID-Z also requires that each allocation be a multiple of (p+1), so that
when it is freed it does not leave a free segment which is too small to
be used (i.e. too small to fit even a single sector of data plus p parity
sectors – e.g. the light blue block at left in rows 8-9 with 1 parity +
2 data + 1 padding). Therefore, RAID-Z requires a bit more space for
parity and overhead than RAID-4/5/6.

This leads to the spreadsheet:

https://docs.google.com/a/delphix.com/spreadsheets/d/1tf4qx1aMJp8Lo_R6gpT689wTjHv6CGVElrPqTA0w_ZY/edit?pli=1#gid=2126998674

The column down the left is the filesystem block size in disk sectors
(512B sectors), so it runs from 0.5KB to 128KB filesystem block size
(recordsize is the maximum, set when you tune the ZFS dataset; ZFS can
and will write less than full records).

The row across the top is the number of devices in the RAIDz1 vdev (see
the other sheets in the workbook for RAIDz2 and RAIDz3).

Keep in mind that the left column is also the size of the data you are
writing. If you are using a database with an 8KB recordsize (16 disk
sectors) and you have 6 devices per vdev, then you will lose 20% of the
raw space to parity (plus additional space for checksums and metadata).
The chart further down (rows 29 through 37) shows the same data, but just
for the power-of-2 increments.

So, as Matt says, the more devices you add to a RAIDz vdev, the more net
capacity you will have, at the expense of performance. Quoting Matt's
opening:

TL;DR: Choose a RAID-Z stripe width based on your IOPS needs and the
amount of space you are willing to devote to parity information. If you
need more IOPS, use fewer disks per stripe. If you need more usable
space, use more disks per stripe. Trying to optimize your RAID-Z stripe
width based on exact numbers is irrelevant in nearly all cases.
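If you want to check the spreadsheet's numbers yourself, the allocation
rule Matt describes (one set of parity sectors per run of
(devices - parity) data sectors, with the total padded up to a multiple
of parity + 1) is easy to reproduce. The short Python sketch below is my
own illustration of that arithmetic, not code from the blog or from ZFS;
the function name and the example values are mine.

import math

def raidz_alloc_sectors(data_sectors, ndevs, nparity):
    """Sectors a RAIDz vdev allocates for one block, per my reading
    of Matt's blog and spreadsheet (not code lifted from ZFS)."""
    # One set of parity sectors for each run of (ndevs - nparity)
    # data sectors in the write.
    parity = nparity * math.ceil(data_sectors / (ndevs - nparity))
    total = data_sectors + parity
    # Pad the allocation to a multiple of (nparity + 1) so a freed
    # segment is never too small to hold one data sector plus parity.
    return math.ceil(total / (nparity + 1)) * (nparity + 1)

# The chart's 5-device RAIDz1 examples (512B sectors):
for data in (8, 3, 1, 14, 11, 2):
    total = raidz_alloc_sectors(data, ndevs=5, nparity=1)
    print(f"{data:2d} data sectors -> {total:2d} allocated "
          f"({total - data} parity/padding)")

# The 8KB-record database example: 16 data sectors on a 6-wide RAIDz1.
total = raidz_alloc_sectors(16, ndevs=6, nparity=1)
print(f"8KB record, 6-wide RAIDz1: {total} sectors, "
      f"{100 * (total - 16) / total:.0f}% lost to parity")

Running that reproduces the chart: 8+2, 3+1, 1+1, 14+4, 11+3, and 2+2
(the last being the 1 parity + 2 data + 1 padding case), and the 6-wide,
8KB-record case comes out to 20 sectors, i.e. the 20% parity figure
above. Checksums and metadata are extra, as noted.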
And Matt's summary at the end:

The strongest valid recommendation based on exact fitting of blocks into
stripes is the following: If you are using RAID-Z with 512-byte sector
devices with recordsize=4K or 8K and compression=off (but you probably
want compression=lz4): use at least 5 disks with RAIDZ1; use at least 6
disks with RAIDZ2; and use at least 11 disks with RAIDZ3.

Note that you would ONLY use recordsize = 4KB or 8KB if you knew that
your workload was ONLY 4KB or 8KB blocks of data (a database).

And finally:

To summarize: Use RAID-Z. Not too wide. Enable compression.

--
Paul Kraus
paul@kraus-haus.org