From: "Eric A. Borisch" <eborisch@gmail.com>
Date: Tue, 21 Feb 2017 17:31:32 -0600
Subject: Re: zfs raidz overhead
To: "Eugene M. Zheganin"
Cc: freebsd-fs@freebsd.org

On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin wrote:

> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol [1]
>
> It's the story of a user who found that in some situations ZFS on raidz
> can use up to 200% of the expected space for a zvol. I have also seen this.
> For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
> NAME                 PROPERTY  VALUE  SOURCE
> gamestop/reference1  volsize   2,50T  local
>
> [root@san1:~]# zfs get all gamestop/reference1
> NAME                 PROPERTY              VALUE                  SOURCE
> gamestop/reference1  type                  volume                 -
> gamestop/reference1  creation              Thu Nov 24  9:09 2016  -
> gamestop/reference1  used                  4,38T                  -
> gamestop/reference1  available             1,33T                  -
> gamestop/reference1  referenced            4,01T                  -
> gamestop/reference1  compressratio         1.00x                  -
> gamestop/reference1  reservation           none                   default
> gamestop/reference1  volsize               2,50T                  local
> gamestop/reference1  volblocksize          8K                     -
> gamestop/reference1  checksum              on                     default
> gamestop/reference1  compression           off                    default
> gamestop/reference1  readonly              off                    default
> gamestop/reference1  copies                1                      default
> gamestop/reference1  refreservation        none                   received
> gamestop/reference1  primarycache          all                    default
> gamestop/reference1  secondarycache        all                    default
> gamestop/reference1  usedbysnapshots       378G                   -
> gamestop/reference1  usedbydataset         4,01T                  -
> gamestop/reference1  usedbychildren        0                      -
> gamestop/reference1  usedbyrefreservation  0                      -
> gamestop/reference1  logbias               latency                default
> gamestop/reference1  dedup                 off                    default
> gamestop/reference1  mlslabel                                     -
> gamestop/reference1  sync                  standard               default
> gamestop/reference1  refcompressratio      1.00x                  -
> gamestop/reference1  written               4,89G                  -
> gamestop/reference1  logicalused           2,72T                  -
> gamestop/reference1  logicalreferenced     2,49T                  -
> gamestop/reference1  volmode               default                default
> gamestop/reference1  snapshot_limit        none                   default
> gamestop/reference1  snapshot_count        none                   default
> gamestop/reference1  redundant_metadata    all                    default
>
> [root@san1:~]# zpool status gamestop
>   pool: gamestop
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         gamestop    ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>
> errors: No known data errors
>
> Or another server (the overhead in this case isn't as big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
> NAME             PROPERTY              VALUE                  SOURCE
> data/reference1  type                  volume                 -
> data/reference1  creation              Fri Jan  6 11:23 2017  -
> data/reference1  used                  3.82T                  -
> data/reference1  available             13.0T                  -
> data/reference1  referenced            3.22T                  -
> data/reference1  compressratio         1.00x                  -
> data/reference1  reservation           none                   default
> data/reference1  volsize               2T                     local
> data/reference1  volblocksize          8K                     -
> data/reference1  checksum              on                     default
> data/reference1  compression           off                    default
> data/reference1  readonly              off                    default
> data/reference1  copies                1                      default
> data/reference1  refreservation        none                   received
> data/reference1  primarycache          all                    default
> data/reference1  secondarycache        all                    default
> data/reference1  usedbysnapshots       612G                   -
> data/reference1  usedbydataset         3.22T                  -
> data/reference1  usedbychildren        0                      -
> data/reference1  usedbyrefreservation  0                      -
> data/reference1  logbias               latency                default
> data/reference1  dedup                 off                    default
> data/reference1  mlslabel                                     -
> data/reference1  sync                  standard               default
> data/reference1  refcompressratio      1.00x                  -
> data/reference1  written               498K                   -
> data/reference1  logicalused           2.37T                  -
> data/reference1  logicalreferenced     2.00T                  -
> data/reference1  volmode               default                default
> data/reference1  snapshot_limit        none                   default
> data/reference1  snapshot_count        none                   default
> data/reference1  redundant_metadata    all                    default
>
> [root@san01:~]# zpool status data
>   pool: data
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>           raidz1-1  ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>           raidz1-2  ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>
> errors: No known data errors
>
> So my question is: how do I avoid this? Right now I'm experimenting with
> the volblocksize, making it around 64k. I also suspect that such overhead
> may be a consequence of various resizing operations, like extending the
> volsize of the volume or adding new disks to the pool, because I have a
> couple of servers with raidz where the initial disk/volsize configuration
> didn't change, and there the referenced/volsize numbers are pretty close
> to each other.
>
> Eugene.
>
> Links:
> ------
> [1] http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol

It comes down to the zpool's sector size (2^ashift) and the volblocksize.
I'm guessing your old servers are at ashift=9 (512) and the new one is at 12
(4096), likely with 4k drives. This is the smallest/atomic size of reads and
writes to a drive from ZFS.

As described in [1]:

* Allocations need to be a multiple of (p+1) sectors, where p is your parity
  level; for raidz1, p == 1, and allocations need to be in multiples of
  (1+1) = 2 sectors, or 8k (for ashift=12; this is the physical size /
  alignment on the drive).

* Each block also needs enough parity to survive failures, so the allocation
  also depends [2] on the number of drives in the vdev at larger
  block/record sizes.

So considering those requirements, and your zvol with volblocksize=8k and
compression=off, the allocation for one logical 8k block is always composed
physically of two (4k) data sectors, one (p=1) parity sector (4k), and one
padding sector (4k) to satisfy being a multiple of (p+1=)2 sectors, or 16k
of allocated on-disk space -- hence your observed 2x of the data size being
actually allocated. Each of these sectors lands on a different drive. This is
different from the sector-level parity in RAID5. (A small worked example of
this arithmetic is sketched at the end of this message.)

As Matthew Ahrens [1] points out: "Note that setting a small recordsize with
4KB sector devices results in universally poor space efficiency -- RAIDZ-p
is no better than p-way mirrors for recordsize=4K or 8K."

Things you can do (example commands are sketched at the end of this message):

* Use ashift=9 (and perhaps 512-byte-sector drives). The same layout rules
  still apply, but now your 'atomic' size is 512b. You will want to test
  performance.

* Use a larger volblocksize, especially if the filesystem on the zvol uses a
  larger block size. If you aren't performance sensitive, use a larger
  volblocksize even if the hosted filesystem doesn't. (But test this to see
  how performance sensitive you really are! ;) You'll need to use something
  like dd to move data between zvols with different block sizes.

* Enable compression if the contents are compressible (some likely will be).

* Use a pool built from mirrors instead of raidz if you need high-performance
  small blocks while retaining redundancy.

You don't get efficient (better than mirrors) redundancy, performant small
(as in a small multiple of the zpool's sector size) block sizes, and ZFS's
flexibility all at once.

 - Eric

[1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
[2] My spin on Ahrens's spreadsheet:
    https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing
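
[Editorial addendum] To make the allocation arithmetic above concrete, here
is a minimal sh sketch of the rounding rules described in [1] and [2]. The
parameters are assumptions matching the first pool (raidz1, ashift=12, five
disks, volblocksize=8k, compression off), not something taken from the
poster's configs beyond what is shown above:

    #!/bin/sh
    # Sketch of the raidz allocation rule for a single logical block.
    sector=4096     # 2^ashift (assumed ashift=12)
    p=1             # raidz1 parity level
    ndisks=5        # disks in the raidz vdev
    blk=8192        # volblocksize

    # Data sectors needed for the block (ceiling division):
    data=$(( (blk + sector - 1) / sector ))               # -> 2
    # Each raidz row holds up to (ndisks - p) data sectors and gets p parity:
    rows=$(( (data + (ndisks - p) - 1) / (ndisks - p) ))  # -> 1
    parity=$(( p * rows ))                                # -> 1
    total=$(( data + parity ))                            # -> 3 sectors
    # Pad the allocation up to a multiple of (p+1) sectors:
    alloc=$(( ( (total + p) / (p + 1) ) * (p + 1) * sector ))
    echo "${blk}-byte logical block -> ${alloc} bytes allocated"

Run as-is it prints 16384, i.e. the 2x allocation observed for
gamestop/reference1. Setting blk=65536 gives 81920 (about 1.25x), which is
why a larger volblocksize brings the overhead back down toward the nominal
raidz1 parity cost.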
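
[Editorial addendum] And a hedged sketch of the mitigation steps from the
list above. The dataset name reference1_64k is made up for illustration,
lz4 is just one compression choice, and dd over the raw zvol copies the
whole device (including free space); verify ashift and test performance on
scratch data before committing:

    # One way to check the pool's sector size exponent (ashift):
    zdb -C gamestop | grep ashift

    # volblocksize is fixed at creation time, so create a new zvol with a
    # larger block size and copy the data across:
    zfs create -V 2.5T -o volblocksize=64K gamestop/reference1_64k
    dd if=/dev/zvol/gamestop/reference1 \
       of=/dev/zvol/gamestop/reference1_64k bs=1M

    # Optionally enable compression on the new zvol (applies to new writes):
    zfs set compression=lz4 gamestop/reference1_64k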