From owner-freebsd-fs@freebsd.org  Wed Feb 22 21:50:05 2017
Return-Path: <owner-freebsd-fs@freebsd.org>
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0EA63CEA537
 for <freebsd-fs@mailman.ysv.freebsd.org>; Wed, 22 Feb 2017 21:50:05 +0000 (UTC)
 (envelope-from bsd@vink.pl)
Received: from mail-qt0-x22a.google.com (mail-qt0-x22a.google.com
 [IPv6:2607:f8b0:400d:c0d::22a])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id C0A95FA8
 for <freebsd-fs@freebsd.org>; Wed, 22 Feb 2017 21:50:04 +0000 (UTC)
 (envelope-from bsd@vink.pl)
Received: by mail-qt0-x22a.google.com with SMTP id k15so14539135qtg.3
 for <freebsd-fs@freebsd.org>; Wed, 22 Feb 2017 13:50:04 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=vink-pl.20150623.gappssmtp.com; s=20150623;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to
 :cc:content-transfer-encoding;
 bh=TtIleobm99Y2GJcmokq5ON09IgvEfOUND7fifvSNL+g=;
 b=hB5wMhPYO+D2I1+koi5/Hz9JolqF8pigePum19IhZeoGBbxdBNNWY5VXTtLimfVNe6
 yap4ZApZMt5pNjEyPsVpawNK7q+M1DhGmDTzLGARHyrmZ/XafDmhGKHOA+fbgYwTReG3
 Gki6s6OlPcg0jFIU/gRwVBGC6dpWb2PntxJAPjnOa810N3Ybf+j+x87DKqg2j/WZVP6Y
 d2USruXW4bVKN6g7ZfnrFIJmIwD6yNGmeSPZMa2Qq5e14GrUv93IDxzMD2FrvFO/imB1
 zdI39N8F9jnDBO9+EeKTmpOpYvCIM8OBGb9HkFfKziU/eRkHUt2qpWEmpNFn8dvOyuHt
 pjIQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc:content-transfer-encoding;
 bh=TtIleobm99Y2GJcmokq5ON09IgvEfOUND7fifvSNL+g=;
 b=dMFJc0XjGllowx4JbtP04kyg3BD9u9KU3jD8TjRuTz/mdZf6C7ruOtYCBrkIznqnX7
 nIN8Oz9V8GhhT92Rrp//ULdW8Vs/gLb8OGCMn27WFCeq9rGFLwPqbQuJiYn6rnI391LJ
 uc2tiTjFEQ07ZpEeBJwo4+9EawjeWpg2yX/uSnY97tJtqI2h65l5THV0EIVK2VGF9Xhk
 ROr0IxmLl2c0UBpP9JO+UP2ZGjnfALqBubvF3dpJQaMZQQpX3XPDahciFVtXApM3Fht/
 YjXj50lZQHKQWo/L9F+pHK+uMXQINK60yQ+0SUARfR2hICcESeXkferaT7E27AHh5TSw
 /NBw==
X-Gm-Message-State: AMke39kvlEBkTlLgxQOFuKX6DuoLV+tDUbwrUVuCLE4ISLUYY6eSLVBbZkfLeYDcUJ/v0g==
X-Received: by 10.200.38.196 with SMTP id 4mr31003087qtp.96.1487800203173;
 Wed, 22 Feb 2017 13:50:03 -0800 (PST)
Received: from mail-qk0-f172.google.com (mail-qk0-f172.google.com.
 [209.85.220.172])
 by smtp.gmail.com with ESMTPSA id 102sm1488374qkx.49.2017.02.22.13.50.02
 for <freebsd-fs@freebsd.org>
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Wed, 22 Feb 2017 13:50:02 -0800 (PST)
Received: by mail-qk0-f172.google.com with SMTP id x71so15957480qkb.3
 for <freebsd-fs@freebsd.org>; Wed, 22 Feb 2017 13:50:02 -0800 (PST)
X-Received: by 10.55.201.27 with SMTP id q27mr35796548qki.296.1487800202064;
 Wed, 22 Feb 2017 13:50:02 -0800 (PST)
MIME-Version: 1.0
Received: by 10.12.148.170 with HTTP; Wed, 22 Feb 2017 13:50:01 -0800 (PST)
In-Reply-To: <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA+A@mail.gmail.com>
References: <1b54a2fe35407a95edca1f992fa08a71@norman-vivat.ru>
 <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA+A@mail.gmail.com>
From: Wiktor Niesiobedzki <bsd@vink.pl>
Date: Wed, 22 Feb 2017 22:50:01 +0100
X-Gmail-Original-Message-ID: <CAH17caWPRtJVpTQNrqaabtYt7xR+oc-eL87tvea=pXjG12oEJg@mail.gmail.com>
Message-ID: <CAH17caWPRtJVpTQNrqaabtYt7xR+oc-eL87tvea=pXjG12oEJg@mail.gmail.com>
Subject: Re: zfs raidz overhead
To: "Eric A. Borisch" <eborisch@gmail.com>
Cc: "Eugene M. Zheganin" <emz@norma.perm.ru>,
 "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs/>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
 <mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 22 Feb 2017 21:50:05 -0000

I can add to this, that this is not only seen on raidz, but also on
mirror pools, such as this:
# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h22m with 0 errors on Thu Feb  9 06:47:07 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/tank1.eli  ONLINE       0     0     0
            gpt/tank2.eli  ONLINE       0     0     0

errors: No known data errors


When I createted test zvols:
# zfs create -V10gb -o volblocksize=3D8k tank/tst-8k
# zfs create -V10gb -o volblocksize=3D16k tank/tst-16k
# zfs create -V10gb -o volblocksize=3D32k tank/tst-32k
# zfs create -V10gb -o volblocksize=3D64k tank/tst-64k
# zfs create -V10gb -o volblocksize=3D128k tank/tst-128k

# zfs get used tank/tst-8k
NAME         PROPERTY  VALUE  SOURCE
tank/tst-8k  used      10.3G  -
root@kadlubek:~ # zfs get used tank/tst-16k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-16k  used      10.2G  -
root@kadlubek:~ # zfs get used tank/tst-32k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-32k  used      10.1G  -
root@kadlubek:~ # zfs get used tank/tst-64k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-64k  used      10.0G  -
root@kadlubek:~ # zfs get used tank/tst-128k
NAME           PROPERTY  VALUE  SOURCE
tank/tst-128k  used      10.0G  -
root@kadlubek:~ #

So it might be related not only to raidz pools.

I also noted, that snapshots impact used stats far much, than
usedbysnapshot value:
zfs get volsize,used,referenced,compressratio,volblocksize,usedbysnapshots,=
usedbydataset,usedbychildren
tank/dkr-thinpool
NAME               PROPERTY         VALUE      SOURCE
tank/dkr-thinpool  volsize          10G        local
tank/dkr-thinpool  used             12.0G      -
tank/dkr-thinpool  referenced       1.87G      -
tank/dkr-thinpool  compressratio    1.91x      -
tank/dkr-thinpool  volblocksize     64K        -
tank/dkr-thinpool  usedbysnapshots  90.4M      -
tank/dkr-thinpool  usedbydataset    1.87G      -
tank/dkr-thinpool  usedbychildren   0          -


On a 10G volume, filled with 2G of data, and 90M used by snapshosts,
used is 2G. When I destroy the snapshots, used will drop to 10.0G.

Cheers,

Wiktor

2017-02-22 0:31 GMT+01:00 Eric A. Borisch <eborisch@gmail.com>:
> On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz@norma.perm.ru>
> wrote:
>
>
>
> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-
> space-usage-report-for-a-zvol
> [1]
>
> It's a user story who encountered that under some situations zfs on
> raidz could use up to 200% of the space for a zvol.
>
> I have also seen this. For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 volsize 2,50T local
>  [root@san1:~]# zfs get all gamestop/reference1
>  NAME PROPERTY VALUE SOURCE
>  gamestop/reference1 type volume -
>  gamestop/reference1 creation =D1=87=D1=82 =D0=BD=D0=BE=D1=8F=D0=B1. 24 9=
:09 2016 -
>  gamestop/reference1 used 4,38T -
>  gamestop/reference1 available 1,33T -
>  gamestop/reference1 referenced 4,01T -
>  gamestop/reference1 compressratio 1.00x -
>  gamestop/reference1 reservation none default
>  gamestop/reference1 volsize 2,50T local
>  gamestop/reference1 volblocksize 8K -
>  gamestop/reference1 checksum on default
>  gamestop/reference1 compression off default
>  gamestop/reference1 readonly off default
>  gamestop/reference1 copies 1 default
>  gamestop/reference1 refreservation none received
>  gamestop/reference1 primarycache all default
>  gamestop/reference1 secondarycache all default
>  gamestop/reference1 usedbysnapshots 378G -
>  gamestop/reference1 usedbydataset 4,01T -
>  gamestop/reference1 usedbychildren 0 -
>  gamestop/reference1 usedbyrefreservation 0 -
>  gamestop/reference1 logbias latency default
>  gamestop/reference1 dedup off default
>  gamestop/reference1 mlslabel -
>  gamestop/reference1 sync standard default
>  gamestop/reference1 refcompressratio 1.00x -
>  gamestop/reference1 written 4,89G -
>  gamestop/reference1 logicalused 2,72T -
>  gamestop/reference1 logicalreferenced 2,49T -
>  gamestop/reference1 volmode default default
>  gamestop/reference1 snapshot_limit none default
>  gamestop/reference1 snapshot_count none default
>  gamestop/reference1 redundant_metadata all default
>
> [root@san1:~]# zpool status gamestop
>  pool: gamestop
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME STATE READ WRITE CKSUM
>  gamestop ONLINE 0 0 0
>  raidz1-0 ONLINE 0 0 0
>  da6 ONLINE 0 0 0
>  da7 ONLINE 0 0 0
>  da8 ONLINE 0 0 0
>  da9 ONLINE 0 0 0
>  da11 ONLINE 0 0 0
>
>  errors: No known data errors
>
> or, another server (overhead in this case isn't that big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
>  NAME PROPERTY VALUE SOURCE
>  data/reference1 type volume -
>  data/reference1 creation Fri Jan 6 11:23 2017 -
>  data/reference1 used 3.82T -
>  data/reference1 available 13.0T -
>  data/reference1 referenced 3.22T -
>  data/reference1 compressratio 1.00x -
>  data/reference1 reservation none default
>  data/reference1 volsize 2T local
>  data/reference1 volblocksize 8K -
>  data/reference1 checksum on default
>  data/reference1 compression off default
>  data/reference1 readonly off default
>  data/reference1 copies 1 default
>  data/reference1 refreservation none received
>  data/reference1 primarycache all default
>  data/reference1 secondarycache all default
>  data/reference1 usedbysnapshots 612G -
>  data/reference1 usedbydataset 3.22T -
>  data/reference1 usedbychildren 0 -
>  data/reference1 usedbyrefreservation 0 -
>  data/reference1 logbias latency default
>  data/reference1 dedup off default
>  data/reference1 mlslabel -
>  data/reference1 sync standard default
>  data/reference1 refcompressratio 1.00x -
>  data/reference1 written 498K -
>  data/reference1 logicalused 2.37T -
>  data/reference1 logicalreferenced 2.00T -
>  data/reference1 volmode default default
>  data/reference1 snapshot_limit none default
>  data/reference1 snapshot_count none default
>  data/reference1 redundant_metadata all default
>  [root@san01:~]# zpool status gamestop
>  pool: data
>  state: ONLINE
>  scan: none requested
>  config:
>
>  NAME STATE READ WRITE CKSUM
>  data ONLINE 0 0 0
>  raidz1-0 ONLINE 0 0 0
>  da3 ONLINE 0 0 0
>  da4 ONLINE 0 0 0
>  da5 ONLINE 0 0 0
>  da6 ONLINE 0 0 0
>  da7 ONLINE 0 0 0
>  raidz1-1 ONLINE 0 0 0
>  da8 ONLINE 0 0 0
>  da9 ONLINE 0 0 0
>  da10 ONLINE 0 0 0
>  da11 ONLINE 0 0 0
>  da12 ONLINE 0 0 0
>  raidz1-2 ONLINE 0 0 0
>  da13 ONLINE 0 0 0
>  da14 ONLINE 0 0 0
>  da15 ONLINE 0 0 0
>  da16 ONLINE 0 0 0
>  da17 ONLINE 0 0 0
>
>  errors: No known data errors
>
> So my question is - how to avoid it ? Right now I'm experimenting with
> the volblocksize, making it around 64k. I'm also suspecting that such
> overhead may be the subsequence of the various resizing operations, like
> extening the volsize of the volume or adding new disks into the pool,
> because I have a couple of servers with raidz where the initial
> disk/volsize configuration didn't change, and the referenced/volsize
> numbers are pretty much close to each other.
>
> Eugene.
>
> Links:
> ------
> [1]
> http://serverfault.com/questions/512018/strange-zfs-disk-
> space-usage-report-for-a-zvol
>
>
> It comes down to the zpool's sector size (2^ashift) and the volblocksize =
--
> I'm guessing your old servers are at ashift=3D9 (512), and the new one is=
 at
> 12 (4096), likely with 4k drives. This is the smallest/atomic size of rea=
ds
> & writes to a drive from ZFS.
>
> As described in [1]:
>  * Allocations need to be a multiple of (p+1) sectors, where p is your
> parity level; for raidz1, p=3D=3D1, and allocations need to be in multipl=
es of
> (1+1)=3D2 sectors, or 8k (for ashift=3D12; this is the physical size /
> alignment on drive).
>  * It also needs to have enough parity for failures, so it also depends [=
2]
> on number of drives in pool at larger block/record sizes.
>
> So considering those requirements, and your zvol with volblocksize=3D8k a=
nd
> compression=3Doff, allocations for one logical 8k block are always compos=
ed
> physically of two (4k) data sectors, one (p=3D1) parity sector (4k), and =
one
> padding sector (4k) to satisfy being a multiple of (p+1=3D) 2, or 16k
> (allocated on disk space), hence your observed 2x data size being actuall=
y
> allocated. Each of these blocks will be on a different drive. This is
> different from the sector-level parity in RAID5
>
> As Matthew Ahrens [1] points out: "Note that setting a small recordsize
> with 4KB sector devices results in universally poor space efficiency --
> RAIDZ-p is no better than p-way mirrors for recordsize=3D4K or 8K."
>
> Things you can do:
>
>  * Use ashift=3D9 (and perhaps 512-byte sector drives). The same layout r=
ules
> still apply, but now your 'atomic' size is 512b. You will want to test
> performance.
>  * Use a larger volblocksize, especially if the filesystem on the zvol us=
es
> a larger block size. If you aren't performance sensitive, use a larger
> volblocksize even if the hosted filesystem doesn't. (But test this out to
> see how performance sensitive you really are! ;) You'll need to use
> something like dd to move data between different block size zvols.
>  * Enable compression if the contents are compressible (some likely will
> be.)
>  * Use a pool created from mirrors instead of raidz if you need
> high-performance small blocks while retaining redundancy.
>
> You don't get efficient (better than mirrors) redundancy, performant smal=
l
> (as in small multiple of zpool's sector size) block sizes, and zfs's
> flexibility all at once.
>
>  - Eric
>
> [1] https://www.delphix.com/blog/delphix-engineering/zfs-rai
> dz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
> [2] My spin on Ahren's spreadsheet: https://docs.google.com/spread
> sheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=3Dsharing
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"