From owner-freebsd-fs@freebsd.org Wed Feb 22 21:50:05 2017
From: Wiktor Niesiobedzki <bsd@vink.pl>
Date: Wed, 22 Feb 2017 22:50:01 +0100
Subject: Re: zfs raidz overhead
To: "Eric A. Borisch"
Cc: "Eugene M. Zheganin", "freebsd-fs@freebsd.org"
Zheganin" , "freebsd-fs@freebsd.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 22 Feb 2017 21:50:05 -0000 I can add to this, that this is not only seen on raidz, but also on mirror pools, such as this: # zpool status tank pool: tank state: ONLINE scan: scrub repaired 0 in 3h22m with 0 errors on Thu Feb 9 06:47:07 2017 config: NAME STATE READ WRITE CKSUM tank ONLINE 0 0 0 mirror-0 ONLINE 0 0 0 gpt/tank1.eli ONLINE 0 0 0 gpt/tank2.eli ONLINE 0 0 0 errors: No known data errors When I createted test zvols: # zfs create -V10gb -o volblocksize=3D8k tank/tst-8k # zfs create -V10gb -o volblocksize=3D16k tank/tst-16k # zfs create -V10gb -o volblocksize=3D32k tank/tst-32k # zfs create -V10gb -o volblocksize=3D64k tank/tst-64k # zfs create -V10gb -o volblocksize=3D128k tank/tst-128k # zfs get used tank/tst-8k NAME PROPERTY VALUE SOURCE tank/tst-8k used 10.3G - root@kadlubek:~ # zfs get used tank/tst-16k NAME PROPERTY VALUE SOURCE tank/tst-16k used 10.2G - root@kadlubek:~ # zfs get used tank/tst-32k NAME PROPERTY VALUE SOURCE tank/tst-32k used 10.1G - root@kadlubek:~ # zfs get used tank/tst-64k NAME PROPERTY VALUE SOURCE tank/tst-64k used 10.0G - root@kadlubek:~ # zfs get used tank/tst-128k NAME PROPERTY VALUE SOURCE tank/tst-128k used 10.0G - root@kadlubek:~ # So it might be related not only to raidz pools. I also noted, that snapshots impact used stats far much, than usedbysnapshot value: zfs get volsize,used,referenced,compressratio,volblocksize,usedbysnapshots,= usedbydataset,usedbychildren tank/dkr-thinpool NAME PROPERTY VALUE SOURCE tank/dkr-thinpool volsize 10G local tank/dkr-thinpool used 12.0G - tank/dkr-thinpool referenced 1.87G - tank/dkr-thinpool compressratio 1.91x - tank/dkr-thinpool volblocksize 64K - tank/dkr-thinpool usedbysnapshots 90.4M - tank/dkr-thinpool usedbydataset 1.87G - tank/dkr-thinpool usedbychildren 0 - On a 10G volume, filled with 2G of data, and 90M used by snapshosts, used is 2G. When I destroy the snapshots, used will drop to 10.0G. Cheers, Wiktor 2017-02-22 0:31 GMT+01:00 Eric A. Borisch : > On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin > wrote: > > > > Hi. > > There's an interesting case described here: > http://serverfault.com/questions/512018/strange-zfs-disk- > space-usage-report-for-a-zvol > [1] > > It's a user story who encountered that under some situations zfs on > raidz could use up to 200% of the space for a zvol. > > I have also seen this. For instance: > > [root@san1:~]# zfs get volsize gamestop/reference1 > NAME PROPERTY VALUE SOURCE > gamestop/reference1 volsize 2,50T local > [root@san1:~]# zfs get all gamestop/reference1 > NAME PROPERTY VALUE SOURCE > gamestop/reference1 type volume - > gamestop/reference1 creation =D1=87=D1=82 =D0=BD=D0=BE=D1=8F=D0=B1. 
2017-02-22 0:31 GMT+01:00 Eric A. Borisch:
> On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin wrote:
>
> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
> [1]
>
> It's a story from a user who found that in some situations ZFS on
> raidz can use up to 200% of the expected space for a zvol.
>
> I have also seen this. For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
> NAME                 PROPERTY  VALUE  SOURCE
> gamestop/reference1  volsize   2,50T  local
> [root@san1:~]# zfs get all gamestop/reference1
> NAME                 PROPERTY              VALUE                  SOURCE
> gamestop/reference1  type                  volume                 -
> gamestop/reference1  creation              Thu Nov 24  9:09 2016  -
> gamestop/reference1  used                  4,38T                  -
> gamestop/reference1  available             1,33T                  -
> gamestop/reference1  referenced            4,01T                  -
> gamestop/reference1  compressratio         1.00x                  -
> gamestop/reference1  reservation           none                   default
> gamestop/reference1  volsize               2,50T                  local
> gamestop/reference1  volblocksize          8K                     -
> gamestop/reference1  checksum              on                     default
> gamestop/reference1  compression           off                    default
> gamestop/reference1  readonly              off                    default
> gamestop/reference1  copies                1                      default
> gamestop/reference1  refreservation        none                   received
> gamestop/reference1  primarycache          all                    default
> gamestop/reference1  secondarycache        all                    default
> gamestop/reference1  usedbysnapshots       378G                   -
> gamestop/reference1  usedbydataset         4,01T                  -
> gamestop/reference1  usedbychildren        0                      -
> gamestop/reference1  usedbyrefreservation  0                      -
> gamestop/reference1  logbias               latency                default
> gamestop/reference1  dedup                 off                    default
> gamestop/reference1  mlslabel                                     -
> gamestop/reference1  sync                  standard               default
> gamestop/reference1  refcompressratio      1.00x                  -
> gamestop/reference1  written               4,89G                  -
> gamestop/reference1  logicalused           2,72T                  -
> gamestop/reference1  logicalreferenced     2,49T                  -
> gamestop/reference1  volmode               default                default
> gamestop/reference1  snapshot_limit        none                   default
> gamestop/reference1  snapshot_count        none                   default
> gamestop/reference1  redundant_metadata    all                    default
>
> [root@san1:~]# zpool status gamestop
>   pool: gamestop
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         gamestop    ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>
> errors: No known data errors
>
> or, on another server (the overhead in this case isn't that big, but
> still considerable):
>
> [root@san01:~]# zfs get all data/reference1
> NAME             PROPERTY              VALUE                  SOURCE
> data/reference1  type                  volume                 -
> data/reference1  creation              Fri Jan  6 11:23 2017  -
> data/reference1  used                  3.82T                  -
> data/reference1  available             13.0T                  -
> data/reference1  referenced            3.22T                  -
> data/reference1  compressratio         1.00x                  -
> data/reference1  reservation           none                   default
> data/reference1  volsize               2T                     local
> data/reference1  volblocksize          8K                     -
> data/reference1  checksum              on                     default
> data/reference1  compression           off                    default
> data/reference1  readonly              off                    default
> data/reference1  copies                1                      default
> data/reference1  refreservation        none                   received
> data/reference1  primarycache          all                    default
> data/reference1  secondarycache        all                    default
> data/reference1  usedbysnapshots       612G                   -
> data/reference1  usedbydataset         3.22T                  -
> data/reference1  usedbychildren        0                      -
> data/reference1  usedbyrefreservation  0                      -
> data/reference1  logbias               latency                default
> data/reference1  dedup                 off                    default
> data/reference1  mlslabel                                     -
> data/reference1  sync                  standard               default
> data/reference1  refcompressratio      1.00x                  -
> data/reference1  written               498K                   -
> data/reference1  logicalused           2.37T                  -
> data/reference1  logicalreferenced     2.00T                  -
> data/reference1  volmode               default                default
> data/reference1  snapshot_limit        none                   default
> data/reference1  snapshot_count        none                   default
> data/reference1  redundant_metadata    all                    default
> [root@san01:~]# zpool status data
>   pool: data
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>           raidz1-1  ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>           raidz1-2  ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>
> errors: No known data errors
>
> So my question is: how can I avoid this? Right now I'm experimenting
> with the volblocksize, making it around 64k. I also suspect that such
> overhead may be a consequence of various resizing operations, like
> extending the volsize of the volume or adding new disks to the pool,
> because I have a couple of servers with raidz where the initial
> disk/volsize configuration didn't change, and there the
> referenced/volsize numbers are pretty close to each other.
>
> Eugene.
>
> Links:
> ------
> [1]
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
>
>
> It comes down to the zpool's sector size (2^ashift) and the
> volblocksize -- I'm guessing your old servers are at ashift=9 (512),
> and the new one is at 12 (4096), likely with 4k drives. This is the
> smallest/atomic size of reads & writes to a drive from ZFS.
>
> As described in [1]:
>  * Allocations need to be a multiple of (p+1) sectors, where p is your
>    parity level; for raidz1, p==1, and allocations need to be in
>    multiples of (1+1)=2 sectors, or 8k (for ashift=12; this is the
>    physical size / alignment on the drives).
>  * A block also needs enough parity to survive failures, so at larger
>    block/record sizes the overhead also depends [2] on the number of
>    drives in the pool.
>
> So considering those requirements, and your zvol with volblocksize=8k
> and compression=off, the allocation for one logical 8k block is always
> composed physically of two 4k data sectors, one (p=1) parity sector
> (4k), and one padding sector (4k) to satisfy being a multiple of
> (p+1=) 2 -- 16k of allocated on-disk space in total, hence your
> observed 2x of the data size actually being allocated. Each of these
> sectors lands on a different drive. This is different from the
> sector-level parity in RAID5.
>
> As Matthew Ahrens [1] points out: "Note that setting a small recordsize
> with 4KB sector devices results in universally poor space efficiency --
> RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."
>
> Things you can do:
>
>  * Use ashift=9 (and perhaps 512-byte sector drives). The same layout
>    rules still apply, but now your 'atomic' size is 512b. You will want
>    to test performance.
>  * Use a larger volblocksize, especially if the filesystem on the zvol
>    uses a larger block size. If you aren't performance sensitive, use a
>    larger volblocksize even if the hosted filesystem doesn't. (But test
>    this out to see how performance sensitive you really are! ;) You'll
>    need to use something like dd to move data between zvols with
>    different block sizes.
>  * Enable compression if the contents are compressible (some likely
>    will be).
>  * Use a pool created from mirrors instead of raidz if you need
>    high-performance small blocks while retaining redundancy.
>
> You don't get efficient (better than mirrors) redundancy, performant
> small (as in a small multiple of the zpool's sector size) block sizes,
> and ZFS's flexibility all at once.
>
>  - Eric
>
> [1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
> [2] My spin on Ahrens' spreadsheet:
> https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing
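To sanity-check the arithmetic Eric describes above against a given pool,
the allocation rule can be sketched in a few lines of shell. The figures
below (4k sectors, a 5-disk raidz1, volblocksize=8k, i.e. roughly the
gamestop layout) are assumptions to plug in, and the script only models the
rule from the Delphix post [1]; it is not how ZFS itself computes anything.

#!/bin/sh
# Rough model of the raidz allocation rules described above.
# All parameter values are assumptions -- adjust them for your own pool.
sector=4096      # 2^ashift, i.e. ashift=12
width=5          # disks in the raidz vdev
parity=1         # raidz1
volblock=8192    # volblocksize in bytes

data=$(( (volblock + sector - 1) / sector ))       # data sectors per block
rows=$(( (data + width - parity - 1) / (width - parity) ))
psec=$(( rows * parity ))                          # parity sectors per block
sub=$(( data + psec ))
pad=$(( (parity + 1 - sub % (parity + 1)) % (parity + 1) ))  # pad to a multiple of p+1
echo "$volblock bytes logical -> $(( (sub + pad) * sector )) bytes allocated"

With these inputs it prints "8192 bytes logical -> 16384 bytes allocated",
i.e. the 2x worst case discussed above; changing volblock to 65536 drops the
estimate to 81920 bytes, or about 1.25x.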
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"