Date: Wed, 22 Feb 2017 22:50:01 +0100
From: Wiktor Niesiobedzki <bsd@vink.pl>
To: "Eric A. Borisch" <eborisch@gmail.com>
Cc: "Eugene M. Zheganin" <emz@norma.perm.ru>, "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zfs raidz overhead
Message-ID: <CAH17caWPRtJVpTQNrqaabtYt7xR%2Boc-eL87tvea=pXjG12oEJg@mail.gmail.com>
In-Reply-To: <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA%2BA@mail.gmail.com>
References: <1b54a2fe35407a95edca1f992fa08a71@norman-vivat.ru> <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA%2BA@mail.gmail.com>
I can add to this that it is not only seen on raidz, but also on mirror pools, such as this one:

# zpool status tank
  pool: tank
 state: ONLINE
  scan: scrub repaired 0 in 3h22m with 0 errors on Thu Feb  9 06:47:07 2017
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          mirror-0         ONLINE       0     0     0
            gpt/tank1.eli  ONLINE       0     0     0
            gpt/tank2.eli  ONLINE       0     0     0

errors: No known data errors

When I created test zvols:

# zfs create -V10gb -o volblocksize=8k tank/tst-8k
# zfs create -V10gb -o volblocksize=16k tank/tst-16k
# zfs create -V10gb -o volblocksize=32k tank/tst-32k
# zfs create -V10gb -o volblocksize=64k tank/tst-64k
# zfs create -V10gb -o volblocksize=128k tank/tst-128k

# zfs get used tank/tst-8k
NAME         PROPERTY  VALUE  SOURCE
tank/tst-8k  used      10.3G  -
root@kadlubek:~ # zfs get used tank/tst-16k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-16k  used      10.2G  -
root@kadlubek:~ # zfs get used tank/tst-32k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-32k  used      10.1G  -
root@kadlubek:~ # zfs get used tank/tst-64k
NAME          PROPERTY  VALUE  SOURCE
tank/tst-64k  used      10.0G  -
root@kadlubek:~ # zfs get used tank/tst-128k
NAME           PROPERTY  VALUE  SOURCE
tank/tst-128k  used      10.0G  -

So this might not be related only to raidz pools.

I have also noticed that snapshots inflate the used statistic by far more than the usedbysnapshots value:

# zfs get volsize,used,referenced,compressratio,volblocksize,usedbysnapshots,usedbydataset,usedbychildren tank/dkr-thinpool
NAME               PROPERTY         VALUE  SOURCE
tank/dkr-thinpool  volsize          10G    local
tank/dkr-thinpool  used             12.0G  -
tank/dkr-thinpool  referenced       1.87G  -
tank/dkr-thinpool  compressratio    1.91x  -
tank/dkr-thinpool  volblocksize     64K    -
tank/dkr-thinpool  usedbysnapshots  90.4M  -
tank/dkr-thinpool  usedbydataset    1.87G  -
tank/dkr-thinpool  usedbychildren   0      -

On a 10G volume filled with 2G of data, with only 90M used by snapshots, used is 12.0G, 2G more than the volsize. When I destroy the snapshots, used will drop to 10.0G.
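For anyone who wants to see where that extra space is accounted, here is a quick sketch using the dataset names from my pool above (the -p flag prints exact, parseable byte values). My expectation, which I have not verified in detail, is that for the freshly created tst-* volumes the gap sits under refreservation/usedbyrefreservation, while for dkr-thinpool it only appears once snapshots pin old blocks:

#!/bin/sh
# Sketch: show which accounting bucket the extra "used" space falls into
# for the volumes discussed above.
for vol in tank/tst-8k tank/tst-128k tank/dkr-thinpool; do
        zfs get -p volsize,used,logicalused,refreservation,usedbyrefreservation,usedbydataset,usedbysnapshots "$vol"
done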
Cheers,
Wiktor

2017-02-22 0:31 GMT+01:00 Eric A. Borisch <eborisch@gmail.com>:
> On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz@norma.perm.ru>
> wrote:
>
> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
> [1]
>
> It's a story from a user who found that in some situations ZFS on
> raidz can use up to 200% of the space for a zvol.
>
> I have also seen this. For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
> NAME                 PROPERTY  VALUE  SOURCE
> gamestop/reference1  volsize   2,50T  local
> [root@san1:~]# zfs get all gamestop/reference1
> NAME                 PROPERTY              VALUE                  SOURCE
> gamestop/reference1  type                  volume                 -
> gamestop/reference1  creation              Thu Nov 24  9:09 2016  -
> gamestop/reference1  used                  4,38T                  -
> gamestop/reference1  available             1,33T                  -
> gamestop/reference1  referenced            4,01T                  -
> gamestop/reference1  compressratio         1.00x                  -
> gamestop/reference1  reservation           none                   default
> gamestop/reference1  volsize               2,50T                  local
> gamestop/reference1  volblocksize          8K                     -
> gamestop/reference1  checksum              on                     default
> gamestop/reference1  compression           off                    default
> gamestop/reference1  readonly              off                    default
> gamestop/reference1  copies                1                      default
> gamestop/reference1  refreservation        none                   received
> gamestop/reference1  primarycache          all                    default
> gamestop/reference1  secondarycache        all                    default
> gamestop/reference1  usedbysnapshots       378G                   -
> gamestop/reference1  usedbydataset         4,01T                  -
> gamestop/reference1  usedbychildren        0                      -
> gamestop/reference1  usedbyrefreservation  0                      -
> gamestop/reference1  logbias               latency                default
> gamestop/reference1  dedup                 off                    default
> gamestop/reference1  mlslabel                                     -
> gamestop/reference1  sync                  standard               default
> gamestop/reference1  refcompressratio      1.00x                  -
> gamestop/reference1  written               4,89G                  -
> gamestop/reference1  logicalused           2,72T                  -
> gamestop/reference1  logicalreferenced     2,49T                  -
> gamestop/reference1  volmode               default                default
> gamestop/reference1  snapshot_limit        none                   default
> gamestop/reference1  snapshot_count        none                   default
> gamestop/reference1  redundant_metadata    all                    default
>
> [root@san1:~]# zpool status gamestop
>   pool: gamestop
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         gamestop    ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>
> errors: No known data errors
>
> or, on another server (the overhead in this case isn't that big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
> NAME             PROPERTY              VALUE                  SOURCE
> data/reference1  type                  volume                 -
> data/reference1  creation              Fri Jan  6 11:23 2017  -
> data/reference1  used                  3.82T                  -
> data/reference1  available             13.0T                  -
> data/reference1  referenced            3.22T                  -
> data/reference1  compressratio         1.00x                  -
> data/reference1  reservation           none                   default
> data/reference1  volsize               2T                     local
> data/reference1  volblocksize          8K                     -
> data/reference1  checksum              on                     default
> data/reference1  compression           off                    default
> data/reference1  readonly              off                    default
> data/reference1  copies                1                      default
> data/reference1  refreservation        none                   received
> data/reference1  primarycache          all                    default
> data/reference1  secondarycache        all                    default
> data/reference1  usedbysnapshots       612G                   -
> data/reference1  usedbydataset         3.22T                  -
> data/reference1  usedbychildren        0                      -
> data/reference1  usedbyrefreservation  0                      -
> data/reference1  logbias               latency                default
> data/reference1  dedup                 off                    default
> data/reference1  mlslabel                                     -
> data/reference1  sync                  standard               default
> data/reference1  refcompressratio      1.00x                  -
> data/reference1  written               498K                   -
> data/reference1  logicalused           2.37T                  -
> data/reference1  logicalreferenced     2.00T                  -
> data/reference1  volmode               default                default
> data/reference1  snapshot_limit        none                   default
> data/reference1  snapshot_count        none                   default
> data/reference1  redundant_metadata    all                    default
>
> [root@san01:~]# zpool status data
>   pool: data
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>           raidz1-1  ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>           raidz1-2  ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>
> errors: No known data errors
>
> So my question is - how to avoid it? Right now I'm experimenting with
> the volblocksize, making it around 64k. I also suspect that such
> overhead may be a consequence of various resizing operations, like
> extending the volsize of the volume or adding new disks into the pool,
> because I have a couple of servers with raidz where the initial
> disk/volsize configuration didn't change, and the referenced/volsize
> numbers are pretty much close to each other.
>
> Eugene.
>
> Links:
> ------
> [1]
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol
>
>
> It comes down to the zpool's sector size (2^ashift) and the volblocksize --
> I'm guessing your old servers are at ashift=9 (512), and the new one is at
> 12 (4096), likely with 4k drives. This is the smallest/atomic size of reads
> & writes to a drive from ZFS.
>
> As described in [1]:
>  * Allocations need to be a multiple of (p+1) sectors, where p is your
> parity level; for raidz1, p==1, and allocations need to be in multiples of
> (1+1)=2 sectors, or 8k (for ashift=12; this is the physical size /
> alignment on drive).
>  * An allocation also needs to have enough parity for failures, so it also
> depends [2] on the number of drives in the pool at larger block/record sizes.
>
> So considering those requirements, and your zvol with volblocksize=8k and
> compression=off, allocations for one logical 8k block are always composed
> physically of two (4k) data sectors, one (p=1) parity sector (4k), and one
> padding sector (4k) to satisfy being a multiple of (p+1=) 2, or 16k of
> allocated on-disk space, hence your observed 2x data size actually being
> allocated. Each of these sectors will be on a different drive. This is
> different from the sector-level parity in RAID5.
>
> As Matthew Ahrens [1] points out: "Note that setting a small recordsize
> with 4KB sector devices results in universally poor space efficiency --
> RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."
>
> Things you can do:
>
>  * Use ashift=9 (and perhaps 512-byte sector drives). The same layout rules
> still apply, but now your 'atomic' size is 512b. You will want to test
> performance.
>  * Use a larger volblocksize, especially if the filesystem on the zvol uses
> a larger block size. If you aren't performance sensitive, use a larger
> volblocksize even if the hosted filesystem doesn't. (But test this out to
> see how performance sensitive you really are! ;) You'll need to use
> something like dd to move data between zvols with different block sizes.
>  * Enable compression if the contents are compressible (some likely will
> be).
>  * Use a pool created from mirrors instead of raidz if you need
> high-performance small blocks while retaining redundancy.
>
> You don't get efficient (better than mirrors) redundancy, performant small
> (as in a small multiple of the zpool's sector size) block sizes, and ZFS's
> flexibility all at once.
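As an illustration of the allocation rule described above, the arithmetic can be reproduced with a small script. This is only a sketch: ashift=12, a 5-disk raidz1 and parity=1 are assumptions (matching the guess above and Eugene's first pool), so adjust them for your own layout.

#!/bin/sh
# Rough estimate of the on-disk allocation for one logical block on raidz:
# data sectors, plus parity sectors, rounded up (padded) to a multiple of
# (parity + 1) sectors.  The parameters below are assumptions.
ashift=12     # 2^12 = 4096-byte sectors (ashift=9 would be 512-byte sectors)
ndisks=5      # disks in the raidz vdev
parity=1      # raidz1
sector=$((1 << ashift))

for vbs in 8192 16384 32768 65536 131072; do
        data=$(( (vbs + sector - 1) / sector ))
        par=$(( parity * ((data + ndisks - parity - 1) / (ndisks - parity)) ))
        alloc=$(( data + par ))
        # pad up to a multiple of (parity + 1) sectors
        alloc=$(( ((alloc + parity) / (parity + 1)) * (parity + 1) ))
        echo "volblocksize=$vbs -> $((alloc * sector)) bytes allocated ($(( (alloc * sector * 100) / vbs ))%)"
done

For volblocksize=8k this predicts 16k allocated per logical 8k block (200%), in line with what is described above; at volblocksize=64k the same layout predicts roughly 125%.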
>
>  - Eric
>
> [1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
> [2] My spin on Ahrens' spreadsheet:
> https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing
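On the "use a larger volblocksize" suggestion quoted above, a migration along those lines might look like the sketch below. The target name data/reference1-64k and the 64k/lz4 settings are placeholders for illustration, not a recommendation for any particular pool; quiesce writers before copying and compare the space accounting before destroying the original.

#!/bin/sh
# Sketch: recreate a zvol with a larger volblocksize and compression, then
# copy the contents over with dd.  Names and sizes are placeholders.
set -e
zfs create -V 2T -o volblocksize=64k -o compression=lz4 data/reference1-64k
# Raw block copy from the old zvol to the new one (make sure nothing is
# writing to the source while this runs).
dd if=/dev/zvol/data/reference1 of=/dev/zvol/data/reference1-64k bs=1m
# Compare space accounting before deciding to destroy the original:
zfs get volsize,used,logicalused,volblocksize,compressratio data/reference1 data/reference1-64k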