Date: Tue, 21 Feb 2017 17:31:32 -0600
From: "Eric A. Borisch" <eborisch@gmail.com>
To: "Eugene M. Zheganin" <emz@norma.perm.ru>
Cc: "freebsd-fs@freebsd.org" <freebsd-fs@freebsd.org>
Subject: Re: zfs raidz overhead
Message-ID: <CAASnNnpB7NFWUbBLxKidXzsDMAwzcJzRc_f4R-9JG_=BZ9fA+A@mail.gmail.com>
In-Reply-To: <1b54a2fe35407a95edca1f992fa08a71@norman-vivat.ru>
References: <1b54a2fe35407a95edca1f992fa08a71@norman-vivat.ru>
On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin <emz@norma.perm.ru> wrote:

Hi.

There's an interesting case described here:
http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol [1]

It's a report from a user who found that in some situations ZFS on raidz can use up to 200% of the expected space for a zvol. I have also seen this. For instance:

[root@san1:~]# zfs get volsize gamestop/reference1
NAME                 PROPERTY  VALUE  SOURCE
gamestop/reference1  volsize   2,50T  local

[root@san1:~]# zfs get all gamestop/reference1
NAME                 PROPERTY              VALUE                  SOURCE
gamestop/reference1  type                  volume                 -
gamestop/reference1  creation              Thu Nov 24  9:09 2016  -
gamestop/reference1  used                  4,38T                  -
gamestop/reference1  available             1,33T                  -
gamestop/reference1  referenced            4,01T                  -
gamestop/reference1  compressratio         1.00x                  -
gamestop/reference1  reservation           none                   default
gamestop/reference1  volsize               2,50T                  local
gamestop/reference1  volblocksize          8K                     -
gamestop/reference1  checksum              on                     default
gamestop/reference1  compression           off                    default
gamestop/reference1  readonly              off                    default
gamestop/reference1  copies                1                      default
gamestop/reference1  refreservation        none                   received
gamestop/reference1  primarycache          all                    default
gamestop/reference1  secondarycache        all                    default
gamestop/reference1  usedbysnapshots       378G                   -
gamestop/reference1  usedbydataset         4,01T                  -
gamestop/reference1  usedbychildren        0                      -
gamestop/reference1  usedbyrefreservation  0                      -
gamestop/reference1  logbias               latency                default
gamestop/reference1  dedup                 off                    default
gamestop/reference1  mlslabel                                     -
gamestop/reference1  sync                  standard               default
gamestop/reference1  refcompressratio      1.00x                  -
gamestop/reference1  written               4,89G                  -
gamestop/reference1  logicalused           2,72T                  -
gamestop/reference1  logicalreferenced     2,49T                  -
gamestop/reference1  volmode               default                default
gamestop/reference1  snapshot_limit        none                   default
gamestop/reference1  snapshot_count        none                   default
gamestop/reference1  redundant_metadata    all                    default

[root@san1:~]# zpool status gamestop
  pool: gamestop
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        gamestop    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da11    ONLINE       0     0     0

errors: No known data errors

or, another server (the overhead here isn't as big, but is still considerable):

[root@san01:~]# zfs get all data/reference1
NAME             PROPERTY              VALUE                  SOURCE
data/reference1  type                  volume                 -
data/reference1  creation              Fri Jan  6 11:23 2017  -
data/reference1  used                  3.82T                  -
data/reference1  available             13.0T                  -
data/reference1  referenced            3.22T                  -
data/reference1  compressratio         1.00x                  -
data/reference1  reservation           none                   default
data/reference1  volsize               2T                     local
data/reference1  volblocksize          8K                     -
data/reference1  checksum              on                     default
data/reference1  compression           off                    default
data/reference1  readonly              off                    default
data/reference1  copies                1                      default
data/reference1  refreservation        none                   received
data/reference1  primarycache          all                    default
data/reference1  secondarycache        all                    default
data/reference1  usedbysnapshots       612G                   -
data/reference1  usedbydataset         3.22T                  -
data/reference1  usedbychildren        0                      -
data/reference1  usedbyrefreservation  0                      -
data/reference1  logbias               latency                default
data/reference1  dedup                 off                    default
data/reference1  mlslabel                                     -
data/reference1  sync                  standard               default
data/reference1  refcompressratio      1.00x                  -
data/reference1  written               498K                   -
data/reference1  logicalused           2.37T                  -
data/reference1  logicalreferenced     2.00T                  -
data/reference1  volmode               default                default
data/reference1  snapshot_limit        none                   default
data/reference1  snapshot_count        none                   default
data/reference1  redundant_metadata    all                    default

[root@san01:~]# zpool status data
  pool: data
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0
            da6     ONLINE       0     0     0
            da7     ONLINE       0     0     0
          raidz1-1  ONLINE       0     0     0
            da8     ONLINE       0     0     0
            da9     ONLINE       0     0     0
            da10    ONLINE       0     0     0
            da11    ONLINE       0     0     0
            da12    ONLINE       0     0     0
          raidz1-2  ONLINE       0     0     0
            da13    ONLINE       0     0     0
            da14    ONLINE       0     0     0
            da15    ONLINE       0     0     0
            da16    ONLINE       0     0     0
            da17    ONLINE       0     0     0

errors: No known data errors

So my question is: how do I avoid this? Right now I'm experimenting with the volblocksize, making it around 64k. I also suspect that this overhead may be a consequence of various resizing operations, like extending the volsize of the volume or adding new disks to the pool, because I have a couple of servers with raidz where the initial disk/volsize configuration didn't change, and there the referenced/volsize numbers are pretty close to each other.

Eugene.

Links:
------
[1] http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol


It comes down to the zpool's sector size (2^ashift) and the volblocksize -- I'm guessing your old servers are at ashift=9 (512), and the new one is at 12 (4096), likely with 4k drives. This is the smallest/atomic size of reads and writes to a drive from ZFS.

As described in [1]:

* Allocations need to be a multiple of (p+1) sectors, where p is your parity level; for raidz1, p==1, and allocations need to be in multiples of (1+1)=2 sectors, or 8k (for ashift=12; this is the physical size / alignment on drive).

* An allocation also needs enough parity to survive failures, so at larger block/record sizes it also depends [2] on the number of drives in the pool.

So considering those requirements, and your zvol with volblocksize=8k and compression=off, the allocation for one logical 8k block is always composed physically of two (4k) data sectors, one (p=1) parity sector (4k), and one padding sector (4k) to satisfy being a multiple of (p+1=)2, or 16k of allocated on-disk space -- hence your observed 2x of the data size actually being allocated. Each of these sectors will be on a different drive. This is different from the sector-level parity in RAID5.

As Matthew Ahrens points out in [1]: "Note that setting a small recordsize with 4KB sector devices results in universally poor space efficiency -- RAIDZ-p is no better than p-way mirrors for recordsize=4K or 8K."

Things you can do:

* Use ashift=9 (and perhaps 512-byte-sector drives). The same layout rules still apply, but now your 'atomic' size is 512b. You will want to test performance.

* Use a larger volblocksize, especially if the filesystem on the zvol uses a larger block size. If you aren't performance sensitive, use a larger volblocksize even if the hosted filesystem doesn't. (But test this out to see how performance sensitive you really are! ;) You'll need to use something like dd to move data between zvols with different block sizes.

* Enable compression if the contents are compressible (some likely will be).

* Use a pool created from mirrors instead of raidz if you need high-performance small blocks while retaining redundancy.

You don't get efficient (better than mirrors) redundancy, performant small (as in a small multiple of the zpool's sector size) block sizes, and ZFS's flexibility all at once.
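As a quick way to confirm which case a given pool is in, something like the following should show the ashift recorded for each top-level vdev. This is only a sketch -- zdb's exact invocation and output vary between releases, and the pool name here is just the one from this thread:

  zdb -C gamestop | grep ashift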
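To make the 8k-block arithmetic above concrete, here is a minimal shell sketch of the same math. It assumes ashift=12 (4k sectors), raidz1 (p=1), volblocksize=8k, and a block narrow enough to need only a single parity sector; it is plain integer arithmetic, not anything ZFS reports:

  sector=4096                              # one allocation unit: 2^ashift with ashift=12
  data=$(( 8192 / sector ))                # 2 data sectors per 8k logical block
  parity=1                                 # raidz1: one parity sector for this row
  subtotal=$(( data + parity ))            # 3 sectors so far
  pad=$(( (2 - subtotal % 2) % 2 ))        # round up to a multiple of (p+1)=2 sectors
  echo "$(( (subtotal + pad) * sector )) bytes allocated per 8k block"   # prints 16384, i.e. 2x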
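For reference, the suggestions above translate into commands roughly like the sketch below. Treat it as illustrative only: the new zvol name, the mirror pool name, and the da20-da23 disks are made up for the example, the sizes and options should be checked against zfs(8) and zpool(8) on your release, and volblocksize can only be set when a zvol is created (hence the dd copy).

  # Create a second zvol with a larger volblocksize (2560g = 2.5T):
  zfs create -V 2560g -o volblocksize=64k gamestop/reference1_64k

  # Copy the contents across at the block-device level (FreeBSD dd syntax):
  dd if=/dev/zvol/gamestop/reference1 of=/dev/zvol/gamestop/reference1_64k bs=1m

  # Enable cheap compression where the contents are compressible:
  zfs set compression=lz4 gamestop/reference1_64k

  # Or build a pool out of mirrors for small-block workloads:
  zpool create tank0 mirror da20 da21 mirror da22 da23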
 - Eric

[1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
[2] My spin on Ahrens's spreadsheet: https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing