From: "Eric A. Borisch" <eborisch@gmail.com>
Date: Tue, 21 Feb 2017 17:31:32 -0600
Subject: Re: zfs raidz overhead
To: "Eugene M. Zheganin"
Cc: freebsd-fs@freebsd.org

On Tue, Feb 21, 2017 at 2:45 AM, Eugene M. Zheganin wrote:

> Hi.
>
> There's an interesting case described here:
> http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol [1]
>
> It's the story of a user who found that in some situations ZFS on raidz
> can use up to 200% of the expected space for a zvol. I have also seen this.
> For instance:
>
> [root@san1:~]# zfs get volsize gamestop/reference1
> NAME                 PROPERTY  VALUE  SOURCE
> gamestop/reference1  volsize   2,50T  local
>
> [root@san1:~]# zfs get all gamestop/reference1
> NAME                 PROPERTY              VALUE                  SOURCE
> gamestop/reference1  type                  volume                 -
> gamestop/reference1  creation              Thu Nov 24  9:09 2016  -
> gamestop/reference1  used                  4,38T                  -
> gamestop/reference1  available             1,33T                  -
> gamestop/reference1  referenced            4,01T                  -
> gamestop/reference1  compressratio         1.00x                  -
> gamestop/reference1  reservation           none                   default
> gamestop/reference1  volsize               2,50T                  local
> gamestop/reference1  volblocksize          8K                     -
> gamestop/reference1  checksum              on                     default
> gamestop/reference1  compression           off                    default
> gamestop/reference1  readonly              off                    default
> gamestop/reference1  copies                1                      default
> gamestop/reference1  refreservation        none                   received
> gamestop/reference1  primarycache          all                    default
> gamestop/reference1  secondarycache        all                    default
> gamestop/reference1  usedbysnapshots       378G                   -
> gamestop/reference1  usedbydataset         4,01T                  -
> gamestop/reference1  usedbychildren        0                      -
> gamestop/reference1  usedbyrefreservation  0                      -
> gamestop/reference1  logbias               latency                default
> gamestop/reference1  dedup                 off                    default
> gamestop/reference1  mlslabel                                     -
> gamestop/reference1  sync                  standard               default
> gamestop/reference1  refcompressratio      1.00x                  -
> gamestop/reference1  written               4,89G                  -
> gamestop/reference1  logicalused           2,72T                  -
> gamestop/reference1  logicalreferenced     2,49T                  -
> gamestop/reference1  volmode               default                default
> gamestop/reference1  snapshot_limit        none                   default
> gamestop/reference1  snapshot_count        none                   default
> gamestop/reference1  redundant_metadata    all                    default
>
> [root@san1:~]# zpool status gamestop
>   pool: gamestop
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         gamestop    ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>
> errors: No known data errors
>
> Or another server (the overhead in this case isn't as big, but still
> considerable):
>
> [root@san01:~]# zfs get all data/reference1
> NAME             PROPERTY              VALUE                  SOURCE
> data/reference1  type                  volume                 -
> data/reference1  creation              Fri Jan  6 11:23 2017  -
> data/reference1  used                  3.82T                  -
> data/reference1  available             13.0T                  -
> data/reference1  referenced            3.22T                  -
> data/reference1  compressratio         1.00x                  -
> data/reference1  reservation           none                   default
> data/reference1  volsize               2T                     local
> data/reference1  volblocksize          8K                     -
> data/reference1  checksum              on                     default
> data/reference1  compression           off                    default
> data/reference1  readonly              off                    default
> data/reference1  copies                1                      default
> data/reference1  refreservation        none                   received
> data/reference1  primarycache          all                    default
> data/reference1  secondarycache        all                    default
> data/reference1  usedbysnapshots       612G                   -
> data/reference1  usedbydataset         3.22T                  -
> data/reference1  usedbychildren        0                      -
> data/reference1  usedbyrefreservation  0                      -
> data/reference1  logbias               latency                default
> data/reference1  dedup                 off                    default
> data/reference1  mlslabel                                     -
> data/reference1  sync                  standard               default
> data/reference1  refcompressratio      1.00x                  -
> data/reference1  written               498K                   -
> data/reference1  logicalused           2.37T                  -
> data/reference1  logicalreferenced     2.00T                  -
> data/reference1  volmode               default                default
> data/reference1  snapshot_limit        none                   default
> data/reference1  snapshot_count        none                   default
> data/reference1  redundant_metadata    all                    default
>
> [root@san01:~]# zpool status data
>   pool: data
>  state: ONLINE
>   scan: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         data        ONLINE       0     0     0
>           raidz1-0  ONLINE       0     0     0
>             da3     ONLINE       0     0     0
>             da4     ONLINE       0     0     0
>             da5     ONLINE       0     0     0
>             da6     ONLINE       0     0     0
>             da7     ONLINE       0     0     0
>           raidz1-1  ONLINE       0     0     0
>             da8     ONLINE       0     0     0
>             da9     ONLINE       0     0     0
>             da10    ONLINE       0     0     0
>             da11    ONLINE       0     0     0
>             da12    ONLINE       0     0     0
>           raidz1-2  ONLINE       0     0     0
>             da13    ONLINE       0     0     0
>             da14    ONLINE       0     0     0
>             da15    ONLINE       0     0     0
>             da16    ONLINE       0     0     0
>             da17    ONLINE       0     0     0
>
> errors: No known data errors
>
> So my question is: how do I avoid this? Right now I'm experimenting with
> the volblocksize, making it around 64k. I also suspect that such overhead
> may be a consequence of various resizing operations, like extending the
> volsize of the volume or adding new disks to the pool, because I have a
> couple of servers with raidz where the initial disk/volsize configuration
> didn't change, and there the referenced/volsize numbers are pretty close
> to each other.
>
> Eugene.
>
> Links:
> ------
> [1] http://serverfault.com/questions/512018/strange-zfs-disk-space-usage-report-for-a-zvol

It comes down to the zpool's sector size (2^ashift) and the volblocksize.
I'm guessing your old servers are at ashift=9 (512) and the new one is at 12
(4096), likely with 4k drives. This is the smallest/atomic size of reads and
writes to a drive from ZFS.

As described in [1]:

* Allocations need to be a multiple of (p+1) sectors, where p is your parity
  level; for raidz1, p == 1, and allocations need to be in multiples of
  (1+1) = 2 sectors, or 8k (for ashift=12; this is the physical size /
  alignment on the drive).

* Each block also needs enough parity to survive failures, so the allocation
  also depends [2] on the number of drives in the vdev at larger
  block/record sizes.

So considering those requirements, and your zvol with volblocksize=8k and
compression=off, the allocation for one logical 8k block is always composed
physically of two (4k) data sectors, one (p=1) parity sector (4k), and one
padding sector (4k) to satisfy being a multiple of (p+1=)2 sectors, or 16k
of allocated on-disk space -- hence your observed 2x of the data size being
actually allocated. Each of these sectors lands on a different drive. This is
different from the sector-level parity in RAID5. (A small worked example of
this arithmetic is sketched at the end of this message.)

As Matthew Ahrens [1] points out: "Note that setting a small recordsize with
4KB sector devices results in universally poor space efficiency -- RAIDZ-p
is no better than p-way mirrors for recordsize=4K or 8K."

Things you can do (example commands are sketched at the end of this message):

* Use ashift=9 (and perhaps 512-byte-sector drives). The same layout rules
  still apply, but now your 'atomic' size is 512b. You will want to test
  performance.

* Use a larger volblocksize, especially if the filesystem on the zvol uses a
  larger block size. If you aren't performance sensitive, use a larger
  volblocksize even if the hosted filesystem doesn't. (But test this to see
  how performance sensitive you really are! ;) You'll need to use something
  like dd to move data between zvols with different block sizes.

* Enable compression if the contents are compressible (some likely will be).

* Use a pool built from mirrors instead of raidz if you need high-performance
  small blocks while retaining redundancy.

You don't get efficient (better than mirrors) redundancy, performant small
(as in a small multiple of the zpool's sector size) block sizes, and ZFS's
flexibility all at once.

 - Eric

[1] https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz
[2] My spin on Ahrens's spreadsheet:
    https://docs.google.com/spreadsheets/d/13sJPc6ZW6_441vWAUiSvKMReJW4z34Ix5JSs44YXRyM/edit?usp=sharing
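
[Editorial addendum] To make the allocation arithmetic above concrete, here
is a minimal sh sketch of the rounding rules described in [1] and [2]. The
parameters are assumptions matching the first pool (raidz1, ashift=12, five
disks, volblocksize=8k, compression off), not something taken from the
poster's configs beyond what is shown above:

    #!/bin/sh
    # Sketch of the raidz allocation rule for a single logical block.
    sector=4096     # 2^ashift (assumed ashift=12)
    p=1             # raidz1 parity level
    ndisks=5        # disks in the raidz vdev
    blk=8192        # volblocksize

    # Data sectors needed for the block (ceiling division):
    data=$(( (blk + sector - 1) / sector ))               # -> 2
    # Each raidz row holds up to (ndisks - p) data sectors and gets p parity:
    rows=$(( (data + (ndisks - p) - 1) / (ndisks - p) ))  # -> 1
    parity=$(( p * rows ))                                # -> 1
    total=$(( data + parity ))                            # -> 3 sectors
    # Pad the allocation up to a multiple of (p+1) sectors:
    alloc=$(( ( (total + p) / (p + 1) ) * (p + 1) * sector ))
    echo "${blk}-byte logical block -> ${alloc} bytes allocated"

Run as-is it prints 16384, i.e. the 2x allocation observed for
gamestop/reference1. Setting blk=65536 gives 81920 (about 1.25x), which is
why a larger volblocksize brings the overhead back down toward the nominal
raidz1 parity cost.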
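
[Editorial addendum] And a hedged sketch of the mitigation steps from the
list above. The dataset name reference1_64k is made up for illustration,
lz4 is just one compression choice, and dd over the raw zvol copies the
whole device (including free space); verify ashift and test performance on
scratch data before committing:

    # One way to check the pool's sector size exponent (ashift):
    zdb -C gamestop | grep ashift

    # volblocksize is fixed at creation time, so create a new zvol with a
    # larger block size and copy the data across:
    zfs create -V 2.5T -o volblocksize=64K gamestop/reference1_64k
    dd if=/dev/zvol/gamestop/reference1 \
       of=/dev/zvol/gamestop/reference1_64k bs=1M

    # Optionally enable compression on the new zvol (applies to new writes):
    zfs set compression=lz4 gamestop/reference1_64k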