Date: Fri, 23 Mar 2012 09:40:24 -0700 (PDT)
From: Dennis Glatting <dg@pki2.com>
To: Taylor <j.freebsd-zfs@enone.net>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS extra space overhead for ashift=12 vs ashift=9 raidz2 pool?
Message-ID: <alpine.BSF.2.00.1203230937580.89054@btw.pki2.com>
In-Reply-To: <45654FDD-A20A-47C8-B3B5-F9B0B71CC38B@enone.net>
References: <45654FDD-A20A-47C8-B3B5-F9B0B71CC38B@enone.net>
Somewhat related: I am also using 4TB Hitachi drives, but only four. Although I am fairly happy with these drives, I have had one disk fail in the two months I have been using them. This may have been an infant-mortality failure, but I am wondering whether you have had any similar experiences with these drives.

On Fri, 23 Mar 2012, Taylor wrote:

> Hello,
>
> I'm bringing up a new ZFS filesystem and have noticed something strange with respect to the overhead from ZFS. When I create a raidz2 pool with 512-byte sectors (ashift=9), I have an overhead of 2.59%, but when I create the zpool using 4K sectors (ashift=12), I have an overhead of 8.06%. This amounts to a difference of 2.79 TiB in my particular application, which I'd like to avoid. :)
>
> (Assuming I haven't done anything wrong. :) ) Is the extra overhead for 4K-sector (ashift=12) raidz2 pools expected? Is there any way to reduce it?
>
> (In my very limited performance testing, 4K sectors do seem to perform slightly better and more consistently, so I'd like to use them if I can avoid the extra overhead.)
>
> Details below.
>
> Thanks in advance for your time,
>
> -Taylor
>
>
> I'm running:
> FreeBSD host 9.0-RELEASE FreeBSD 9.0-RELEASE #0 amd64
>
> I'm using Hitachi 4TB Deskstar 0S03364 drives, which are 4K-sector devices.
>
> In order to "future proof" the raidz2 pool against possible variations in replacement drive size, I've created a single partition on each drive, starting at sector 2048 and using 100MB less than the total available space on the disk.
>
> $ sudo gpart list da2
> Geom name: da2
> modified: false
> state: OK
> fwheads: 255
> fwsectors: 63
> last: 7814037134
> first: 34
> entries: 128
> scheme: GPT
> Providers:
> 1. Name: da2p1
>    Mediasize: 4000682172416 (3.7T)
>    Sectorsize: 512
>    Stripesize: 0
>    Stripeoffset: 1048576
>    Mode: r1w1e1
>    rawuuid: 71ebbd49-7241-11e1-b2dd-00259055e634
>    rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
>    label: (null)
>    length: 4000682172416
>    offset: 1048576
>    type: freebsd-zfs
>    index: 1
>    end: 7813834415
>    start: 2048
> Consumers:
> 1. Name: da2
>    Mediasize: 4000787030016 (3.7T)
>    Sectorsize: 512
>    Mode: r1w1e2
>
> Each partition gives me 4000682172416 bytes (3.64 TiB). I'm using 16 drives. I create the zpool with 4K sectors as follows:
>
> $ sudo gnop create -S 4096 /dev/da2p1
> $ sudo zpool create zav raidz2 da2p1.nop da3p1 da4p1 da5p1 da6p1 da7p1 da8p1 da9p1 da10p1 da11p1 da12p1 da13p1 da14p1 da15p1 da16p1 da17p1
>
> I confirm ashift=12:
>
> $ sudo zdb zav | grep ashift
> ashift: 12
> ashift: 12
>
> "zpool list" approximately matches the expected raw capacity of 16 * 4000682172416 = 64010914758656 bytes (58.28 TiB):
>
> $ zpool list zav
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> zav     58T  1.34M  58.0T     0%  1.00x  ONLINE  -
>
> For raidz2, I'd expect to see 4000682172416 * 14 = 56009550413824 bytes (50.94 TiB) of usable space. However, I only get:
>
> $ zfs list zav
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> zav   1.10M  46.8T   354K  /zav
>
> Or, using df for greater accuracy:
>
> $ df zav
> Filesystem   1K-blocks  Used        Avail  Capacity  Mounted on
> zav        50288393472   354  50288393117        0%  /zav
>
> A total of 51495314915328 bytes (46.83 TiB). (This is for a freshly created zpool, before any snapshots, etc. have been performed.)
>
> I measure overhead as (expected - actual) / expected, which for the 4K-sector (ashift=12) raidz2 pool comes to 8.06%.
>
> To create a 512-byte-sector (ashift=9) raidz2 pool, I basically just replace "da2p1.nop" with "da2p1" when creating the zpool. I confirm ashift=9.
> The zpool raw size is the same (as far as I can tell with the limited precision of zpool list). However, the available size according to zfs list/df is 54560512935936 bytes (49.62 TiB), which amounts to an overhead of 2.59%. There are some minor differences in the ALLOC and USED listings, so I repeat them here for the 512-byte-sector raidz2 pool:
>
> $ zpool list zav
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> zav     58T   228K  58.0T     0%  1.00x  ONLINE  -
>
> $ zfs list zav
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> zav    198K  49.6T  73.0K  /zav
>
> $ df zav
> Filesystem   1K-blocks  Used        Avail  Capacity  Mounted on
> zav        53281750914    73  53281750841        0%  /zav
>
> I expect some overhead from ZFS, and according to this blog post:
> http://www.cuddletech.com/blog/pivot/entry.php?id=1013
> (via http://mail.opensolaris.org/pipermail/zfs-discuss/2010-May/041773.html)
> there may be a 1/64 (about 1.56%) overhead baked into ZFS. Interestingly enough, when I create a pool with no raid/mirroring, I get an overhead of 1.93% regardless of ashift=9 or ashift=12, which is quite close to the 1/64 number. I have also tested raidz, which behaves similarly to raidz2, although the overhead is slightly lower in each case: 1) ashift=9 raidz overhead is 2.33%, and 2) ashift=12 raidz overhead is 7.04%.
>
> In order to preserve space, I've put the zdb listings for both the ashift=9 and ashift=12 raidz2 pools here:
> http://pastebin.com/v2xjZkNw
>
> There are also some differences in the zdb output; for example, "SPA allocated" is higher in the 4K-sector raidz2 pool, which seems interesting, although I don't comprehend the significance of this.
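
For reference, the two overhead figures quoted above can be reproduced straight from the byte counts in the zfs list/df output. A quick check with sh and bc(1), assuming nothing beyond the numbers already in the message (14 data disks of 4000682172416 bytes each as the expected usable space):

  $ expected=$(( 14 * 4000682172416 ))   # 56009550413824 bytes (50.94 TiB)
  $ echo "scale=6; ($expected - 54560512935936) / $expected * 100" | bc   # ashift=9 pool
  2.587100
  $ echo "scale=6; ($expected - 51495314915328) / $expected * 100" | bc   # ashift=12 pool
  8.059700

That agrees with the 2.59% and 8.06% figures above, and the gap between the two pools, 54560512935936 - 51495314915328 = 3065198020608 bytes, is the roughly 2.79 TiB difference mentioned at the top of the message.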
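
One small note on the pool-creation step quoted above: only da2p1 is wrapped in a gnop provider there, which (as the zdb output confirms) is already enough to push the whole vdev to ashift=12. If you would rather have every member advertise 4K sectors while the pool is being created, a loop along these lines should work; this is only a sketch, assuming the same da2p1..da17p1 partitions used above:

  # wrap every member partition in a 4K gnop provider, not just da2p1
  for d in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17; do
      sudo gnop create -S 4096 /dev/da${d}p1
  done

  # then build the raidz2 vdev from the .nop devices
  sudo zpool create zav raidz2 \
      da2p1.nop da3p1.nop da4p1.nop da5p1.nop da6p1.nop da7p1.nop \
      da8p1.nop da9p1.nop da10p1.nop da11p1.nop da12p1.nop da13p1.nop \
      da14p1.nop da15p1.nop da16p1.nop da17p1.nop

Either way, the .nop providers are only needed at creation time: after a zpool export zav, gnop destroy of the .nop devices, and zpool import zav, the pool keeps the ashift recorded in its labels.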