Date: Fri, 23 Mar 2012 09:40:24 -0700 (PDT)
From: Dennis Glatting <dg@pki2.com>
To: Taylor <j.freebsd-zfs@enone.net>
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS extra space overhead for ashift=12 vs ashift=9 raidz2 pool?
Message-ID: <alpine.BSF.2.00.1203230937580.89054@btw.pki2.com>
In-Reply-To: <45654FDD-A20A-47C8-B3B5-F9B0B71CC38B@enone.net>
References: <45654FDD-A20A-47C8-B3B5-F9B0B71CC38B@enone.net>
Somewhat related: I am also using 4TB Hitachi drives, but only four. Although I am fairly happy with these drives, I have had one disk fail in the two months I have been using them. This may have been an infant-mortality failure, but I am wondering whether you have had any similar experiences with these drives.

On Fri, 23 Mar 2012, Taylor wrote:

> Hello,
>
> I'm bringing up a new ZFS filesystem and have noticed something strange with respect to the overhead from ZFS. When I create a raidz2 pool with 512-byte sectors (ashift=9), I have an overhead of 2.59%, but when I create the zpool using 4K sectors (ashift=12), I have an overhead of 8.06%. This amounts to a difference of 2.79 TiB in my particular application, which I'd like to avoid. :)
>
> (Assuming I haven't done anything wrong. :) ) Is the extra overhead for 4K-sector (ashift=12) raidz2 pools expected? Is there any way to reduce it?
>
> (In my very limited performance testing, 4K sectors do seem to perform slightly better and more consistently, so I'd like to use them if I can avoid the extra overhead.)
>
> Details below.
>
> Thanks in advance for your time,
>
> -Taylor
>
>
> I'm running:
> FreeBSD host 9.0-RELEASE FreeBSD 9.0-RELEASE #0 amd64
>
> I'm using Hitachi 4TB Deskstar 0S03364 drives, which are 4K-sector devices.
>
> In order to "future proof" the raidz2 pool against possible variations in replacement drive size, I've created a single partition on each drive, starting at sector 2048 and using 100MB less than the total available space on the disk.
>
> $ sudo gpart list da2
> Geom name: da2
> modified: false
> state: OK
> fwheads: 255
> fwsectors: 63
> last: 7814037134
> first: 34
> entries: 128
> scheme: GPT
> Providers:
> 1. Name: da2p1
>    Mediasize: 4000682172416 (3.7T)
>    Sectorsize: 512
>    Stripesize: 0
>    Stripeoffset: 1048576
>    Mode: r1w1e1
>    rawuuid: 71ebbd49-7241-11e1-b2dd-00259055e634
>    rawtype: 516e7cba-6ecf-11d6-8ff8-00022d09712b
>    label: (null)
>    length: 4000682172416
>    offset: 1048576
>    type: freebsd-zfs
>    index: 1
>    end: 7813834415
>    start: 2048
> Consumers:
> 1. Name: da2
>    Mediasize: 4000787030016 (3.7T)
>    Sectorsize: 512
>    Mode: r1w1e2
>
> Each partition gives me 4000682172416 bytes (3.64 TiB). I'm using 16 drives. I create the zpool with 4K sectors as follows:
>
> $ sudo gnop create -S 4096 /dev/da2p1
> $ sudo zpool create zav raidz2 da2p1.nop da3p1 da4p1 da5p1 da6p1 da7p1 da8p1 da9p1 da10p1 da11p1 da12p1 da13p1 da14p1 da15p1 da16p1 da17p1
>
> I confirm ashift=12:
>
> $ sudo zdb zav | grep ashift
> ashift: 12
> ashift: 12
>
> "zpool list" approximately matches the expected raw capacity of 16 * 4000682172416 = 64010914758656 bytes (58.28 TiB):
>
> $ zpool list zav
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> zav     58T  1.34M  58.0T     0%  1.00x  ONLINE  -
>
> For raidz2, I'd expect to see 4000682172416 * 14 = 56009550413824 bytes (50.94 TiB) of usable space. However, I only get:
>
> $ zfs list zav
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> zav   1.10M  46.8T   354K  /zav
>
> Or, using df for greater accuracy:
>
> $ df zav
> Filesystem   1K-blocks  Used        Avail  Capacity  Mounted on
> zav        50288393472   354  50288393117        0%  /zav
>
> A total of 51495314915328 bytes (46.83 TiB). (This is for a freshly created zpool, before any snapshots, etc. have been performed.)
>
> I measure overhead as (expected - actual) / expected, which for the 4K-sector (ashift=12) raidz2 pool comes to 8.06%.
>
> To create a 512-byte-sector (ashift=9) raidz2 pool, I basically just replace "da2p1.nop" with "da2p1" when creating the zpool. I confirm ashift=9.
> The zpool raw size is the same (as far as I can tell with the limited precision of zpool list). However, the available size according to zfs list/df is 54560512935936 bytes (49.62 TiB), which amounts to an overhead of 2.59%. There are some minor differences in the ALLOC and USED listings, so I repeat them here for the 512-byte-sector raidz2 pool:
>
> $ zpool list zav
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
> zav     58T   228K  58.0T     0%  1.00x  ONLINE  -
>
> $ zfs list zav
> NAME   USED  AVAIL  REFER  MOUNTPOINT
> zav    198K  49.6T  73.0K  /zav
>
> $ df zav
> Filesystem   1K-blocks  Used        Avail  Capacity  Mounted on
> zav        53281750914    73  53281750841        0%  /zav
>
> I expect some overhead from ZFS, and according to this blog post:
> http://www.cuddletech.com/blog/pivot/entry.php?id=1013
> (via http://mail.opensolaris.org/pipermail/zfs-discuss/2010-May/041773.html)
> there may be a 1/64 (about 1.56%) overhead baked into ZFS. Interestingly enough, when I create a pool with no raid/mirroring, I get an overhead of 1.93% regardless of ashift=9 or ashift=12, which is quite close to the 1/64 number. I have also tested raidz, which behaves similarly to raidz2, although the overhead is slightly lower in each case: 1) ashift=9 raidz overhead is 2.33%, and 2) ashift=12 raidz overhead is 7.04%.
>
> In order to preserve space, I've put the zdb listings for both the ashift=9 and ashift=12 raidz2 pools here:
> http://pastebin.com/v2xjZkNw
>
> There are also some differences in the zdb output; for example, "SPA allocated" is higher in the 4K-sector raidz2 pool, which seems interesting, although I don't comprehend the significance of this.
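
For reference, the two overhead figures quoted above can be reproduced straight from the byte counts in the zfs list/df output. A quick check with sh and bc(1), assuming nothing beyond the numbers already in the message (14 data disks of 4000682172416 bytes each as the expected usable space):

  $ expected=$(( 14 * 4000682172416 ))   # 56009550413824 bytes (50.94 TiB)
  $ echo "scale=6; ($expected - 54560512935936) / $expected * 100" | bc   # ashift=9 pool
  2.587100
  $ echo "scale=6; ($expected - 51495314915328) / $expected * 100" | bc   # ashift=12 pool
  8.059700

That agrees with the 2.59% and 8.06% figures above, and the gap between the two pools, 54560512935936 - 51495314915328 = 3065198020608 bytes, is the roughly 2.79 TiB difference mentioned at the top of the message.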
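
One small note on the pool-creation step quoted above: only da2p1 is wrapped in a gnop provider there, which (as the zdb output confirms) is already enough to push the whole vdev to ashift=12. If you would rather have every member advertise 4K sectors while the pool is being created, a loop along these lines should work; this is only a sketch, assuming the same da2p1..da17p1 partitions used above:

  # wrap every member partition in a 4K gnop provider, not just da2p1
  for d in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17; do
      sudo gnop create -S 4096 /dev/da${d}p1
  done

  # then build the raidz2 vdev from the .nop devices
  sudo zpool create zav raidz2 \
      da2p1.nop da3p1.nop da4p1.nop da5p1.nop da6p1.nop da7p1.nop \
      da8p1.nop da9p1.nop da10p1.nop da11p1.nop da12p1.nop da13p1.nop \
      da14p1.nop da15p1.nop da16p1.nop da17p1.nop

Either way, the .nop providers are only needed at creation time: after a zpool export zav, gnop destroy of the .nop devices, and zpool import zav, the pool keeps the ashift recorded in its labels.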