Date:      Tue, 18 Jan 2022 07:23:45 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Rich <rincebrain@gmail.com>
Cc:        Florent Rivoire <florent@rivoire.fr>, freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Message-ID:  <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG+aTiudvK_jp2sQKJQ@mail.gmail.com>
In-Reply-To: <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>
References:  <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com> <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>

On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
>
> Compression would have made your life better here, and possibly also made it clearer what's going on.
>
> All records in a file are going to be the same size pre-compression - so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.

Not true.  ZFS will trim the file's tails even without compression enabled.
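
If you want to check that on your own pool, here is a rough sketch (the
dataset name and file size are made up): write a file that is a bit
longer than one record on a compression=off, recordsize=1M dataset and
compare its apparent size with what actually gets allocated:

# zfs create -o recordsize=1M -o compression=off bench/tailtest
# dd if=/dev/random of=/bench/tailtest/f bs=1k count=1127
# sync
# du -Ah /bench/tailtest/f
# du -h /bench/tailtest/f

du -A prints the apparent size (~1.1M here); plain du prints what is
actually allocated.  If the tail really is trimmed, the second number
should be around 1.2M rather than the full 2M that two 1M records would
take.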

>
> Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
>
> But I would say it's probably the sum of last record padding across the whole dataset, if you don't have compression on.
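
(For what it's worth, measuring this is cheap.  Assuming the same
"bench" pool used in the tests below, and remembering that compression
only applies to newly written data:

# zfs set compression=lz4 bench
# <re-run the rsync>
# zfs get compressratio,logicalused,used bench

compressratio and the logicalused/used pair will show how much of the
tail padding and data compression wins back.)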
>
> - Rich
>
> On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
>>
>> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> Why not smaller?
>>
>>
>> Hello,
>>
>> I would like some help to understand how the disk usage evolves when I
>> change the recordsize.
>>
>> I've read several articles/presentations/forums about recordsize in
>> ZFS, and if I try to summarize, I mainly understood that:
>> - recordsize is the "maximum" size of "objects" (so "logical blocks")
>> that zfs will create for both data & metadata, then each object is
>> compressed and allocated to one vdev, split into smaller (ashift
>> size) "physical" blocks and written on disks
>> - increasing recordsize is usually good when storing large files that
>> are not modified, because it limits the number of metadata objects
>> (block-pointers), which has a positive effect on performance
>> - decreasing recordsize is useful for "databases-like" workloads (ie:
>> small random writes inside existing objects), because it avoids write
>> amplification (read-modify-write a large object for a small update)
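
(To put a number on that last point: rewriting 4 KiB in the middle of a
file stored as 1 MiB records means ZFS must read and rewrite a full
1 MiB record, roughly 256x write amplification for that update; with
128 KiB records it is 32x.)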
>>
>> Today, I'm trying to observe the effect of increasing recordsize for
>> *my* data (because I'm also considering defining special_small_blocks
>> & using SSDs as "special", but not tested nor discussed here, just
>> recordsize).
>> So, I'm doing some benchmarks on my "documents" dataset (details in
>> "notes" below), but the results are really strange to me.
>>
>> When I rsync the same data to a freshly-recreated zpool:
>> A) with recordsize=128K: 226G allocated on disk
>> B) with recordsize=1M: 232G allocated on disk => bigger than 128K?!?
>>
>> I would clearly expect the other way around, because bigger recordsize
>> generates less metadata so smaller disk usage, and there shouldn't be
>> any overhead because 1M is just a maximum and not a forced size to
>> allocate for every object.

A common misconception.  The 1M recordsize applies to every newly
created object, and every object must use the same size for all of its
records (except possibly the last one).  But objects created before
you changed the recsize will retain their old recsize, and file tails
have a flexible recsize.
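
A quick way to see which record size a given file is actually using
(hypothetical path; any file larger than one record will do): ZFS
exposes the file's record size as st_blksize, so

# stat -f "%k %N" /bench/some/large/file

should print 131072 for a file written under recordsize=128K and
1048576 for one written under recordsize=1M.  (zdb -dddd <dataset>
<object#> shows the same value as "dblk".)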

>> I don't mind the increased usage (I can live with a few GB more), but
>> I would like to understand why it happens.

You might be seeing the effects of sparsity.  ZFS is smart enough not
to store file holes (and if any kind of compression is enabled, it
will find long runs of zeroes and turn them into holes).  If your data
contains any holes that are >= 128 kB but < 1MB, then they can be
stored as holes with a 128 kB recsize but must be stored as long runs
of zeros with a 1MB recsize.
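
A rough way to check whether sparsity is in play: du -A prints a file's
apparent size while plain du prints the blocks actually allocated, so
on a few of your larger source files something like

# du -Ah /mnt/tank/docs-florent/path/to/big/file
# du -h /mnt/tank/docs-florent/path/to/big/file

will show a large gap if they contain holes (compression on the source
dataset would show up the same way).  If your find(1) has the -sparse
primary, "find /mnt/tank/docs-florent -type f -sparse" will list such
files directly.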

However, I would suggest that you don't bother.  With a 128kB recsize,
ZFS has something like a 1000:1 ratio of data:metadata.  In other
words, increasing your recsize can save you at most 0.1% of disk
space.  Basically, it doesn't matter.  What it _does_ matter for is
the tradeoff between write amplification and RAM usage.  1000:1 is
comparable to the disk:RAM ratio of many computers.  And performance is
more sensitive to metadata access times than data access times.  So
increasing your recsize can help you keep a greater fraction of your
metadata in ARC.  OTOH, as you remarked, increasing your recsize will
also increase write amplification.
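
(Rough numbers, assuming ~128 bytes per block pointer: 226 GiB stored
as 128 KiB records is about 1.85 million records, so a bit over 225 MiB
of block pointers; stored as 1 MiB records it is about 231 thousand
records, or roughly 28 MiB.  The ~200 MiB difference is what the bigger
recsize can save on disk, but it is also metadata that would otherwise
be competing for ARC.)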

So to summarize:
* Adjust compression settings to save disk space.
* Adjust recsize to save RAM.
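
(The zdb -Lbbbs dumps you captured below should make this concrete: the
block statistics at the end break the pool down by block type, so you
can compare the ASIZE of the "ZFS plain file" data blocks against the
indirect blocks and other metadata in the 128K and 1M runs.)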

-Alan

>>
>> I tried to give all the details of my tests below.
>> Did I do something wrong? Can you explain the increase?
>>
>> Thanks!
>>
>>
>>
>> ===============================================
>> A) 128K
>> ==========
>>
>> # zpool destroy bench
>> # zpool create -o ashift=12 bench
>> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>>
>> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> [...]
>> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> total size is 240,982,439,038  speedup is 1.00
>>
>> # zfs get recordsize bench
>> NAME   PROPERTY    VALUE    SOURCE
>> bench  recordsize  128K     default
>>
>> # zpool list -v bench
>> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> bench                                         2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
>>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE
>>
>> # zfs list bench
>> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> bench   226G  2.41T      226G  /bench
>>
>> # zfs get all bench |egrep "(used|referenced|written)"
>> bench  used                  226G                   -
>> bench  referenced            226G                   -
>> bench  usedbysnapshots       0B                     -
>> bench  usedbydataset         226G                   -
>> bench  usedbychildren        1.80M                  -
>> bench  usedbyrefreservation  0B                     -
>> bench  written               226G                   -
>> bench  logicalused           226G                   -
>> bench  logicalreferenced     226G                   -
>>
>> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>>
>>
>>
>> ===============================================
>> B) 1M
>> ==========
>>
>> # zpool destroy bench
>> # zpool create -o ashift=12 bench
>> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> # zfs set recordsize=1M bench
>>
>> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> [...]
>> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> total size is 240,982,439,038  speedup is 1.00
>>
>> # zfs get recordsize bench
>> NAME   PROPERTY    VALUE    SOURCE
>> bench  recordsize  1M       local
>>
>> # zpool list -v bench
>> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> bench                                         2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
>>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE
>>
>> # zfs list bench
>> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> bench   232G  2.41T      232G  /bench
>>
>> # zfs get all bench |egrep "(used|referenced|written)"
>> bench  used                  232G                   -
>> bench  referenced            232G                   -
>> bench  usedbysnapshots       0B                     -
>> bench  usedbydataset         232G                   -
>> bench  usedbychildren        1.96M                  -
>> bench  usedbyrefreservation  0B                     -
>> bench  written               232G                   -
>> bench  logicalused           232G                   -
>> bench  logicalreferenced     232G                   -
>>
>> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>>
>>
>>
>> ===============================================
>> Notes:
>> ==========
>>
>> - the source dataset contains ~50% of pictures (raw files and jpg),
>> and also some music, various archived documents, zip, videos
>> - no change on the source dataset while testing (cf. the size logged by rsync)
>> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
>> got the same results
>> - probably not important here, but:
>> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> on another zpool that I never tweaked except ashift=12 (because it
>> uses the same model of Red 3TB)
>>
>> # zfs --version
>> zfs-2.0.6-1
>> zfs-kmod-v2021120100-zfs_a8c7652
>>
>> # uname -a
>> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
>> 75566f060d4(HEAD) TRUENAS  amd64


