FreeBSD Mail Archives

Date:      Tue, 18 Jan 2022 08:00:08 -0700
From:      Alan Somers <asomers@freebsd.org>
To:        Rich <rincebrain@gmail.com>
Cc:        Florent Rivoire <florent@rivoire.fr>, freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Message-ID:  <CAOtMX2g-0rkYz7Q%2BKO=W49OdF5_GnV%2B-VW6Rb5Eb4LokvaPUpA@mail.gmail.com>
In-Reply-To: <CAOeNLuppbdRbC-bsDEqKKUBMO8KKvaLpVs-OcSA2AF2tO5b03w@mail.gmail.com>
References:  <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com> <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com> <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG%2BaTiudvK_jp2sQKJQ@mail.gmail.com> <CAOeNLuoQLgKn673FVotxdoDC3HBr1_j%2BzY0t9-uVj7N%2BFkoe1Q@mail.gmail.com> <CAOtMX2g4KduvFA6W062m93jnrJcjQ9KSzkXjb42F1nvhPWaZsw@mail.gmail.com> <CAOeNLuppbdRbC-bsDEqKKUBMO8KKvaLpVs-OcSA2AF2tO5b03w@mail.gmail.com>

That's not what I get.  Is your pool formatted using a very old
version or something?

somers@fbsd-head /u/h/somers [1]>
dd if=/dev/random bs=1179648 of=/testpool/food/t/richfile count=1
1+0 records in
1+0 records out
1179648 bytes transferred in 0.003782 secs (311906705 bytes/sec)
somers@fbsd-head /u/h/somers> du -sh  /testpool/food/t/richfile
1.1M    /testpool/food/t/richfile
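
The two du readings in this thread (1.1M here, 2.1M in Rich's test) are about what you get from the two competing assumptions about the last record of a 1179648-byte file at recordsize=1M. The sketch below is only arithmetic, not ZFS code: the function name and the trim_tail switch are made up for illustration, it ignores metadata and compression, and it only models files that span more than one record.

```python
def allocated_bytes(file_size, recordsize, ashift=12, trim_tail=True):
    """Estimate data-block allocation for one uncompressed multi-record file."""
    if file_size <= recordsize:
        raise ValueError("this sketch only models files larger than one record")
    sector = 1 << ashift
    full_records, tail = divmod(file_size, recordsize)
    if tail == 0:
        return full_records * recordsize
    if trim_tail:
        # Hypothesis 1: the tail is rounded up to whole sectors only.
        tail_alloc = -(-tail // sector) * sector
    else:
        # Hypothesis 2: the tail occupies a full record.
        tail_alloc = recordsize
    return full_records * recordsize + tail_alloc


MIB = 1 << 20
size = 1179648  # 1 MiB + 128 KiB, the file written with dd above

trimmed = allocated_bytes(size, MIB, trim_tail=True)
padded = allocated_bytes(size, MIB, trim_tail=False)
print(f"tail trimmed: {trimmed} bytes ({trimmed / MIB:.3f} MiB)")
print(f"tail padded:  {padded} bytes ({padded / MIB:.3f} MiB)")
```

Which of the two a given pool actually does is exactly what is being argued here, so treat the numbers as a way to read the du output rather than as a statement about ZFS internals.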

On Tue, Jan 18, 2022 at 7:51 AM Rich <rincebrain@gmail.com> wrote:
>
> 2.1M    /workspace/test1M/1
>
> - Rich
>
> On Tue, Jan 18, 2022 at 9:47 AM Alan Somers <asomers@freebsd.org> wrote:
>>
>> Yeah, it does.  Just check "du -sh <FILENAME>".  zdb there is showing
>> you the logical size of the record, but it isn't showing how many disk
>> blocks are actually allocated.
>>
>> On Tue, Jan 18, 2022 at 7:30 AM Rich <rincebrain@gmail.com> wrote:
>> >
>> > Really? I didn't know it would still trim the tails on files with compression off.
>> >
>> > ...
>> >
>> >         size    1179648
>> >         parent  34
>> >         links   1
>> >         pflags  40800000004
>> > Indirect blocks:
>> >                0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
>> >                0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
>> >           100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3
>> >
>> > It seems not?
>> >
>> > - Rich
>> >
>> >
>> > On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:
>> >>
>> >> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
>> >> >
>> >> > Compression would have made your life better here, and possibly also made it clearer what's going on.
>> >> >
>> >> > All records in a file are going to be the same size pre-compression - so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.
>> >>
>> >> Not true.  ZFS will trim the file's tails even without compression enabled.
>> >>
>> >> >
>> >> > Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
>> >> >
>> >> > But I would say it's probably the sum of last record padding across the whole dataset, if you don't have compression on.
>> >> >
>> >> > - Rich
>> >> >
>> >> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
>> >> >>
>> >> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> >> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> >> >> Why not smaller ?
>> >> >>
>> >> >>
>> >> >> Hello,
>> >> >>
>> >> >> I would like some help to understand how the disk usage evolves when I
>> >> >> change the recordsize.
>> >> >>
>> >> >> I've read several articles/presentations/forums about recordsize in
>> >> >> ZFS, and if I try to summarize, I mainly understood that:
>> >> >> - recordsize is the "maximum" size of "objects" (so "logical blocks")
>> >> >> that zfs will create for both data & metadata, then each object is
>> >> >> compressed and allocated to one vdev, split into smaller (ashift
>> >> >> size) "physical" blocks and written on disks
>> >> >> - increasing recordsize is usually good when storing large files that
>> >> >> are not modified, because it limits the number of metadata objects
>> >> >> (block-pointers), which has a positive effect on performance
>> >> >> - decreasing recordsize is useful for "databases-like" workloads (ie:
>> >> >> small random writes inside existing objects), because it avoids write
>> >> >> amplification (read-modify-write a large object for a small update)
>> >> >>
>> >> >> Today, I'm trying to observe the effect of increasing recordsize for
>> >> >> *my* data (because I'm also considering defining special_small_blocks
>> >> >> & using SSDs as "special", but not tested nor discussed here, just
>> >> >> recordsize).
>> >> >> So, I'm doing some benchmarks on my "documents" dataset (details in
>> >> >> "notes" below), but the results are really strange to me.
>> >> >>
>> >> >> When I rsync the same data to a freshly-recreated zpool:
>> >> >> A) with recordsize=128K : 226G allocated on disk
>> >> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
>> >> >>
>> >> >> I would clearly expect the other way around, because bigger recordsize
>> >> >> generates less metadata so smaller disk usage, and there shouldn't be
>> >> >> any overhead because 1M is just a maximum and not a forced size to
>> >> >> allocate for every object.
>> >>
>> >> A common misconception.  The 1M recordsize applies to every newly
>> >> created object, and every object must use the same size for all of its
>> >> records (except possibly the last one).  But objects created before
>> >> you changed the recsize will retain their old recsize, and file tails
>> >> have a flexible recsize.
>> >>
>> >> >> I don't mind the increased usage (I can live with a few GB more), but
>> >> >> I would like to understand why it happens.
>> >>
>> >> You might be seeing the effects of sparsity.  ZFS is smart enough not
>> >> to store file holes (and if any kind of compression is enabled, it
>> >> will find long runs of zeroes and turn them into holes).  If your data
>> >> contains any holes that are >= 128 kB but < 1MB, then they can be
>> >> stored as holes with a 128 kB recsize but must be stored as long runs
>> >> of zeros with a 1MB recsize.
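
The sparsity argument above can be made concrete with a little arithmetic. This is a rough sketch, assuming that a record can be left unallocated only when the hole covers it completely; the function name and the example hole (512 KiB starting at offset 256 KiB) are hypothetical and not taken from the thread's data.

```python
def hole_records(hole_start, hole_len, recordsize):
    """Count records lying entirely inside [hole_start, hole_start + hole_len)."""
    first = -(-hole_start // recordsize)          # first record boundary at or after the start
    last = (hole_start + hole_len) // recordsize  # last record boundary at or before the end
    return max(0, last - first)


KIB = 1 << 10
hole_start = 256 * KIB  # hypothetical hole position
hole_len = 512 * KIB    # hypothetical hole length, between 128 kB and 1 MB

small = hole_records(hole_start, hole_len, 128 * KIB)
large = hole_records(hole_start, hole_len, 1024 * KIB)
print(f"128K recsize: {small} records skipped ({small * 128} KiB not allocated)")
print(f"1M recsize:   {large} records skipped ({large * 1024} KiB not allocated)")
```

With these example numbers the 128K layout can skip four whole records, while the 1M layout has no record that falls entirely inside the hole and so stores the zeros.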
>> >>
>> >> However, I would suggest that you don't bother.  With a 128kB recsize,
>> >> ZFS has something like a 1000:1 ratio of data:metadata.  In other
>> >> words, increasing your recsize can save you at most 0.1% of disk
>> >> space.  Basically, it doesn't matter.  What it _does_ matter for is
>> >> the tradeoff between write amplification and RAM usage.  1000:1 is
>> >> comparable to the disk:ram of many computers.  And performance is more
>> >> sensitive to metadata access times than data access times.  So
>> >> increasing your recsize can help you keep a greater fraction of your
>> >> metadata in ARC.  OTOH, as you remarked increasing your recsize will
>> >> also increase write amplification.
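
One way to arrive at a figure like 1000:1 is to count a single 128-byte block pointer per data record. That reading is an assumption on my part, not something stated in the message, and the sketch ignores dnodes, higher indirect levels, extra metadata copies and compression of indirect blocks.

```python
BLKPTR_BYTES = 128  # size of one ZFS block pointer


def data_to_metadata_ratio(recordsize):
    """Bytes of file data described by each byte of block pointer."""
    return recordsize // BLKPTR_BYTES


r128 = data_to_metadata_ratio(128 * 1024)
r1m = data_to_metadata_ratio(1024 * 1024)

# Fraction of the data size that the larger recordsize can save in block pointers.
saving = 1 / r128 - 1 / r1m

print(f"128K recsize: about {r128}:1 data:metadata")
print(f"1M recsize:   about {r1m}:1 data:metadata")
print(f"best-case space saving: {saving:.3%}")
```

Under these assumptions the saving stays below the 0.1% upper bound quoted above, which is far too small to explain a difference of several GB in either direction.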
>> >>
>> >> So to summarize:
>> >> * Adjust compression settings to save disk space.
>> >> * Adjust recsize to save RAM.
>> >>
>> >> -Alan
>> >>
>> >> >>
>> >> >> I tried to give all the details of my tests below.
>> >> >> Did I do something wrong ? Can you explain the increase ?
>> >> >>
>> >> >> Thanks !
>> >> >>
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> A) 128K
>> >> >> ==========
>> >> >>
>> >> >> # zpool destroy bench
>> >> >> # zpool create -o ashift=12 bench
>> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> >>
>> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> >> [...]
>> >> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> >> >> total size is 240,982,439,038  speedup is 1.00
>> >> >>
>> >> >> # zfs get recordsize bench
>> >> >> NAME   PROPERTY    VALUE    SOURCE
>> >> >> bench  recordsize  128K     default
>> >> >>
>> >> >> # zpool list -v bench
>> >> >> NAME                                           SIZE  ALLOC   FREE
>> >> >> CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> >> >> bench                                         2.72T   226G  2.50T
>> >> >>   -         -     0%     8%  1.00x    ONLINE  -
>> >> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T
>> >> >>   -         -     0%  8.10%      -    ONLINE
>> >> >>
>> >> >> # zfs list bench
>> >> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> >> bench   226G  2.41T      226G  /bench
>> >> >>
>> >> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> >> bench  used                  226G                   -
>> >> >> bench  referenced            226G                   -
>> >> >> bench  usedbysnapshots       0B                     -
>> >> >> bench  usedbydataset         226G                   -
>> >> >> bench  usedbychildren        1.80M                  -
>> >> >> bench  usedbyrefreservation  0B                     -
>> >> >> bench  written               226G                   -
>> >> >> bench  logicalused           226G                   -
>> >> >> bench  logicalreferenced     226G                   -
>> >> >>
>> >> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>> >> >>
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> B) 1M
>> >> >> ==========
>> >> >>
>> >> >> # zpool destroy bench
>> >> >> # zpool create -o ashift=12 bench
>> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> >> # zfs set recordsize=1M bench
>> >> >>
>> >> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> >> [...]
>> >> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> >> >> total size is 240,982,439,038  speedup is 1.00
>> >> >>
>> >> >> # zfs get recordsize bench
>> >> >> NAME   PROPERTY    VALUE    SOURCE
>> >> >> bench  recordsize  1M       local
>> >> >>
>> >> >> # zpool list -v bench
>> >> >> NAME                                           SIZE  ALLOC   FREE
>> >> >> CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> >> >> bench                                         2.72T   232G  2.49T
>> >> >>   -         -     0%     8%  1.00x    ONLINE  -
>> >> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T
>> >> >>   -         -     0%  8.32%      -    ONLINE
>> >> >>
>> >> >> # zfs list bench
>> >> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> >> bench   232G  2.41T      232G  /bench
>> >> >>
>> >> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> >> bench  used                  232G                   -
>> >> >> bench  referenced            232G                   -
>> >> >> bench  usedbysnapshots       0B                     -
>> >> >> bench  usedbydataset         232G                   -
>> >> >> bench  usedbychildren        1.96M                  -
>> >> >> bench  usedbyrefreservation  0B                     -
>> >> >> bench  written               232G                   -
>> >> >> bench  logicalused           232G                   -
>> >> >> bench  logicalreferenced     232G                   -
>> >> >>
>> >> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>> >> >>
>> >> >>
>> >> >>
>> >> >> ===============================================
>> >> >> Notes:
>> >> >> ==========
>> >> >>
>> >> >> - the source dataset contains ~50% of pictures (raw files and jpg),
>> >> >> and also some music, various archived documents, zip, videos
>> >> >> - no change on the source dataset while testing (cf size logged by rsync)
>> >> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
>> >> >> same results
>> >> >> - probably not important here, but:
>> >> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> >> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> >> >> on another zpool that I never tweaked except ashift=12 (because using
>> >> >> the same model of Red 3TB)
>> >> >>
>> >> >> # zfs --version
>> >> >> zfs-2.0.6-1
>> >> >> zfs-kmod-v2021120100-zfs_a8c7652
>> >> >>
>> >> >> # uname -a
>> >> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
>> >> >> 75566f060d4(HEAD) TRUENAS  amd64

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAOtMX2g-0rkYz7Q%2BKO=W49OdF5_GnV%2B-VW6Rb5Eb4LokvaPUpA>
