Date: Tue, 18 Jan 2022 07:23:45 -0700
From: Alan Somers <asomers@freebsd.org>
To: Rich <rincebrain@gmail.com>
Cc: Florent Rivoire <florent@rivoire.fr>, freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Message-ID: <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG%2BaTiudvK_jp2sQKJQ@mail.gmail.com>
In-Reply-To: <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>
References: <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com> <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>
On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
>
> Compression would have made your life better here, and possibly also
> made it clearer what's going on.
>
> All records in a file are going to be the same size pre-compression -
> so if you set the recordsize to 1M and save a 131.1M file, it's going
> to take up 132M on disk before compression/raidz overhead/whatnot.

Not true.  ZFS will trim the file's tails even without compression
enabled.

> Usually compression saves you from the tail padding actually requiring
> allocation on disk, which is one reason I encourage everyone to at
> least use lz4 (or, if you absolutely cannot for some reason, I guess
> zle should also work for this one case...)
>
> But I would say it's probably the sum of last-record padding across the
> whole dataset, if you don't have compression on.
>
> - Rich
>
> On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
>>
>> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> Why not smaller?
>>
>> Hello,
>>
>> I would like some help to understand how the disk usage evolves when I
>> change the recordsize.
>>
>> I've read several articles/presentations/forums about recordsize in
>> ZFS, and if I try to summarize, I mainly understood that:
>> - recordsize is the "maximum" size of "objects" (i.e. "logical blocks")
>> that zfs will create for both data & metadata; each object is then
>> compressed and allocated to one vdev, split into smaller (ashift-sized)
>> "physical" blocks and written on disks
>> - increasing recordsize is usually good when storing large files that
>> are not modified, because it limits the number of metadata objects
>> (block pointers), which has a positive effect on performance
>> - decreasing recordsize is useful for "database-like" workloads (i.e.
>> small random writes inside existing objects), because it avoids write
>> amplification (read-modify-write of a large object for a small update)
>>
>> Today, I'm trying to observe the effect of increasing recordsize for
>> *my* data (because I'm also considering defining special_small_blocks
>> & using SSDs as "special", but that is not tested nor discussed here,
>> just recordsize).
>> So, I'm doing some benchmarks on my "documents" dataset (details in
>> "notes" below), but the results are really strange to me.
>>
>> When I rsync the same data to a freshly-recreated zpool:
>> A) with recordsize=128K : 226G allocated on disk
>> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
>>
>> I would clearly expect the other way around, because a bigger recordsize
>> generates less metadata and thus smaller disk usage, and there shouldn't
>> be any overhead because 1M is just a maximum and not a forced size to
>> allocate for every object.

That's a common misconception.  The 1M recordsize applies to every newly
created object, and every object must use the same size for all of its
records (except possibly the last one).  But objects created before you
changed the recsize will retain their old recsize, and file tails have a
flexible recsize.

>> I don't mind the increased usage (I can live with a few GB more), but
>> I would like to understand why it happens.

You might be seeing the effects of sparsity.  ZFS is smart enough not to
store file holes (and if any kind of compression is enabled, it will
find long runs of zeroes and turn them into holes).  If your data
contains any holes that are >= 128 kB but < 1 MB, then they can be
stored as holes with a 128 kB recsize but must be stored as long runs of
zeros with a 1 MB recsize.
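(If you want to check whether sparsity is actually in play, here is a
rough, untested sketch.  It assumes FreeBSD's du, where -A reports the
apparent size and plain du reports the allocated size, and it uses your
/bench mountpoint.  With compression off, a file whose apparent size
exceeds its allocated size contains holes.)

# find /bench -type f -exec sh -c '
      for f; do
          alloc=$(du -k "$f" | cut -f1)      # KiB actually allocated on disk
          logical=$(du -Ak "$f" | cut -f1)   # apparent (logical) size in KiB
          if [ "$logical" -gt "$alloc" ]; then
              printf "%s: %sK apparent, %sK allocated\n" "$f" "$logical" "$alloc"
          fi
      done
  ' sh {} +

Once compression is enabled this becomes only a rough signal, because
compressed files also allocate less than their apparent size.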
However, I would suggest that you don't bother.  With a 128 kB recsize,
ZFS has something like a 1000:1 ratio of data to metadata.  In other
words, increasing your recsize can save you at most 0.1% of disk space.
Basically, it doesn't matter.  What it _does_ matter for is the tradeoff
between write amplification and RAM usage.  1000:1 is comparable to the
disk:RAM ratio of many computers, and performance is more sensitive to
metadata access times than to data access times.  So increasing your
recsize can help you keep a greater fraction of your metadata in ARC.
OTOH, as you remarked, increasing your recsize will also increase write
amplification.

So to summarize:
* Adjust compression settings to save disk space.
* Adjust recsize to save RAM.
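For example, for the next run, something like this (just a sketch that
reuses your own commands from below; lz4 is the usual low-overhead
choice, and recordsize only affects files written after it is set):

# zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
# zfs set compression=lz4 bench
# zfs set recordsize=1M bench
# rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench

Re-running the "zfs get all bench | egrep ..." comparison afterwards
would show how much of the ~6G difference compression wins back.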
-Alan

>>
>> I tried to give all the details of my tests below.
>> Did I do something wrong?  Can you explain the increase?
>>
>> Thanks!
>>
>> ===============================================
>> A) 128K
>> ==========
>>
>> # zpool destroy bench
>> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>>
>> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> [...]
>> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> total size is 240,982,439,038  speedup is 1.00
>>
>> # zfs get recordsize bench
>> NAME   PROPERTY    VALUE  SOURCE
>> bench  recordsize  128K   default
>>
>> # zpool list -v bench
>> NAME                                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> bench                                       2.72T   226G  2.50T        -         -     0%     8%  1.00x  ONLINE  -
>> gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -  ONLINE
>>
>> # zfs list bench
>> NAME   USED  AVAIL  REFER  MOUNTPOINT
>> bench  226G  2.41T   226G  /bench
>>
>> # zfs get all bench | egrep "(used|referenced|written)"
>> bench  used                  226G   -
>> bench  referenced            226G   -
>> bench  usedbysnapshots       0B     -
>> bench  usedbydataset         226G   -
>> bench  usedbychildren        1.80M  -
>> bench  usedbyrefreservation  0B     -
>> bench  written               226G   -
>> bench  logicalused           226G   -
>> bench  logicalreferenced     226G   -
>>
>> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>>
>> ===============================================
>> B) 1M
>> ==========
>>
>> # zpool destroy bench
>> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> # zfs set recordsize=1M bench
>>
>> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> [...]
>> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> total size is 240,982,439,038  speedup is 1.00
>>
>> # zfs get recordsize bench
>> NAME   PROPERTY    VALUE  SOURCE
>> bench  recordsize  1M     local
>>
>> # zpool list -v bench
>> NAME                                         SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> bench                                       2.72T   232G  2.49T        -         -     0%     8%  1.00x  ONLINE  -
>> gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -  ONLINE
>>
>> # zfs list bench
>> NAME   USED  AVAIL  REFER  MOUNTPOINT
>> bench  232G  2.41T   232G  /bench
>>
>> # zfs get all bench | egrep "(used|referenced|written)"
>> bench  used                  232G   -
>> bench  referenced            232G   -
>> bench  usedbysnapshots       0B     -
>> bench  usedbydataset         232G   -
>> bench  usedbychildren        1.96M  -
>> bench  usedbyrefreservation  0B     -
>> bench  written               232G   -
>> bench  logicalused           232G   -
>> bench  logicalreferenced     232G   -
>>
>> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>>
>> ===============================================
>> Notes:
>> ==========
>>
>> - the source dataset contains ~50% pictures (raw files and jpg), and
>> also some music, various archived documents, zip, videos
>> - no change on the source dataset while testing (cf. size logged by rsync)
>> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
>> got the same results
>> - probably not important here, but:
>> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> on another zpool that I never tweaked except ashift=12 (because it uses
>> the same model of Red 3TB)
>>
>> # zfs --version
>> zfs-2.0.6-1
>> zfs-kmod-v2021120100-zfs_a8c7652
>>
>> # uname -a
>> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
>> 75566f060d4(HEAD) TRUENAS amd64