Date: Tue, 18 Jan 2022 07:47:40 -0700
From: Alan Somers <asomers@freebsd.org>
To: Rich <rincebrain@gmail.com>
Cc: Florent Rivoire <florent@rivoire.fr>, freebsd-fs <freebsd-fs@freebsd.org>
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Message-ID: <CAOtMX2g4KduvFA6W062m93jnrJcjQ9KSzkXjb42F1nvhPWaZsw@mail.gmail.com>
In-Reply-To: <CAOeNLuoQLgKn673FVotxdoDC3HBr1_j+zY0t9-uVj7N+Fkoe1Q@mail.gmail.com>
References: <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com>
 <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>
 <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG+aTiudvK_jp2sQKJQ@mail.gmail.com>
 <CAOeNLuoQLgKn673FVotxdoDC3HBr1_j+zY0t9-uVj7N+Fkoe1Q@mail.gmail.com>
Yeah, it does.  Just check "du -sh <FILENAME>".  zdb there is showing you
the logical size of the record, but it isn't showing how many disk blocks
are actually allocated.

On Tue, Jan 18, 2022 at 7:30 AM Rich <rincebrain@gmail.com> wrote:
>
> Really? I didn't know it would still trim the tails on files with compression off.
>
> ...
>
>         size    1179648
>         parent  34
>         links   1
>         pflags  40800000004
> Indirect blocks:
>        0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
>        0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
>   100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3
>
> It seems not?
>
> - Rich
>
> On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
>> >
>> > Compression would have made your life better here, and possibly also made it clearer what's going on.
>> >
>> > All records in a file are going to be the same size pre-compression - so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.
>>
>> Not true.  ZFS will trim the file's tails even without compression enabled.
>>
>> > Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
>> >
>> > But I would say it's probably the sum of last record padding across the whole dataset, if you don't have compression on.
>> >
>> > - Rich
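If you want to see this for yourself, a quick check along these lines
should do it (a sketch only: the file name is made up, and it assumes a
scratch dataset like the "bench" pool below with recordsize=1M and
compression=off; ZFS object numbers normally match the inode numbers that
"ls -i" prints):

# Write 9 x 128 KiB = 1179648 bytes, the same length as the file in the
# zdb output above: one full 1M record plus a 128 KiB tail.
dd if=/dev/urandom of=/bench/tailtest bs=128k count=9

# Apparent (logical) length vs. blocks actually allocated on disk.
ls -l /bench/tailtest
du -sh /bench/tailtest

# Per-record view, like the dump above.  zdb reads the on-disk state, so
# give the pool a moment (or run sync) before looking.
obj=$(ls -i /bench/tailtest | awk '{print $1}')
zdb -ddddd bench "$obj"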
>> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
>> >>
>> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> >> Why not smaller?
>> >>
>> >> Hello,
>> >>
>> >> I would like some help to understand how the disk usage evolves when I
>> >> change the recordsize.
>> >>
>> >> I've read several articles/presentations/forums about recordsize in
>> >> ZFS, and if I try to summarize, I mainly understood that:
>> >> - recordsize is the "maximum" size of the "objects" (logical blocks)
>> >> that zfs will create for both data & metadata; each object is then
>> >> compressed, allocated to one vdev, split into smaller (ashift-sized)
>> >> "physical" blocks and written to disk
>> >> - increasing recordsize is usually good when storing large files that
>> >> are not modified, because it limits the number of metadata objects
>> >> (block pointers), which has a positive effect on performance
>> >> - decreasing recordsize is useful for database-like workloads (i.e.
>> >> small random writes inside existing objects), because it avoids write
>> >> amplification (read-modify-write of a large object for a small update)
>> >>
>> >> Today, I'm trying to observe the effect of increasing recordsize for
>> >> *my* data (because I'm also considering defining special_small_blocks
>> >> & using SSDs as "special", but that is not tested nor discussed here,
>> >> just recordsize).
>> >> So, I'm doing some benchmarks on my "documents" dataset (details in
>> >> "notes" below), but the results are really strange to me.
>> >>
>> >> When I rsync the same data to a freshly-recreated zpool:
>> >> A) with recordsize=128K : 226G allocated on disk
>> >> B) with recordsize=1M   : 232G allocated on disk => bigger than 128K ?!?
>> >>
>> >> I would clearly expect the other way around, because a bigger recordsize
>> >> generates less metadata and therefore smaller disk usage, and there
>> >> shouldn't be any overhead because 1M is just a maximum, not a forced
>> >> size allocated for every object.
>>
>> A common misconception.  The 1M recordsize applies to every newly
>> created object, and every object must use the same size for all of its
>> records (except possibly the last one).  But objects created before you
>> changed the recsize will retain their old recsize, and file tails have
>> a flexible recsize.
>>
>> >> I don't mind the increased usage (I can live with a few GB more), but
>> >> I would like to understand why it happens.
>>
>> You might be seeing the effects of sparsity.  ZFS is smart enough not
>> to store file holes (and if any kind of compression is enabled, it
>> will find long runs of zeroes and turn them into holes).  If your data
>> contains any holes that are >= 128 kB but < 1 MB, then they can be
>> stored as holes with a 128 kB recsize but must be stored as long runs
>> of zeros with a 1 MB recsize.
>>
>> However, I would suggest that you don't bother.  With a 128 kB recsize,
>> ZFS has something like a 1000:1 ratio of data to metadata.  In other
>> words, increasing your recsize can save you at most 0.1% of disk
>> space.  Basically, it doesn't matter.  What it _does_ matter for is
>> the tradeoff between write amplification and RAM usage.  1000:1 is
>> comparable to the disk:RAM ratio of many computers.  And performance is
>> more sensitive to metadata access times than to data access times, so
>> increasing your recsize can help you keep a greater fraction of your
>> metadata in ARC.  OTOH, as you remarked, increasing your recsize will
>> also increase write amplification.
>>
>> So to summarize:
>> * Adjust compression settings to save disk space.
>> * Adjust recsize to save RAM.
>>
>> -Alan
>>
>> >> I tried to give all the details of my tests below.
>> >> Did I do something wrong?  Can you explain the increase?
>> >>
>> >> Thanks!
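By the way, the 1000:1 figure above presumably comes from each 128 kB
record needing one 128-byte block pointer (so roughly 1024:1, ignoring the
higher indirection levels).  And if you want to check whether sparsity
could explain your 6G difference, a rough scan of the source tree along
these lines would list candidate files (a sketch: it assumes FreeBSD's
stat(1), and note that compression on the source pool would also make the
allocated size smaller than the apparent size, so this only flags
candidates, not proof of holes):

find /mnt/tank/docs-florent -type f | while IFS= read -r f; do
    # st_blocks is counted in 512-byte units; flag files that allocate
    # noticeably less space than their apparent length (likely holes).
    alloc=$(( $(stat -f '%b' "$f") * 512 ))
    size=$(stat -f '%z' "$f")
    if [ "$alloc" -lt $(( size * 9 / 10 )) ]; then
        echo "$f: $size bytes apparent, $alloc bytes allocated"
    fi
done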
>> >>
>> >> ==============================================
>> >> A) 128K
>> >> ==========
>> >>
>> >> # zpool destroy bench
>> >> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >>
>> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> [...]
>> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> >> total size is 240,982,439,038  speedup is 1.00
>> >>
>> >> # zfs get recordsize bench
>> >> NAME   PROPERTY    VALUE   SOURCE
>> >> bench  recordsize  128K    default
>> >>
>> >> # zpool list -v bench
>> >> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> >> bench                                        2.72T   226G  2.50T        -         -     0%     8%  1.00x  ONLINE  -
>> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 2.72T   226G  2.50T        -         -     0%  8.10%      -  ONLINE
>> >>
>> >> # zfs list bench
>> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> bench   226G  2.41T      226G  /bench
>> >>
>> >> # zfs get all bench | egrep "(used|referenced|written)"
>> >> bench  used                  226G   -
>> >> bench  referenced            226G   -
>> >> bench  usedbysnapshots       0B     -
>> >> bench  usedbydataset         226G   -
>> >> bench  usedbychildren        1.80M  -
>> >> bench  usedbyrefreservation  0B     -
>> >> bench  written               226G   -
>> >> bench  logicalused           226G   -
>> >> bench  logicalreferenced     226G   -
>> >>
>> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
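(As an aside, the interesting part of that dump for this question is the
block-statistics table near the end; something like the grep below keeps
the table header plus the file-data and total rows, so the two runs can be
compared side by side once the 1M dump below exists as well.  Exact labels
vary a little between zdb versions.)

# LSIZE is the logical size, PSIZE the size after compression, and ASIZE
# what is actually allocated on disk, including overhead.
grep -E 'ASIZE|ZFS plain file|Total' zpool-bench-rcd128K.zdb
grep -E 'ASIZE|ZFS plain file|Total' zpool-bench-rcd1M.zdb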
>> >>
>> >> ==============================================
>> >> B) 1M
>> >> ==========
>> >>
>> >> # zpool destroy bench
>> >> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> # zfs set recordsize=1M bench
>> >>
>> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> [...]
>> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> >> total size is 240,982,439,038  speedup is 1.00
>> >>
>> >> # zfs get recordsize bench
>> >> NAME   PROPERTY    VALUE   SOURCE
>> >> bench  recordsize  1M      local
>> >>
>> >> # zpool list -v bench
>> >> NAME                                          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
>> >> bench                                        2.72T   232G  2.49T        -         -     0%     8%  1.00x  ONLINE  -
>> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 2.72T   232G  2.49T        -         -     0%  8.32%      -  ONLINE
>> >>
>> >> # zfs list bench
>> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> bench   232G  2.41T      232G  /bench
>> >>
>> >> # zfs get all bench | egrep "(used|referenced|written)"
>> >> bench  used                  232G   -
>> >> bench  referenced            232G   -
>> >> bench  usedbysnapshots       0B     -
>> >> bench  usedbydataset         232G   -
>> >> bench  usedbychildren        1.96M  -
>> >> bench  usedbyrefreservation  0B     -
>> >> bench  written               232G   -
>> >> bench  logicalused           232G   -
>> >> bench  logicalreferenced     232G   -
>> >>
>> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>> >>
>> >> ==============================================
>> >> Notes:
>> >> ==========
>> >>
>> >> - the source dataset contains ~50% pictures (raw files and jpg), and
>> >> also some music, various archived documents, zip files and videos
>> >> - no change on the source dataset while testing (cf. the sizes logged
>> >> by rsync)
>> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M) and
>> >> got the same results
>> >> - probably not important here, but:
>> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> >> on another zpool that I never tweaked except for ashift=12 (because it
>> >> uses the same model of Red 3TB)
>> >>
>> >> # zfs --version
>> >> zfs-2.0.6-1
>> >> zfs-kmod-v2021120100-zfs_a8c7652
>> >>
>> >> # uname -a
>> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11
>> >> 75566f060d4(HEAD) TRUENAS amd64
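If Rich's last-record-padding idea is what is going on here, the numbers
are at least plausible: 232G - 226G is about 6 GiB, and with an average of
roughly half a record (~512K) of padding per file, that corresponds to on
the order of twelve thousand files of 1 MB or more, which seems easy to
reach with 226G of mostly photos.  A rough estimate can be computed from
the source tree (a sketch, assuming FreeBSD stat(1) and that file tails
really are padded out to a full record when compression is off):

find /mnt/tank/docs-florent -type f -exec stat -f '%z' {} + |
awk -v rs=1048576 '
    # Files smaller than one recordsize are stored as a single block sized
    # to the file, so only files of at least 1 MB can leave a padded tail.
    $1 > rs && $1 % rs { pad += rs - $1 % rs }
    END { printf "estimated 1M-recordsize tail padding: %.2f GiB\n", pad / (1024 ^ 3) }'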