From nobody Tue Jan 18 14:47:40 2022
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-fs
List-Help: <mailto:freebsd-fs+help@freebsd.org>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Subscribe: <mailto:freebsd-fs+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-fs+unsubscribe@freebsd.org>
Sender: owner-freebsd-fs@freebsd.org
MIME-Version: 1.0
References: <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com>
 <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com>
 <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG+aTiudvK_jp2sQKJQ@mail.gmail.com> <CAOeNLuoQLgKn673FVotxdoDC3HBr1_j+zY0t9-uVj7N+Fkoe1Q@mail.gmail.com>
In-Reply-To: <CAOeNLuoQLgKn673FVotxdoDC3HBr1_j+zY0t9-uVj7N+Fkoe1Q@mail.gmail.com>
From: Alan Somers <asomers@freebsd.org>
Date: Tue, 18 Jan 2022 07:47:40 -0700
Message-ID: <CAOtMX2g4KduvFA6W062m93jnrJcjQ9KSzkXjb42F1nvhPWaZsw@mail.gmail.com>
Subject: Re: [zfs] recordsize: unexpected increase of disk usage when
 increasing it
To: Rich <rincebrain@gmail.com>
Cc: Florent Rivoire <florent@rivoire.fr>, freebsd-fs <freebsd-fs@freebsd.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Yeah, it does.  Just check "du -sh <FILENAME>".  zdb there is showing
you the logical size of the record, but it isn't showing how many disk
blocks are actually allocated.
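
The same comparison can be scripted. What follows is only a minimal
sketch in Python, assuming a POSIX system where st_blocks is counted in
512-byte units; the helper name apparent_vs_allocated is made up for
illustration and is not part of any ZFS tooling.

```python
import os


def apparent_vs_allocated(path):
    """Return (apparent_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # st_size is the logical file length; st_blocks counts the 512-byte
    # units reported as allocated (the figure du bases its output on).
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        apparent, allocated = apparent_vs_allocated(name)
        print(f"{name}: apparent={apparent} allocated={allocated}")
```

If the allocated figure for a file is noticeably larger than its
apparent size, the tail (or something else) is costing real blocks.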

On Tue, Jan 18, 2022 at 7:30 AM Rich <rincebrain@gmail.com> wrote:
>
> Really? I didn't know it would still trim the tails on files with compression off.
>
> ...
>
>         size    1179648
>         parent  34
>         links   1
>         pflags  40800000004
> Indirect blocks:
>                0 L1  DVA[0]=<3:c02b96c000:1000> DVA[1]=<3:c810733000:1000> [L1 ZFS plain file] skein lz4 unencrypted LE contiguous unique double size=20000L/1000P birth=35675472L/35675472P fill=2 cksum=5cfba24b351a09aa:8bd9dfef87c5b625:906ed5c3252943db:bed77ce51ad540d4
>                0  L0 DVA[0]=<2:a0827db4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=95b06edf60e5f54c:af6f6950775d0863:8fc28b0783fcd9d3:2e44676e48a59360
>           100000  L0 DVA[0]=<2:a0827eb4000:100000> [L0 ZFS plain file] skein uncompressed unencrypted LE contiguous unique single size=100000L/100000P birth=35675472L/35675472P fill=1 cksum=62a1f05769528648:8197c8a05ca9f1fb:a750c690124dd2e0:390bddc4314cd4c3
>
> It seems not?
>
> - Rich
>
>
> On Tue, Jan 18, 2022 at 9:23 AM Alan Somers <asomers@freebsd.org> wrote:
>>
>> On Tue, Jan 18, 2022 at 7:13 AM Rich <rincebrain@gmail.com> wrote:
>> >
>> > Compression would have made your life better here, and possibly also made it clearer what's going on.
>> >
>> > All records in a file are going to be the same size pre-compression - so if you set the recordsize to 1M and save a 131.1M file, it's going to take up 132M on disk before compression/raidz overhead/whatnot.
>>
>> Not true.  ZFS will trim the file's tails even without compression enabled.
>>
>> >
>> > Usually compression saves you from the tail padding actually requiring allocation on disk, which is one reason I encourage everyone to at least use lz4 (or, if you absolutely cannot for some reason, I guess zle should also work for this one case...)
>> >
>> > But I would say it's probably the sum of last record padding across the whole dataset, if you don't have compression on.
>> >
>> > - Rich
>> >
>> > On Tue, Jan 18, 2022 at 8:57 AM Florent Rivoire <florent@rivoire.fr> wrote:
>> >>
>> >> TLDR: I rsync-ed the same data twice: once with 128K recordsize and
>> >> once with 1M, and the allocated size on disk is ~3% bigger with 1M.
>> >> Why not smaller ?
>> >>
>> >>
>> >> Hello,
>> >>
>> >> I would like some help to understand how the disk usage evolves when I
>> >> change the recordsize.
>> >>
>> >> I've read several articles/presentations/forums about recordsize in
>> >> ZFS, and if I try to summarize, I mainly understood that:
>> >> - recordsize is the "maximum" size of "objects" (so "logical blocks")
>> >> that zfs will create for both data & metadata, then each object is
>> >> compressed and allocated to one vdev, split into smaller (ashift
>> >> size) "physical" blocks and written on disks
>> >> - increasing recordsize is usually good when storing large files that
>> >> are not modified, because it limits the number of metadata objects
>> >> (block-pointers), which has a positive effect on performance
>> >> - decreasing recordsize is useful for "databases-like" workloads (ie:
>> >> small random writes inside existing objects), because it avoids write
>> >> amplification (read-modify-write a large object for a small update)
>> >>
>> >> Today, I'm trying to observe the effect of increasing recordsize for
>> >> *my* data (because I'm also considering defining special_small_blocks
>> >> & using SSDs as "special", but not tested nor discussed here, just
>> >> recordsize).
>> >> So, I'm doing some benchmarks on my "documents" dataset (details in
>> >> "notes" below), but the results are really strange to me.
>> >>
>> >> When I rsync the same data to a freshly-recreated zpool:
>> >> A) with recordsize=128K : 226G allocated on disk
>> >> B) with recordsize=1M : 232G allocated on disk => bigger than 128K ?!?
>> >>
>> >> I would clearly expect the other way around, because bigger recordsize
>> >> generates less metadata so smaller disk usage, and there shouldn't be
>> >> any overhead because 1M is just a maximum and not a forced size to
>> >> allocate for every object.
>>
>> A common misconception.  The 1M recordsize applies to every newly
>> created object, and every object must use the same size for all of its
>> records (except possibly the last one).  But objects created before
>> you changed the recsize will retain their old recsize, and file tails
>> have a flexible recsize.
>>
>> >> I don't mind the increased usage (I can live with a few GB more), but
>> >> I would like to understand why it happens.
>>
>> You might be seeing the effects of sparsity.  ZFS is smart enough not
>> to store file holes (and if any kind of compression is enabled, it
>> will find long runs of zeroes and turn them into holes).  If your data
>> contains any holes that are >= 128 kB but < 1MB, then they can be
>> stored as holes with a 128 kB recsize but must be stored as long runs
>> of zeros with a 1MB recsize.
>>
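
To make the hole argument concrete, here is a rough sketch in Python.
It assumes, as described above, that a hole can only be skipped when it
covers an entire record; the function name and the example offsets are
hypothetical and chosen purely for illustration.

```python
def records_inside_hole(hole_start, hole_len, recsize):
    """Count the records of size recsize that lie entirely inside the hole."""
    hole_end = hole_start + hole_len
    # First record boundary at or after the start of the hole (ceiling).
    first = -(-hole_start // recsize)
    # Last record boundary at or before the end of the hole (floor).
    last = hole_end // recsize
    return max(0, last - first)


KiB = 1024
MiB = 1024 * KiB

# A hypothetical 512 KiB hole starting 256 KiB into a file.
start, length = 256 * KiB, 512 * KiB
small = records_inside_hole(start, length, 128 * KiB)
large = records_inside_hole(start, length, 1 * MiB)
print(small, large)
```

With 128 kB records, four records fall wholly inside that hole and need
no allocation.  With 1 MB records, none do, so the zeros have to be
written out unless compression is on.
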
>> However, I would suggest that you don't bother.  With a 128kB recsize,
>> ZFS has something like a 1000:1 ratio of data:metadata.  In other
>> words, increasing your recsize can save you at most 0.1% of disk
>> space.  Basically, it doesn't matter.  What it _does_ matter for is
>> the tradeoff between write amplification and RAM usage.  1000:1 is
>> comparable to the disk:ram of many computers.  And performance is more
>> sensitive to metadata access times than data access times.  So
>> increasing your recsize can help you keep a greater fraction of your
>> metadata in ARC.  OTOH, as you remarked, increasing your recsize will
>> also increase write amplification.
>>
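
The 1000:1 figure can be sanity-checked with simple arithmetic.  The
sketch below assumes one 128-byte block pointer per data record and
ignores everything else (higher indirect levels, dnodes, redundant
metadata copies, compression of indirect blocks), so treat the numbers
as rough.

```python
BLKPTR_BYTES = 128  # assumed size of one block pointer, per data record


def data_to_metadata_ratio(recsize_bytes):
    """Bytes of file data described by each byte of block-pointer metadata."""
    return recsize_bytes / BLKPTR_BYTES


for label, recsize in (("128K", 128 * 1024), ("1M", 1024 * 1024)):
    ratio = data_to_metadata_ratio(recsize)
    # Express the metadata as a percentage of the data it points at.
    print(f"recsize={label}: ~{ratio:.0f}:1, metadata ~{100 / ratio:.3f}% of data")
```

Under those assumptions the ratio is 1024:1 at 128K and 8192:1 at 1M,
which is why a larger recsize cannot recover more than roughly 0.1% of
the space.
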
>> So to summarize:
>> * Adjust compression settings to save disk space.
>> * Adjust recsize to save RAM.
>>
>> -Alan
>>
>> >>
>> >> I tried to give all the details of my tests below.
>> >> Did I do something wrong ? Can you explain the increase ?
>> >>
>> >> Thanks !
>> >>
>> >>
>> >>
>> >> ===============================================
>> >> A) 128K
>> >> ==========
>> >>
>> >> # zpool destroy bench
>> >> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >>
>> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> [...]
>> >> sent 241,042,476,154 bytes  received 353,838 bytes  81,806,492.45 bytes/sec
>> >> total size is 240,982,439,038  speedup is 1.00
>> >>
>> >> # zfs get recordsize bench
>> >> NAME   PROPERTY    VALUE    SOURCE
>> >> bench  recordsize  128K     default
>> >>
>> >> # zpool list -v bench
>> >> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> >> bench                                         2.72T   226G  2.50T        -         -     0%     8%  1.00x    ONLINE  -
>> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   226G  2.50T        -         -     0%  8.10%      -    ONLINE
>> >>
>> >> # zfs list bench
>> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> bench   226G  2.41T      226G  /bench
>> >>
>> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> bench  used                  226G                   -
>> >> bench  referenced            226G                   -
>> >> bench  usedbysnapshots       0B                     -
>> >> bench  usedbydataset         226G                   -
>> >> bench  usedbychildren        1.80M                  -
>> >> bench  usedbyrefreservation  0B                     -
>> >> bench  written               226G                   -
>> >> bench  logicalused           226G                   -
>> >> bench  logicalreferenced     226G                   -
>> >>
>> >> # zdb -Lbbbs bench > zpool-bench-rcd128K.zdb
>> >>
>> >>
>> >>
>> >> ===============================================
>> >> B) 1M
>> >> ==========
>> >>
>> >> # zpool destroy bench
>> >> # zpool create -o ashift=12 bench /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4
>> >> # zfs set recordsize=1M bench
>> >>
>> >> # rsync -av --exclude '.zfs' /mnt/tank/docs-florent/ /bench
>> >> [...]
>> >> sent 241,042,476,154 bytes  received 353,830 bytes  80,173,899.88 bytes/sec
>> >> total size is 240,982,439,038  speedup is 1.00
>> >>
>> >> # zfs get recordsize bench
>> >> NAME   PROPERTY    VALUE    SOURCE
>> >> bench  recordsize  1M       local
>> >>
>> >> # zpool list -v bench
>> >> NAME                                           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
>> >> bench                                         2.72T   232G  2.49T        -         -     0%     8%  1.00x    ONLINE  -
>> >>   gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4  2.72T   232G  2.49T        -         -     0%  8.32%      -    ONLINE
>> >>
>> >> # zfs list bench
>> >> NAME    USED  AVAIL     REFER  MOUNTPOINT
>> >> bench   232G  2.41T      232G  /bench
>> >>
>> >> # zfs get all bench |egrep "(used|referenced|written)"
>> >> bench  used                  232G                   -
>> >> bench  referenced            232G                   -
>> >> bench  usedbysnapshots       0B                     -
>> >> bench  usedbydataset         232G                   -
>> >> bench  usedbychildren        1.96M                  -
>> >> bench  usedbyrefreservation  0B                     -
>> >> bench  written               232G                   -
>> >> bench  logicalused           232G                   -
>> >> bench  logicalreferenced     232G                   -
>> >>
>> >> # zdb -Lbbbs bench > zpool-bench-rcd1M.zdb
>> >>
>> >>
>> >>
>> >> ===============================================
>> >> Notes:
>> >> ==========
>> >>
>> >> - the source dataset contains ~50% of pictures (raw files and jpg),
>> >> and also some music, various archived documents, zip, videos
>> >> - no change on the source dataset while testing (cf size logged by rsync)
>> >> - I repeated the tests twice (128K, then 1M, then 128K, then 1M), and
>> >> same results
>> >> - probably not important here, but:
>> >> /dev/gptid/3c0f5cbc-b0ce-11ea-ab91-c8cbb8cc3ad4 is a Red 3TB CMR
>> >> (WD30EFRX), and /mnt/tank/docs-florent/ is a 128K-recordsize dataset
>> >> on another zpool that I never tweaked except ashift=12 (because using
>> >> the same model of Red 3TB)
>> >>
>> >> # zfs --version
>> >> zfs-2.0.6-1
>> >> zfs-kmod-v2021120100-zfs_a8c7652
>> >>
>> >> # uname -a
>> >> FreeBSD xxxxxxxxx 12.2-RELEASE-p11 FreeBSD 12.2-RELEASE-p11 75566f060d4(HEAD) TRUENAS  amd64