Date:      Tue, 18 Jan 2022 17:07:48 +0100
From:      Florent Rivoire <florent@rivoire.fr>
To:        Alan Somers <asomers@freebsd.org>
Cc:        Rich <rincebrain@gmail.com>, freebsd-fs <freebsd-fs@freebsd.org>
Subject:   Re: [zfs] recordsize: unexpected increase of disk usage when increasing it
Message-ID:  <CADzRhsF+s1NHudToY0J7Wn90D8gwaM16Ym43XXopoaWVQGS8CA@mail.gmail.com>
In-Reply-To: <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG+aTiudvK_jp2sQKJQ@mail.gmail.com>
References:  <CADzRhsEsZMGE-SoeWLMG9NTtkwhhy6OGQQ046m9AxGFbp5h_kQ@mail.gmail.com> <CAOeNLuopaY3j7P030KO4LMwU3BOU5tXiu6gRsSKsDrFEuGKuaA@mail.gmail.com> <CAOtMX2h=miZt=6__oAhPVzsK9ReShy6nG+aTiudvK_jp2sQKJQ@mail.gmail.com>

On Tue, Jan 18, 2022 at 3:23 PM Alan Somers <asomers@freebsd.org> wrote:
> However, I would suggest that you don't bother.  With a 128kB recsize,
> ZFS has something like a 1000:1 ratio of data:metadata.  In other
> words, increasing your recsize can save you at most 0.1% of disk
> space.  Basically, it doesn't matter.  What it _does_ matter for is
> the tradeoff between write amplification and RAM usage.  1000:1 is
> comparable to the disk:ram of many computers.  And performance is more
> sensitive to metadata access times than data access times.  So
> increasing your recsize can help you keep a greater fraction of your
> metadata in ARC.  OTOH, as you remarked increasing your recsize will
> also increase write amplification.

In the attached zdb files (for 128K recordsize), we can see that the
"L0 ZFS plain file" objects account for 99.89% of the space in my test
zpool, so the data:metadata ratio in my case is indeed roughly 1000:1,
as you said.
I had that rule of thumb in mind, but thanks for reminding me!
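
(For what it's worth, that per-object-type breakdown comes from zdb's
block statistics; on the test pool I ran something like the command
below, where "testpool" is just a placeholder name and each extra -b
adds more detail:

    zdb -bb testpool
)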

As quickly mentioned in my first email, the context is that I'm
considering using a mirror of SSDs as a "special" vdev for a new zpool
which will still be mainly made of magnetic HDDs (a raidz2 of 5x3TB).
To really take advantage of those SSDs, I'm probably going to set
special_small_blocks to a value > 0. Of course, I don't want to put all
the data on the SSDs, so I need to keep "special_small_blocks" strictly
below "recordsize", so that ZFS splits the blocks into two groups: the
"small" ones go to the special vdev, and the rest go to the main vdevs.

So basically, I see two kinds of solutions:
1) keep the default recordsize (128K) and define a fairly small
special_small_blocks (64K or 32K)
2) increase recordsize (probably to 1M), which allows defining a higher
special_small_blocks (maybe 128K, 256K or 512K; I'm still testing)
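
In terms of properties, the two options would look roughly like this
(the dataset name is made up):

    # option 1: default recordsize, small special_small_blocks
    zfs set special_small_blocks=64K tank/data

    # option 2: larger recordsize, larger special_small_blocks
    zfs set recordsize=1M tank/data
    zfs set special_small_blocks=512K tank/data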

I'm leaning towards the 2nd option, because:
- it puts a higher percentage of the data on the SSDs (solution 1 only
stores ~0.1% of the data on the SSDs, vs ~0.3% for solution 2 with 512K
special_small_blocks, and my SSDs will have about 3% of the HDD vdev's
capacity)
- the "high" recordsize of 1M suits my use case (files usually between
a few MBs and hundreds of MBs, written sequentially and never
overwritten, so there is no real risk of write amplification on this
dataset)
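
To check how much data a given special_small_blocks threshold would
actually redirect, I've been looking at the block-size histogram that
zdb can print at higher verbosity, e.g. (pool name is a placeholder,
and I'm not 100% sure of the exact number of -b flags needed for the
histogram):

    zdb -bbb tank

The cumulative psize/asize columns give a rough idea of how much data
lives in blocks at or below each size.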

So my goal is not to optimize disk usage (indeed, 0.1% is nothing),
but to optimize read/write performance at a small cost (using small
SSDs as a special vdev).
And it's also more about doing some tech experiments with ZFS (because
it's fun, I'm learning and it's my home NAS) than about a real need :)

--
Florent


