Date:      Fri, 16 Nov 2012 15:58:27 +1100
From:      Stephen McKay <smckay@internode.on.net>
To:        Eitan Adler <eadler@freebsd.org>
Cc:        FreeBSD FS <freebsd-fs@freebsd.org>, Stephen McKay <smckay@internode.on.net>
Subject:   ZFS FAQ (Was: SSD recommendations for ZFS cache/log)
Message-ID:  <57ac1f$gg70bn@ipmail05.adl6.internode.on.net>
In-Reply-To: <CAF6rxgkh6C0LKXOZa264yZcA3AvQdw7zVAzWKpytfh0%2BKnLOJg@mail.gmail.com> from Eitan Adler at "Thu, 15 Nov 2012 22:54:46 -0500"
References:  <CAFHbX1K-NPuAy5tW0N8=sJD=CU0Q1Pm3ZDkVkE%2BdjpCsD1U8_Q@mail.gmail.com> <57ac1f$gf3rkl@ipmail05.adl6.internode.on.net> <50A31D48.3000700@shatow.net><CAF6rxgkh6C0LKXOZa264yZcA3AvQdw7zVAzWKpytfh0%2BKnLOJg@mail.gmail.com>

On Thursday, 15th November 2012, Eitan Adler wrote:

>Can people here please tell me what is wrong in the following content?

A few things.  I'll intersperse them.

>Is there additional data or questions to add?

The whole ZFS world desperately needs good documentation.  There
are misconceptions everywhere.  There are good tuning hints and
bad (or out of date) ones.  Further, it depends on your target
application whether the defaults are fairly good or plain suck.

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
is one of the places to go, but it's quite Solaris specific.  It's
also starting to get a bit dated.

The existing http://wiki.freebsd.org/ZFS and ZFSTuningGuide are
a mixture of old and recent information and are not a useful intro
to the subject for a system administrator.

As a project, we should have a short, pithy "How to do ZFS right"
document that targets workstations, file servers, database servers
and web servers (as a start).  There are so many things to balance that
it's not obvious what to do.  Simply whacking in a cheap SSD and
a bunch of slow disks is rarely the right answer.

Included in this hypothetical guide would also be the basics of
partitioning using gpart (GPT) with labels and boot partitions
to allow complex setups.  For example, I've set up small servers
where a small slice of each disk becomes a massively mirrored
root pool while the majority of each disk becomes a raidz2 pool.
This makes sense where you need maximum flexibility and can afford
only a few spindles.

>Please note that I've never used ZFS so being whacked by a cluebat
>would be helpful.

My cluebat is small. :-)  I've used ZFS for real for a few different
target applications but by no means have I covered all uses.  I
think we need input from many people to make a useful ZFS FAQ.

>+	  <question id="what-is-zil">
>+	    <para>What is the ZIL and when does it get used?</para>
>+	  </question>
>+
>+	  <answer>
>+	    <para>The <acronym>ZIL</acronym> (<acronym>ZFS</acronym>
>+	      intent log) is a write cache for ZFS.  All writes get
>+	      recorded in the ZIL.  Eventually ZFS will perform a
>+	      <quote>Transaction Group Commit</quote> in which it
>+	      flushes out the data in the ZIL to disk.</para>
>+	  </answer>

The ZIL is not a cache.  It is only used for synchronous writes,
not for all writes.  It is only read during crash recovery.  Its
purpose is data integrity.  Async writes (most writes) are kept in
RAM and bundled into transactions.  Transactions are written to disk
in an atomic fashion.  The ZIL is needed for writes that have been
acknowledged as written but which are not yet on disk as part of a
transaction.

Sync writes will result from fsync() calls, being an NFS server,
most netatalk stuff I've seen, and probably a lot of other stuff.
But crucially, you get none from editing, compiling, playing nethack,
web browsing and many other things.  So you may not need a separate
fast ZIL, which was basically the question that started all this
off. :-)
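
To make the sync/async distinction concrete, here is a minimal sketch in
Python.  The function names are made up for illustration.  The only real
interface involved is fsync(), reached through os.fsync().  On ZFS, the
second function is the kind of caller that I would expect to exercise the
ZIL; the first is not.

```python
import os
import tempfile


def write_async(path, data):
    """Ordinary buffered write: returns once the OS has accepted the data.

    Nothing here asks for the data to be on stable storage, so the
    filesystem is free to keep it in RAM until a later transaction.
    """
    with open(path, "wb") as f:
        f.write(data)


def write_sync(path, data):
    """Write, then block until the OS reports the data as durable.

    The fsync() call is what turns this into a synchronous write.
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's userspace buffer to the OS
        os.fsync(f.fileno())   # ask the OS to commit it to stable storage
```

Both functions leave the same bytes in the file.  The difference is only
in what has been promised to the caller at the moment each one returns.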

So, I guess you need a "Do I need an SSD for ZIL?" question here
somewhere, and a similar "Do I need an SSD for L2ARC?" to go with it.

>+	  <question id="what-is-l2arc">
>+	    <para>What is the L2ARC?</para>
>+	  </question>
>+
>+	  <answer>
>+	    <para>The <acronym>L2ARC</acronym> is a read cache stored
>+	      on a fast device such as an <acronym>SSD</acronym>.  It is
>+	      used to speed up operations such as deduplication or
>+	      encryption.  This cache is not persisent across
>+	      reboots.  Note that RAM is used as the first layer
>+	      of cache and the L2ARC is only needed if there is
>+	      insufficient RAM.</para>
>+	  </answer>
>+	</qandaentry>

The L2ARC is a general read cache which happens to speed up dedup
because the dedup table typically gets very large and will not fit
into ARC.  It's not primarily there to help dedup.  I don't think
it's anything to do with encryption.  Or compression, if that's what
you meant to write.

I wish I could expand on the "you don't need L2ARC if you have enough
RAM" idea, but that's basically true.  L2ARC isn't a free lunch either
as it needs space in the ARC to index it.

So, perversely, a working set that fits perfectly in the ARC will
not fit perfectly any more if an L2ARC is used because part of the
ARC is holding the index, pushing part of the working set into the
L2ARC which is presumably slower than RAM.
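
The arithmetic behind that can be sketched as follows.  This is a rough
model only: it assumes one fixed-size index header in the ARC per record
held in the L2ARC, and the header size used in the example is a
placeholder I picked for illustration, not a measured figure.  The
function names are hypothetical too.

```python
GIB = 1024 ** 3
KIB = 1024


def l2arc_index_bytes(l2arc_bytes, record_bytes, header_bytes):
    """ARC space consumed by indexing a full L2ARC.

    Assumes one header of header_bytes per cached record (a simplification).
    """
    records = l2arc_bytes // record_bytes
    return records * header_bytes


def arc_left_for_data(arc_bytes, l2arc_bytes, record_bytes, header_bytes):
    """ARC space still available for data once the L2ARC index is resident."""
    return arc_bytes - l2arc_index_bytes(l2arc_bytes, record_bytes, header_bytes)


# Illustrative numbers only: 8 GiB ARC, 128 GiB L2ARC, 8 KiB records,
# and a placeholder header size of 200 bytes per record.
index = l2arc_index_bytes(128 * GIB, 8 * KIB, 200)
left = arc_left_for_data(8 * GIB, 128 * GIB, 8 * KIB, 200)
print(index / GIB, left / GIB)
```

Under those assumptions, a working set that exactly filled the ARC before
the L2ARC was added no longer fits, and the overflow has to be served from
the slower device.  Smaller records make the index larger for the same
L2ARC size.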

I think people could still write research papers on this aspect of ZFS.
It's also an area where the defaults seem poorly tuned, at least if
we believe our own ZFSTuningGuide wiki page.

>+	  <question id="should-enable-dedup">
>+	    <para>Is enabling deduplication advisable?</para>
>+	  </question>
>+
>+	  <answer>
>+	    <para>The answer very much depends on the expected workload.
>+	      Deduplication takes up a signifigent amount of RAM and CPU
>+	      time and may slow down read and write disk access times.
>+	      Unless one is storing data that is very heavily
>+	      duplicated (such as virtual machine images, or user
>+	      backups) it is likely that deduplication will do more harm
>+	      than good.  Another consideration is the inability to
>+	      revert deduplication status.  If deduplication is enabled,
>+	      data written, and then dedup is disabled, those blocks
>+	      which were deduplicated will not be duplicated until
>+	      they are next modified.</para>
>+	  </answer>

s/signifigent/significant/

I've got a really short answer to whether or not you should enable dedup
which I give people who ask: No.  I have a longer answer too, but I
think the short answer is better than typing all day. :-)

I like your version, but would be tempted to make it more scary so people
don't discover too late the long term pain dedup causes.  People rarely
expect, for example, that deleting stuff can be slow, but with dedup it
can be glacial, especially if your dedup table doesn't fit in RAM.

Perhaps you should start with the words "Generally speaking, no."

Cheers,

Stephen.


