Date:      Fri, 28 Aug 2015 12:27:22 -0400
From:      "Chad J. Milios" <milios@ccsys.com>
To:        Tenzin Lhakhang <tenzin.lhakhang@gmail.com>
Cc:        freebsd-fs@freebsd.org, "freebsd-virtualization@freebsd.org" <freebsd-virtualization@freebsd.org>
Subject:   Re: Options for zfs inside a VM backed by zfs on the host
Message-ID:  <8DB91B3A-44DC-4650-9E90-56F7DE2ABC42@ccsys.com>
In-Reply-To: <CALcn87yArcBs0ybrZBBxaxDU0y6s=wM8di0RmaSCJCgOjUHq9w@mail.gmail.com>
References:  <CALd%2BdcfJ%2BT-f5gk_pim39BSF7nhBqHC3ab7dXgW8fH43VvvhvA@mail.gmail.com> <20150827061044.GA10221@blazingdot.com> <20150827062015.GA10272@blazingdot.com> <1a6745e27d184bb99eca7fdbdc90c8b5@SERVER.ad.usd-group.com> <55DF46F5.4070406@redbarn.org> <453A5A6F-E347-41AE-8CBC-9E0F4DA49D38@ccsys.com> <CALcn87yArcBs0ybrZBBxaxDU0y6s=wM8di0RmaSCJCgOjUHq9w@mail.gmail.com>


> On Aug 27, 2015, at 7:47 PM, Tenzin Lhakhang <tenzin.lhakhang@gmail.com> wrote:
>
> On Thu, Aug 27, 2015 at 3:53 PM, Chad J. Milios <milios@ccsys.com> wrote:
>
> Whether we are talking ffs, ntfs or zpool atop zvol, unfortunately there
> are really no simple answers. You must consider your use case, the host
> and vm hardware/software configuration, perform meaningful benchmarks
> and, if you care about data integrity, thorough tests of the likely
> failure modes (all far more easily said than done). I'm curious to hear
> more about your use case(s) and setups so as to offer better insight on
> what alternatives may make more/less sense for you. Performance needs?
> Are you striving for lower individual latency or higher combined
> throughput? How critical are integrity and availability? How do you
> prefer your backup routine? Do you handle that in guest or host? Want
> features like dedup and/or L2ARC up in the mix? (Then everything bears
> reconsideration, just about triple your research and testing efforts.)
>
> Sorry, I'm really not trying to scare anyone away from ZFS. It is
> awesome and capable of providing amazing solutions with very reliable
> and sensible behavior if handled with due respect, fear, monitoring and
> upkeep. :)
>
> There are cases to be made for caching [meta-]data in the child, in the
> parent, checksumming in the child/parent/both, compressing in the
> child/parent. I believe `gstat` along with your custom-made benchmark or
> test load will greatly help guide you.
>
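
To put a concrete command to that: something as simple as the following,
left running while your benchmark or test load executes, shows busy%,
latency and IOPS per GEOM provider, zvols included. (The filter regex is
only an example; adjust it to your own device names.)

    gstat -d -I 1s -f 'zvol|^(a?da|nvd)[0-9]'
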
> ZFS on ZFS seems to be a hardly studied, seldom reported, never
> documented, tedious exercise. Prepare for accelerated greying and
> balding of your hair. The parent's volblocksize, the child's ashift,
> alignment, and interactions involving raidz stripes (if used) can lead
> to problems ranging from slightly decreased performance and storage
> efficiency to pathological write amplification within ZFS, with
> performance and responsiveness crashing and sinking to the bottom of
> the ocean. Some datasets can become veritable black holes to vfs system
> calls. You may see ZFS reporting elusive errors, deadlocking or
> panicking in the child or parent altogether. With diligence though,
> stable and performant setups can be discovered for many production
> situations.
>
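
To make that concrete, the two knobs in question look something like this
(pool and dataset names are invented; volblocksize can only be chosen at
creation time):

    # host: pick the zvol block size deliberately rather than taking the default
    zfs create -V 100G -o volblocksize=8K tank/vols/guest0-disk0

    # guest: after building the child pool, check what ashift ZFS chose
    zdb -C childpool | grep ashift

How those two numbers line up, and how they interact with any raidz stripe
width on the parent, is exactly where the write amplification creeps in.
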
> For example, for a zpool (whether used by a VM or not; locally, thru
> iscsi, ggate[cd], or whatever) atop a zvol which sits on a parent zpool
> with no redundancy, I would set primarycache=metadata checksum=off
> compression=off for the zvol(s) on the host(s) and, for the most part,
> just use the same zpool settings and sysctl tunings in the VM (or child
> zpool, whatever role it may conduct) that I would otherwise use on bare
> cpu and bare drives (defaults + compression=lz4 atime=off). However,
> that simple case is likely not yours.
>
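
In command form that boils down to something like this (the pool/dataset
names are invented for the example; note zfs set takes one property at a
time):

    # host: the zvol backing a guest that runs its own zpool
    zfs set primarycache=metadata tank/vols/guest0-disk0
    zfs set checksum=off tank/vols/guest0-disk0
    zfs set compression=off tank/vols/guest0-disk0

    # guest: the child pool gets roughly my usual bare-metal settings
    zfs set compression=lz4 childpool
    zfs set atime=off childpool
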
> With ufs/ffs/ntfs/ext4 and most other filesystems atop a zvol I use
> checksums on the parent zvol, and compression too if the child doesn't
> support it (as ntfs can), but still caching only metadata on the host
> and letting the child vm/fs cache real data.
>
> My use case involves charging customers for their memory use, so
> admittedly that is one motivating factor, LOL. Plus, I certainly don't
> want one rude VM marching through host ARC, unfairly evacuating and
> starving the other polite neighbors.
>
> A VM's swap space becomes another consideration and I treat it like any
> other 'dumb' filesystem, with compression and checksumming done by the
> parent. But recent versions of many operating systems may be paging out
> only already-compressed data, so investigate your guest OS. I've found
> lz4's claims of an almost-no-penalty early-abort to be vastly overstated
> when dealing with zvols, small block sizes and high throughput, so if
> you can be certain you'll be dealing with only compressed data then turn
> compression off. For the virtual memory pagers in most current-day OSes,
> though, set compression on the swap's backing zvol to lz4.
>
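
For the swap case that means creating the backing zvol with roughly these
properties (size and names are only an example; checksum stays at its
default of on, per the above):

    zfs create -V 2G -o compression=lz4 -o primarycache=metadata tank/vols/guest0-swap
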
> Another factor is the ZIL. One VM can hoard your synchronous write
> performance. Solutions are beyond the scope of this already-too-long
> email :) but I'd be happy to elaborate if queried.
>
> And then there's always netbooting guests from NFS mounts served by the
> host and giving the guest no virtual disks at all; don't forget to
> consider that option.
>
> Hope this provokes some fruitful ideas for you. Glad to philosophize
> about ZFS setups with y'all :)
>
> -chad

> That was a really awesome read!  The idea of caching only metadata on
> the backend zpool and caching data in the VM was interesting; I will
> give that a try. Could you please elaborate more on ZILs and synchronous
> writes by VMs? That seems like a great topic.

> I am right now exploring the question: are SSD ZILs necessary in an
> all-SSD pool? And then there is the question of NVMe SSD ZILs on top of
> an all-SSD pool. My guess at the moment is that SSD ZILs are not
> necessary at all in an SSD pool during intensive IO. I've been told
> that ZILs are always there to help you, but when your pool's aggregate
> IOPS is greater than that of a single ZIL device, it doesn't seem to
> make sense. Or is it about the latency of writing to a single disk vs
> striping across your "fast" vdevs?
>
> Thanks,
> Tenzin

Well, the ZIL (ZFS Intent Log) is basically an absolute necessity. Without
it, a call to fsync() could take over 10 seconds on a system serving a
relatively light load. HOWEVER, a source of confusion is the terminology
people often throw around. See, the ZIL is basically a concept, a method, a
procedure. It is not a device. A 'SLOG' is what most people mean when they
say ZIL. That is a Separate LOG device. (The ZFS 'log' vdev type, documented
in man 8 zpool.) When you aren't using a SLOG device, your ZIL is
transparently allocated by ZFS as roughly a little chunk of space reserved
near the "middle" of the main pool (at least ZFS attempts to locate it there
physically, though on SSDs or SMR HDs there's no way to and no point to),
unless you've gone out of your way to deliberately disable the ZIL entirely.
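
Attaching a SLOG to an existing pool is a one-liner. For example (pool and
device names here are invented; mirroring the log is optional but prudent):

    zpool add tank log mirror nvd0p2 nvd1p2
    zpool status tank    # a 'logs' section now shows up among the pool's vdevs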

The other confusion often surrounding the ZIL is when it gets used. Most
writes (in the world) would bypass the ZIL (built-in or SLOG) entirely
anyway because they are asynchronous writes, not synchronous ones. Only the
latter are candidates to clog a ZIL bottleneck. You will need to consider
your workload specifically to know whether a SLOG will help, and if so, how
much SLOG performance is required to not put a damper on the pool's overall
throughput capability. Conversely, you want to know how much SLOG
performance is overkill, because NVMe and SLC SSDs are freaking expensive.
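
One quick way to get a feel for that is to watch the log vdevs (or the whole
pool, if you have no SLOG yet) while your real workload runs, e.g.:

    zpool iostat -v tank 1

If the log devices show barely any write ops, your workload just isn't very
sync-heavy and money spent on a faster SLOG is wasted; if they are seeing a
constant stream of writes, that's where your sync traffic is going.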

Now, for many on the list this is going to be elementary information, so I
apologize, but I come across this question all the time: sync vs async
writes. I'm sure there are many who might find this informative, and with
ZFS the difference becomes more profound and important than with most other
filesystems.

See, ZFS is always bundling up batches of writes into transaction groups
(TXGs). Without extraneous detail, it can be understood that these basically
happen every 5 seconds (sysctl vfs.zfs.txg.timeout). So picture that ZFS
typically has two TXGs it's worried about at any given time: one is being
filled in memory while the previous one is being flushed out to physical
disk.
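
That knob is easy to inspect (and to tune, though the default is sane for
most workloads):

    sysctl vfs.zfs.txg.timeout
    vfs.zfs.txg.timeout: 5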

So when you write something asynchronously, the operating system is going
to say 'aye aye captain' and send you along your merry way very quickly,
but if you lose power or crash and then reboot, ZFS only guarantees you a
CONSISTENT state, not your most recent state. Your pool may come back
online and you've lost 5-15 seconds worth of work. For your typical desktop
or workstation workload that's probably no big deal. You lost 15 seconds of
effort; you repeat it and continue about your business.

However, imagine a mail server that received many, many emails in just that
short time and has told all the senders of all those messages "got it,
thumbs up". You cannot redact those assurances you handed out. You have no
idea who to contact to ask to repeat themselves. Even if you did, it's
likely the sending mail servers have long since forgotten about those
particular messages. So, with each message you receive, after you tell the
operating system to write the data you issue a call to fsync(new_message),
and only after that call returns do you give the sender the thumbs up to
forget the message and leave it in your capable hands to deliver it to its
destination. Thanks to the ZIL, fsync() will typically return in
milliseconds or less instead of the many seconds it could take for that
write in a bundled TXG to end up physically saved. In an ideal world, the
ZIL gets written to and never read again, the data just becoming stale and
getting overwritten. (The data stays in the in-memory TXG, so it's
redundant in the ZIL once that TXG completes flushing.)
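
The per-dataset 'sync' property is the knob that governs all of this, by the
way, and it's handy for experiments (the dataset name here is hypothetical):

    zfs get sync tank/mail
    zfs set sync=standard tank/mail    # default: honor fsync()/O_SYNC via the ZIL
    zfs set sync=always tank/mail      # treat every write as synchronous
    zfs set sync=disabled tank/mail    # ignore fsync(); fine for scratch data ONLY

Comparing your workload with sync=standard versus sync=disabled on a
throwaway dataset tells you very quickly how much you could gain from a
faster SLOG, without buying anything.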

The email server is the typical example of the use of fsync, but there are
thousands of others. Typically, applications using central databases are
written in a simplistic way that assumes the database is trustworthy, and
fsync is how the database attempts to fulfill that requirement.

To complicate matters, consider VMs, particularly uncooperative, impolite,
selfish VMs. Synchronous write iops are a particularly scarce and expensive
resource, one which hasn't been increasing as quickly and cheaply as, say,
io bandwidth, cpu speeds or memory capacities. To make it worse, the iops
numbers most SSD makers advertise on their so-called spec sheets are
untrustworthy; they have no standard benchmark or enforcement ("The PS in
IOPS stands for Per Second, so we ran our benchmark on a fresh drive for
one second and got 100,000 IOPS." Well, good for you; that is useless to
me. Tell me what you can sustain all day long, a year down the road.) and
they're seldom accountable to anybody not buying 10,000 units. All this
consolidation of VMs/containers/jails can really stress the sync i/o
capability of even the biggest, baddest servers.

And FreeBSD, in all its glory, is not yet very well suited to the problem
of multi-tenancy. (It's great if all jails and VMs on a server are owned
and controlled by one stakeholder who can coordinate their friendly
coexistence.) My firm develops and supports a proprietary shim into ZFS and
jails for enforcing the polite sharing of bandwidth, total iops and sync
iops, which can be applied to groups whose membership is any arbitrary set
of ZFS datasets. So there, that's my shameless plug, LOL. However, there
are brighter minds than I working on this problem, and I'm hoping to some
day either participate in more general development of such facilities for
broader application in mainline FreeBSD, or perhaps open source my own
work eventually. (I guess I'm being more shy than selfish with it, LOL.)

Hope that's food for thought for some of you.

-chad


