Date: Fri, 28 Aug 2015 12:27:22 -0400
From: "Chad J. Milios" <milios@ccsys.com>
To: Tenzin Lhakhang <tenzin.lhakhang@gmail.com>
Cc: freebsd-fs@freebsd.org, "freebsd-virtualization@freebsd.org" <freebsd-virtualization@freebsd.org>
Subject: Re: Options for zfs inside a VM backed by zfs on the host
Message-ID: <8DB91B3A-44DC-4650-9E90-56F7DE2ABC42@ccsys.com>
In-Reply-To: <CALcn87yArcBs0ybrZBBxaxDU0y6s=wM8di0RmaSCJCgOjUHq9w@mail.gmail.com>
References: <CALd+dcfJ+T-f5gk_pim39BSF7nhBqHC3ab7dXgW8fH43VvvhvA@mail.gmail.com>
 <20150827061044.GA10221@blazingdot.com>
 <20150827062015.GA10272@blazingdot.com>
 <1a6745e27d184bb99eca7fdbdc90c8b5@SERVER.ad.usd-group.com>
 <55DF46F5.4070406@redbarn.org>
 <453A5A6F-E347-41AE-8CBC-9E0F4DA49D38@ccsys.com>
 <CALcn87yArcBs0ybrZBBxaxDU0y6s=wM8di0RmaSCJCgOjUHq9w@mail.gmail.com>
> On Aug 27, 2015, at 7:47 PM, Tenzin Lhakhang <tenzin.lhakhang@gmail.com> wrote:
>
> On Thu, Aug 27, 2015 at 3:53 PM, Chad J. Milios <milios@ccsys.com> wrote:
>
> Whether we are talking ffs, ntfs or zpool atop zvol, unfortunately there are really no simple answers. You must consider your use case, the host and VM hardware/software configuration, perform meaningful benchmarks and, if you care about data integrity, thorough tests of the likely failure modes (all far more easily said than done). I'm curious to hear more about your use case(s) and setups so as to offer better insight on what alternatives may make more or less sense for you. Performance needs? Are you striving for lower individual latency or higher combined throughput? How critical are integrity and availability? How do you prefer your backup routine? Do you handle that in the guest or the host? Want features like dedup and/or L2ARC in the mix? (Then everything bears reconsideration; just about triple your research and testing efforts.)
>
> Sorry, I'm really not trying to scare anyone away from ZFS. It is awesome and capable of providing amazing solutions with very reliable and sensible behavior if handled with due respect, fear, monitoring and upkeep. :)
>
> There are cases to be made for caching [meta-]data in the child, in the parent, checksumming in the child/parent/both, and compressing in the child/parent. I believe `gstat` along with your custom-made benchmark or test load will greatly help guide you.
>
> ZFS on ZFS seems to be a hardly studied, seldom reported, never documented, tedious exercise. Prepare for accelerated greying and balding of your hair. The parent's volblocksize, the child's ashift, alignment, and interactions involving raidz stripes (if used) can lead to problems ranging from slightly decreased performance and storage efficiency to pathological write amplification within ZFS, with performance and responsiveness crashing and sinking to the bottom of the ocean. Some datasets can become veritable black holes to VFS system calls. You may see ZFS reporting elusive errors, deadlocking or panicking in the child or the parent altogether. With diligence, though, stable and performant setups can be discovered for many production situations.
>
> For example, for a zpool (whether used by a VM or not, locally, through iscsi, ggate[cd], or whatever) atop a zvol which sits on a parent zpool with no redundancy, I would set primarycache=metadata checksum=off compression=off for the zvol(s) on the host(s) and for the most part just use the same zpool settings and sysctl tunings in the VM (or child zpool, whatever role it may conduct) that I would otherwise use on bare CPU and bare drives (defaults + compression=lz4 atime=off). However, that simple case is likely not yours.
>
> With ufs/ffs/ntfs/ext4 and most other filesystems atop a zvol I use checksums on the parent zvol, and compression too if the child doesn't support it (as ntfs can), but still caching only metadata on the host and letting the child VM/fs cache real data.
>
> My use case involves charging customers for their memory use, so admittedly that is one motivating factor, LOL. Plus, I certainly don't want one rude VM marching through the host ARC, unfairly evicting and starving the other polite neighbors.
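To put that zvol recipe into concrete commands, it would look roughly like the sketch below. The pool, zvol and guest device names are only placeholders, and volblocksize in particular deserves benchmarking against your guest's workload before you settle on it. Note that primarycache=metadata on the host side is also what keeps the guest's data blocks out of the host ARC in the first place.

    # host side: a sparse zvol to back the guest; cache only metadata and let
    # the child pool do its own checksumming and compression
    zfs create -s -V 100G -o volblocksize=16k tank/vols/guest0
    zfs set primarycache=metadata tank/vols/guest0
    zfs set checksum=off tank/vols/guest0
    zfs set compression=off tank/vols/guest0

    # guest side: roughly the same settings you'd use on bare metal
    zpool create data /dev/vtbd1
    zfs set compression=lz4 data
    zfs set atime=off data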
>
> A VM's swap space becomes another consideration, and I treat it like any other 'dumb' filesystem, with compression and checksumming done by the parent; but recent versions of many operating systems may be paging out only already-compressed data, so investigate your guest OS. I've found lz4's claims of an almost-no-penalty early abort to be vastly overstated when dealing with zvols, small block sizes and high throughput, so if you can be certain you'll be dealing with only compressed data then turn it off. For the virtual memory pagers in most current-day OSes, though, set compression on the swap's backing zvol to lz4.
>
> Another factor is the ZIL. One VM can hoard your synchronous write performance. Solutions are beyond the scope of this already-too-long email :) but I'd be happy to elaborate if queried.
>
> And then there's always netbooting guests from NFS mounts served by the host and giving the guest no virtual disks; don't forget to consider that option.
>
> Hope this provokes some fruitful ideas for you. Glad to philosophize about ZFS setups with y'all :)
>
> -chad
>
> That was a really awesome read! The idea of turning metadata on at the backend zpool and then data on in the VM was interesting; I will give that a try. Can you please elaborate more on the ZIL and synchronous writes by VMs.. that seems like a great topic.
>
> I am right now exploring the question: are SSD ZILs necessary in an all-SSD pool? And then the question of NVMe SSD ZILs on top of an all-SSD pool. My guess at the moment is that SSD ZILs are not necessary at all in an SSD pool during intensive IO. I've been told that ZILs are always there to help you, but when your pool's aggregate IOPS is greater than the ZIL's, it doesn't seem to make sense.. Or is it the latency of writing to a single disk vs striping across your "fast" vdevs?
>
> Thanks,
> Tenzin

Well, the ZIL (ZFS Intent Log) is basically an absolute necessity. Without it, a call to fsync() could take over 10 seconds on a system serving a relatively light load. HOWEVER, a source of confusion is the terminology people often throw around. See, the ZIL is basically a concept, a method, a procedure. It is not a device. A 'SLOG' is what most people mean when they say ZIL. That is a Separate LOG device (the ZFS 'log' vdev type, documented in man 8 zpool). When you aren't using a SLOG device, your ZIL is transparently allocated by ZFS, roughly a little chunk of space reserved near the "middle" of the main pool (at least ZFS attempts to locate it there physically, though on SSDs or SMR HDs there's no way to and no point to), unless you've gone out of your way to deliberately disable the ZIL entirely.

The other confusion often surrounding the ZIL is when it gets used. Most writes (in the world) bypass the ZIL (built-in or SLOG) entirely anyway, because they are asynchronous writes, not synchronous ones. Only the latter are candidates to clog a ZIL bottleneck. You will need to consider your workload specifically to know whether a SLOG will help, and if so, how much SLOG performance is required to not put a damper on the pool's overall throughput capability. Conversely, you want to know how much SLOG performance is overkill, because NVMe and SLC SSDs are freaking expensive.
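Since the thread's question was specifically about dedicated log devices: experimenting with one is fairly low risk, because log vdevs can be added to and removed from a live pool. A rough sketch (pool and device names are only examples; check zpool status for the actual name of the log mirror before removing it):

    # attach a mirrored SLOG so one dying SSD can't eat the last few seconds of sync writes
    zpool add tank log mirror nvd0 nvd1

    # watch whether sync traffic actually lands on it under your real load
    zpool iostat -v tank 5

    # if it isn't earning its keep, take it back out
    zpool remove tank mirror-1

zpool iostat -v breaks out the log vdev's activity separately, which together with gstat is usually enough to tell whether the SLOG or the main vdevs are the bottleneck for a given workload.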
Now, for many on the list this is going to be some elementary information, so I apologize, but I come across this question all the time: sync vs async writes. I'm sure there are many who might find this informative, and with ZFS the difference becomes more profound and important than with most other filesystems.

See, ZFS is always bundling up batches of writes into transaction groups (TXGs). Without extraneous detail, it can be understood that basically these happen every 5 seconds (sysctl vfs.zfs.txg.timeout). So picture that ZFS typically has two TXGs it's worried about at any given time: one is being filled in memory while the previous one is being flushed out to physical disk.

So when you write something asynchronously, the operating system is going to say 'aye aye, captain' and send you along your merry way very quickly, but if you lose power or crash and then reboot, ZFS only guarantees you a CONSISTENT state, not your most recent state. Your pool may come back online and you've lost 5-15 seconds worth of work. For your typical desktop or workstation workload that's probably no big deal. You lost 15 seconds of effort; you repeat it and continue about your business.

However, imagine a mail server that received many, many emails in just that short time and has told all the senders of all those messages "got it, thumbs up". You cannot retract those assurances you handed out. You have no idea whom to contact to ask to repeat themselves. Even if you did, it's likely the sending mail servers have long since forgotten about those particular messages. So, with each message you receive, after you tell the operating system to write the data you issue a call to fsync(new_message), and only after that call returns do you give the sender the thumbs up to forget the message and leave it in your capable hands to deliver it to its destination. Thanks to the ZIL, fsync() will typically return in milliseconds or less instead of the many seconds it could take for that write in a bundled TXG to end up physically saved. In an ideal world, the ZIL gets written to and never read again, the data just becoming stale and getting overwritten. (The data stays in the in-memory TXG, so it's redundant in the ZIL once that TXG completes flushing.)

The email server is the typical example of the use of fsync, but there are thousands of others. Typically, applications using central databases are written in a simplistic way that assumes the database is trustworthy, and fsync is how the database attempts to fulfill that requirement.

To complicate matters, consider VMs, particularly uncooperative, impolite, selfish VMs. Synchronous write IOPS are a particularly scarce and expensive resource which hasn't been increasing as quickly and cheaply as, say, I/O bandwidth, CPU speeds or memory capacities. To make it worse, the IOPS numbers most SSD makers advertise on their so-called spec sheets are untrustworthy; there is no standard benchmark or enforcement ("The PS in IOPS stands for Per Second, so we ran our benchmark on a fresh drive for one second and got 100,000 IOPS." Well, good for you; that is useless to me. Tell me what you can sustain all day long, a year down the road.), and they're seldom accountable to anybody not buying 10,000 units. All this consolidation of VMs/containers/jails can really stress the sync I/O capability of even the biggest, baddest servers.
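If you want to gauge how much of your own workload is actually synchronous before shopping for a SLOG, the sync dataset property makes the difference easy to measure on a scratch dataset. A sketch, with made-up names; sync=disabled is strictly a measurement tool here, never something to leave set on data you care about:

    # TXG sync interval in seconds (the window of async work you could lose)
    sysctl vfs.zfs.txg.timeout

    # run the same benchmark three ways and compare
    zfs create tank/scratch
    zfs set sync=standard tank/scratch   # default: honor fsync()/O_SYNC via the ZIL
    zfs set sync=always tank/scratch     # treat every write as synchronous
    zfs set sync=disabled tank/scratch   # acknowledge sync requests immediately,
                                         # approximating an infinitely fast SLOG

    # meanwhile, watch which devices the writes actually land on
    gstat -f 'nvd|da'

The gap between sync=standard and sync=disabled is roughly the most a perfect SLOG could ever buy you on that workload; on an all-SSD pool that gap is often smaller than people expect, which speaks directly to Tenzin's question.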
And FreeBSD, in all its glory, is not yet very well suited to the problem of multi-tenancy. (It's great if all jails and VMs on a server are owned and controlled by one stakeholder who can coordinate their friendly coexistence.) My firm develops and supports a proprietary shim into ZFS and jails for enforcing the polite sharing of bandwidth, total IOPS and sync IOPS, which can be applied to groups whose membership is defined by arbitrary ZFS datasets. So there, that's my shameless plug, LOL. However, there are brighter minds than I working on this problem, and I'm hoping at some point either to participate in a more general development of such facilities, with broader application, in mainline FreeBSD, or perhaps to open source my own work eventually. (I guess I'm being more shy than selfish with it, LOL.)

Hope that's food for thought for some of you

-chad