Date: Fri, 8 Jul 2016 01:20:14 +0200
From: Cedric Blancher <cedric.blancher@gmail.com>
To: Karl Denninger <karl@denninger.net>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>,
 illumos-dev <developer@lists.illumos.org>,
 "Garrett D'Amore" <garrett@damore.org>
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <CALXu0UexG1G6ozZ+-QOpO168fT5n=L+yfKLJTzyRMWbCu6BjEg@mail.gmail.com>
In-Reply-To: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
References: <20160630140625.3b4aece3@splash.akips.com>
 <CALXu0UfxRMnaamh+po5zp=iXdNUNuyj+7e_N1z8j46MtJmvyVA@mail.gmail.com>
 <20160703123004.74a7385a@splash.akips.com>
 <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
 <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
I think Garrett D'Amore <garrett@damore.org> had some ideas about the
VM <----> ZFS communication and double/multi-caching issues too.

Ced

On 3 July 2016 at 17:43, Karl Denninger <karl@denninger.net> wrote:
>
> On 7/3/2016 02:45, Matthew Macy wrote:
>>
>> Cedric greatly overstates the intractability of resolving it.
>> Nonetheless, since the initial import very little has been done to
>> improve integration, and I don't know of anyone who is up to the task
>> taking an interest in it. Consequently, mmap() performance is likely
>> "doomed" for the foreseeable future.
>> -M
>
> Wellllll....
>
> I've done a fair bit of work here (see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
> political issues are at least as bad as the coding ones.
>
> In short, what Cedric says about the root of the issue is real. VM is
> really well implemented for what it handles, but the problem is that
> while the UFS data cache is part of VM, and thus VM "knows" about it,
> ZFS is not, because it is a "bolt-on." UMA leads to further (severe)
> complications for certain workloads.
>
> Finally, the underlying ZFS dmu_tx sizing code is just plain wrong, and
> in fact this is one of the biggest issues: when the system runs into
> trouble it can take a bad situation and make it a *lot* worse. There is
> only one write-back cache maintained instead of one per zvol, and
> that's flat-out broken. Being able to re-order async writes to disk
> (where fsync() has not been called) and minimize seek latency is
> excellent. Sadly, rotating media these days sabotage much of this due
> to opacity introduced at the drive level (e.g. varying sector counts
> per track, etc.), but it can still help. Where things go dramatically
> wrong is on a system where a large write-back cache is allocated
> relative to the underlying zvol I/O performance (this occurs on
> moderately large and bigger RAM systems) with moderate numbers of
> modest-performance rotating media; in this case it is entirely possible
> for a flush of the write buffers to require upwards of a *minute* to
> complete, during which all other writes block. If this happens during
> a period of high RAM demand and you manage to trigger a page-out at the
> same time, system performance goes straight into the toilet. I have
> seen instances where simply trying to edit a text file with vi (or run
> a "select" against a database table) hangs for upwards of a minute,
> leading you to believe the system has crashed when in fact it has not.
>
> The interaction of VM with the above can lead to severely pathological
> behavior, because the VM system has no way to tell the ZFS subsystem to
> pare back ARC (and, at least as important, perhaps more so, unused but
> allocated UMA) when memory pressure exists *before* it pages. ZFS tries
> to detect memory pressure and do this itself, but it winds up competing
> with the VM system. This leads to demonstrably wrong behavior, because
> you never want to hold disk cache in preference to RSS: if you have a
> block of data from the disk, the best case is that you avoid one I/O
> (to re-read it); if you page, you are *guaranteed* to take one I/O (to
> write the paged-out RSS to disk) and *might* take two (if you then must
> read it back in).
>
> In short, trading the avoidance of one *possible* I/O for one
> *guaranteed* I/O and a second possible one is *always* a net loss.
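The I/O accounting above can be made concrete with a back-of-the-envelope
expected-cost comparison. The sketch below is only an illustration; the
probabilities and names in it are hypothetical, not measurements from
this thread.

/*
 * Expected I/O cost of evicting a cached disk block vs. paging out an
 * RSS page.  The probabilities are made up for illustration.
 *
 * Evicting a cached block costs one read only if the block is needed
 * again (p_reread), so the expected cost is at most 1.
 * Paging out costs one guaranteed write, plus one read if the page is
 * touched again (p_refault), so the expected cost is at least 1.
 */
#include <stdio.h>

static double
cost_evict_cache(double p_reread)
{
	return (p_reread);		/* 0 <= cost <= 1 */
}

static double
cost_page_out(double p_refault)
{
	return (1.0 + p_refault);	/* 1 <= cost <= 2 */
}

int
main(void)
{
	double p_reread = 0.5, p_refault = 0.3;	/* hypothetical workload */

	printf("evict cached block: %.2f expected I/Os\n",
	    cost_evict_cache(p_reread));
	printf("page out RSS page:  %.2f expected I/Os\n",
	    cost_page_out(p_refault));
	return (0);
}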
> To "fix" all of this "correctly" (for all cases, instead of certain
> cases), VM would have to "know" about ARC and its use of UMA, along
> with being able to police both. ZFS also must have the dmu_tx
> write-back cache sized per-zvol, with its size chosen by the actual I/O
> performance characteristics of the disks in the zvol itself. I've
> looked into doing both and it's fairly complex, and what's worse is
> that it would effectively "marry" VM and ZFS, removing the "bolt-on"
> aspect of things. This then leads to a lot of maintenance work over
> time, because any time ZFS code changes (and it does, quite a bit) you
> have to go back through that process in order to stay coherent with
> Illumos.
>
> The PR above resolved (completely) the issues I was having, along with
> a number of other people, on 10.x and before (I've not yet rolled it
> forward to 11), but it's quite clearly a hack of sorts, in that it
> detects and treats symptoms (e.g. dynamic TX cache size modification,
> etc.) rather than integrating VM and ZFS cache management.
>
> --
> Karl Denninger
> karl@denninger.net <mailto:karl@denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/
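One way to read the per-zvol sizing point is as a simple rule: cap the
amount of dirty write-back data so that a full flush completes within a
bounded time at the throughput the underlying disks can actually
sustain. The sketch below only illustrates that arithmetic; the function
name, tunables, and numbers are hypothetical and are not actual ZFS code.

/*
 * Size a write-back cap from measured sustained write throughput and a
 * target worst-case flush time.  Names and values are hypothetical.
 */
#include <stdio.h>
#include <stdint.h>

static uint64_t
writeback_cap(uint64_t bytes_per_sec, uint64_t target_flush_sec,
    uint64_t floor_bytes, uint64_t ceil_bytes)
{
	uint64_t cap = bytes_per_sec * target_flush_sec;

	if (cap < floor_bytes)
		cap = floor_bytes;
	if (cap > ceil_bytes)
		cap = ceil_bytes;
	return (cap);
}

int
main(void)
{
	/* e.g. a small pool of rotating disks sustaining ~150 MB/s,
	   with a 5-second flush target, clamped to [64 MiB, 4 GiB] */
	uint64_t cap = writeback_cap(150ULL << 20, 5, 64ULL << 20,
	    4ULL << 30);

	printf("per-pool write-back cap: %llu MiB\n",
	    (unsigned long long)(cap >> 20));
	return (0);
}

A cap derived from the disks themselves, rather than one global cache
sized from RAM, would bound the minute-long flush stalls described
above by construction.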
--
Cedric Blancher <cedric.blancher@gmail.com>
Institute Pasteur