Date:      Fri, 8 Jul 2016 01:20:14 +0200
From:      Cedric Blancher <cedric.blancher@gmail.com>
To:        Karl Denninger <karl@denninger.net>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, illumos-dev <developer@lists.illumos.org>,  "Garrett D'Amore" <garrett@damore.org>
Subject:   Re: ZFS ARC and mmap/page cache coherency question
Message-ID:  <CALXu0UexG1G6ozZ+-QOpO168fT5n=L+yfKLJTzyRMWbCu6BjEg@mail.gmail.com>
In-Reply-To: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
References:  <20160630140625.3b4aece3@splash.akips.com> <CALXu0UfxRMnaamh+po5zp=iXdNUNuyj+7e_N1z8j46MtJmvyVA@mail.gmail.com> <20160703123004.74a7385a@splash.akips.com> <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org> <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>

I think Garrett D'Amore <garrett@damore.org> had some ideas about the
VM<---->ZFS communication and double/multicaching issues too.

Ced

On 3 July 2016 at 17:43, Karl Denninger <karl@denninger.net> wrote:
>
> On 7/3/2016 02:45, Matthew Macy wrote:
>>
>> Cedric greatly overstates the intractability of resolving it.
>> Nonetheless, since the initial import very little has been done to
>> improve integration, and I don't know of anyone who is up to the task
>> taking an interest in it.  Consequently, mmap() performance is likely
>> "doomed" for the foreseeable future.
>>
>> -M
>
> Wellllll....
>
> I've done a fair bit of work here (see
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
> political issues are at least as bad as the coding ones.
>
> In short, what Cedric says about the root of the issue is real.  VM is
> really well implemented for what it handles, but the problem is that
> while the UFS data cache is part of VM, and VM therefore "knows" about
> it, the ZFS cache is not, because ZFS is a "bolt-on."  UMA leads to
> further (severe) complications for certain workloads.
>
> Finally, the underlying ZFS dmu_tx sizing code is just plain wrong,
> and this is in fact one of the biggest issues: when the system runs
> into trouble it can take a bad situation and make it a *lot* worse.
> There is only one write-back cache maintained instead of one per zvol,
> and that's flat-out broken.  Being able to re-order async writes to
> disk (where fsync() has not been called) and minimize seek latency is
> excellent.  Sadly, rotating media these days sabotages much of this
> through opacity introduced at the drive level (e.g. varying sector
> counts per track, etc.), but it can still help.  Where things go
> dramatically wrong is on a system where a large write-back cache is
> allocated relative to the underlying zvol I/O performance (this occurs
> on moderately-large and bigger RAM systems) backed by a moderate
> number of modest-performance rotating drives; in that case it is
> entirely possible for a flush of the write buffers to require upwards
> of a *minute* to complete, during which all other writes block.  If
> this happens during a period of high RAM demand and you manage to
> trigger a page-out at the same time, system performance goes straight
> into the toilet.  I have seen instances where simply trying to edit a
> text file with vi (or run a "select" against a database table) hangs
> for upwards of a minute, leading you to believe the system has
> crashed when in fact it has not.
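
To put rough numbers on the stall Karl describes: the time for a full
flush is essentially the amount of dirty data allowed to accumulate
divided by what the vdevs can actually absorb.  A back-of-the-envelope
sketch (the dirty-data size and throughput figures below are
illustrative assumptions, not measurements from his systems):

    #include <stdio.h>

    /*
     * Rough model of a write-back flush stall: flush time is the dirty
     * data outstanding divided by the pool's effective write rate.
     * All figures are illustrative assumptions.
     */
    int
    main(void)
    {
            double dirty_mib = 8192.0;  /* assumed dirty data on a large-RAM box, MiB */
            double pool_mibs = 140.0;   /* assumed effective write rate of a few
                                           modest rotating drives under a mixed
                                           workload, MiB/s */

            printf("flush time: about %.0f seconds\n", dirty_mib / pool_mibs);
            return (0);
    }

That prints roughly 59 seconds of blocked writers, which is exactly
what a write-back cache whose size tracks RAM rather than the disks
behind it will produce.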
>
> The interaction of VM with the above can lead to severely pathological
> behavior, because the VM system has no way to tell the ZFS subsystem
> to pare back ARC (and, at least as important, perhaps more so, unused
> but still-allocated UMA) when memory pressure exists *before* it
> pages.  ZFS tries to detect memory pressure and do this itself, but it
> winds up competing with the VM system.  This leads to demonstrably
> wrong behavior, because you never want to hold disk cache in
> preference to RSS: if you hold a block of data from the disk, the best
> case is that you avoid one I/O (to re-read it); if you page, you are
> *guaranteed* to take one I/O (to write the paged-out RSS to disk) and
> *might* take two (if you then must read it back in).
>
> In short, trading the avoidance of one *possible* I/O for one
> *guaranteed* I/O plus a second possible one is *always* a net loss.
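
Karl's I/O-count argument can be written out as a simple expected-cost
comparison.  A small sketch (the re-reference probabilities are made-up
parameters purely for illustration; the inequality holds for any values
between 0 and 1 because the paging path always pays the page-out
write):

    #include <stdio.h>

    /*
     * Expected I/O cost of evicting a clean ARC block versus paging out
     * an RSS page.  p_reread / p_fault are assumed probabilities that
     * the evicted data is needed again.
     */
    static double
    cost_evict_arc(double p_reread)
    {
            return (p_reread);          /* at most one read, only if re-referenced */
    }

    static double
    cost_pageout_rss(double p_fault)
    {
            return (1.0 + p_fault);     /* guaranteed write, plus a possible read */
    }

    int
    main(void)
    {
            double p = 0.5;             /* illustrative probability for both cases */

            printf("evict ARC block: expected I/Os = %.2f\n", cost_evict_arc(p));
            printf("page out RSS:    expected I/Os = %.2f\n", cost_pageout_rss(p));
            return (0);
    }

Even with identical re-reference probabilities, the paging path costs
at least one more I/O, which is the sense in which preferring disk
cache over RSS is always a net loss.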
>
> To "fix" all of this "correctly" (for all cases, instead of certain
> cases), VM would have to "know" about ARC and its use of UMA, and be
> able to police both.  ZFS would also have to size the dmu_tx
> write-back cache per zvol, with each cache's size chosen from the
> actual I/O performance characteristics of the disks in that zvol.
> I've looked into doing both; it's fairly complex, and what's worse is
> that it would effectively "marry" VM and ZFS, removing the "bolt-on"
> aspect of things.  That in turn leads to a lot of maintenance work
> over time, because any time the ZFS code changes (and it does, quite a
> bit) you have to go back through that process in order to stay
> coherent with Illumos.
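
As a sketch of what "sized per zvol from actual I/O performance" could
mean in the abstract, here is a user-space model with hypothetical
names; it is not based on the real dmu_tx code, only on the idea of
deriving each cap from measured throughput and a target worst-case
flush latency:

    #include <stdio.h>
    #include <stdint.h>

    /*
     * Hypothetical per-backing-store write-back cap: instead of one
     * global limit, each pool/zvol gets a cap derived from its measured
     * write rate and the longest flush stall we are willing to accept.
     * All names and numbers are illustrative.
     */
    struct wb_cache {
            const char *name;
            double      measured_mibs;    /* observed sustained write rate, MiB/s */
            double      target_flush_sec; /* acceptable worst-case flush time */
            uint64_t    cap_bytes;        /* derived write-back cap */
    };

    static void
    wb_cache_resize(struct wb_cache *wb)
    {
            wb->cap_bytes = (uint64_t)(wb->measured_mibs *
                wb->target_flush_sec) * 1024 * 1024;
    }

    int
    main(void)
    {
            struct wb_cache pools[] = {
                    { "fast-ssd-pool", 400.0, 5.0, 0 },
                    { "slow-raidz",     90.0, 5.0, 0 },
            };

            for (size_t i = 0; i < sizeof(pools) / sizeof(pools[0]); i++) {
                    wb_cache_resize(&pools[i]);
                    printf("%-14s cap = %llu MiB\n", pools[i].name,
                        (unsigned long long)(pools[i].cap_bytes /
                        (1024 * 1024)));
            }
            return (0);
    }

The measured rate would have to come from the pool's own I/O statistics
and be re-evaluated as the workload changes; the point is only that the
cap becomes a function of the disks behind it rather than of how much
RAM the machine happens to have.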
>
> The PR above resolved (completely) the issues I was having, along with
> those of a number of other people, on 10.x and earlier (I've not yet
> rolled it forward to 11), but it's quite clearly a hack of sorts, in
> that it detects and treats symptoms (e.g. dynamic TX cache size
> modification, etc.) rather than integrating VM and ZFS cache
> management.
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/



-- 
Cedric Blancher <cedric.blancher@gmail.com>
Institute Pasteur


