Date:      Mon, 4 Jul 2016 23:01:16 -0400
From:      Allan Jude <allanjude@freebsd.org>
To:        freebsd-hackers@freebsd.org
Subject:   Re: ZFS ARC and mmap/page cache coherency question
Message-ID:  <272d657a-52ae-4f45-008c-3de6fb1b0c48@freebsd.org>
In-Reply-To: <ec4685b2-bdaf-c18d-8aff-38b17edf4ebb@denninger.net>
References:  <20160630140625.3b4aece3@splash.akips.com> <CALXu0UfxRMnaamh%2Bpo5zp=iXdNUNuyj%2B7e_N1z8j46MtJmvyVA@mail.gmail.com> <20160703123004.74a7385a@splash.akips.com> <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org> <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net> <155b84da0aa.ad3af0e6139335.8627172617037605875@nextbsd.org> <7e00af5a-86cd-25f8-a4c6-2d946b507409@denninger.net> <34cf2d30-8884-95b6-f852-457d55710daf@freebsd.org> <768b6169-70d9-5500-c455-563d8340972e@denninger.net> <b03f73a1-95c9-c753-3464-74fcb45351e5@freebsd.org> <ec4685b2-bdaf-c18d-8aff-38b17edf4ebb@denninger.net>

On 2016-07-04 22:46, Karl Denninger wrote:
> On 7/4/2016 21:36, Allan Jude wrote:
>> On 2016-07-04 22:32, Karl Denninger wrote:
>>> On 7/4/2016 21:28, Allan Jude wrote:
>>>> On 2016-07-04 22:26, Karl Denninger wrote:
>>>>>
>>>>> On 7/4/2016 18:45, Matthew Macy wrote:
>>>>>>
>>>>>>  ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger
>>>>>> <karl@denninger.net> wrote ----
>>>>>>  >
>>>>>>  > On 7/3/2016 02:45, Matthew Macy wrote:
>>>>>>  > >
>>>>>>  > > Cedric greatly overstates the intractability of resolving
>>>>>>  > > it. Nonetheless, since the initial import very little has
>>>>>>  > > been done to improve integration, and I don't know of anyone
>>>>>>  > > who is up to the task taking an interest in it. Consequently,
>>>>>>  > > mmap() performance is likely "doomed" for the foreseeable
>>>>>>  > > future.
>>>>>>  > > -M
>>>>>>  >
>>>>>>  > Wellllll....
>>>>>>  >
>>>>>>  > I've done a fair bit of work here (see
>>>>>>  > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>>>>>  > political issues are at least as bad as the coding ones.
>>>>>>  >
>>>>>>
>>>>>>
>>>>>> Strictly speaking, the root of the problem is the ARC. Not ZFS per
>>>>>> se. Have you ever tried disabling MFU caching to see how much
>>>>>> worse LRU only is? I'm not really convinced the ARC's benefits
>>>>>> justify its cost.
>>>>>>
>>>>>> -M
>>>>>>
>>>>> The ARC is very useful when it gets a hit as it avoids an I/O that
>>>>> would otherwise take place.
>>>>>
>>>>> Where it sucks is when the system evicts working set to preserve ARC.
>>>>> That's always wrong in that you're trading a speculative I/O (if the
>>>>> cache is hit later) for a *guaranteed* one (to page out) and maybe
>>>>> *two* (to page back in.)
>>>>>
>>>> ZFS is better behaved in 11.x; there is a sysctl,
>>>> vfs.zfs.arc_free_target, that makes sure the ARC is reined in when
>>>> there is memory pressure, by ensuring a minimum amount of actually
>>>> free pages.
>>>>
>>> Oh, but.....
>>>
>>> Again, go read the PR I linked (and the current version of the patch
>>> against 10-STABLE.)  The issues are far more intertwined than that.
>>> Specifically, the dmu_tx cache decision (size of the write-back cache)
>>> is flat-out broken and inappropriate in essentially all cases, and the
>>> interaction of UMA and ARC is very destructive under a wide variety of
>>> workloads.  The patch has a hack-around for the dmu_tx problem and a
>>> reasonably-effective fix for the UMA issues.  Actually fixing dmu_tx,
>>> however, is nowhere near that easy since it really needs to be computed
>>> per-zvol on an actual bytes moved per-unit-of-time basis.
>>>
>>> Note that one of the patches in the set I developed is indeed
>>> arc_free_target (indeed it was the first approach I took) -- but without
>>> addressing the other two issues it doesn't solve the problem.
>>>
>>
>> You keep saying per zvol. Do you mean per vdev? I am under the
>> impression that no zvols are involved in the use case this thread is
>> about.
> Sorry, per-vdev.  The problem with dmu_tx is that it's system-wide.
> This is wildly inappropriate for several reasons -- first, it is
> computed on size-of-RAM with a hard cap (which is stupid on its face)
> and it is entirely insensitive to the performance of the vdevs in
> question.  Specifically, it is very common for a system to have very
> fast (e.g. SSD) disks, perhaps in a mirror configuration, and then
> spinning rust in a RaidZ2 config for bulk storage.  Those are very, very
> different performance wise and they should have wildly different
> write-back cache sizes.  At present there is exactly one such write-back
> cache and it's both system-wide and pays exactly zero attention to the
> throughput of the underlying vdevs it is talking to.
>
> This is why you can provoke minute-long stalls on a system with moderate
> (e.g. 32GB) amounts of RAM if there are spinning rust devices in the
> configuration.
>
>>
>> Improving the way ZFS frees memory, specifically UMA and the 'kmem
>> caches' will help a lot as well.
>>
> Well, yeah.  But that means you have to police up the size of the UMA
> vs. how much is actually in use in the UMA.  What the PR does is get
> pretty aggressive with that whenever RAM is tight, and before the pager
> can start playing hell with system performance.
>
>> In addition, another patch just went in to allow you to change the
>> arc_max and arc_min on a running system.
>>
> Yes, the PR I did a long time ago made that "active" on a running
> system.... so I've had that for quite some time.  Not that you really
> ought to need to play with that (if you feel a need to then you're still
> at step 1 or 2 of what I went through with analyzing and working on this
> in the 10.x code.....)
>

Have you looked into the ZFS 'Write Throttle'? It seems like it was
meant to solve the writeback problem you are describing. It starts
sending back pressure up to the application by introducing larger and
larger delays in the write() call until your disks can keep up with
your applications.

http://dtrace.org/blogs/ahl/2014/02/10/the-openzfs-write-throttle/

http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/
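
For illustration, the back-pressure idea can be sketched roughly as
below. This is only a toy model, not the OpenZFS code: the function
name, the parameter names and the default values are all assumptions
picked for the example (they loosely mirror the tunables discussed in
the write throttle post linked above). No delay is applied while dirty
data stays below a threshold; past it, the per-write delay grows
steeply as dirty data approaches the limit, and is capped at a maximum.

```python
def write_delay_ns(dirty_bytes, dirty_max_bytes,
                   delay_min_percent=60,        # assumed threshold
                   delay_scale_ns=500_000,      # assumed curve scale
                   delay_max_ns=100_000_000):   # assumed delay cap
    """Toy model: delay (ns) to impose on a write, given dirty data."""
    # Below this many dirty bytes, writes are not delayed at all.
    delay_min_bytes = dirty_max_bytes * delay_min_percent // 100
    if dirty_bytes <= delay_min_bytes:
        return 0
    # At (or past) the limit, apply the maximum delay.
    if dirty_bytes >= dirty_max_bytes:
        return delay_max_ns
    # Delay grows without bound as dirty_bytes nears dirty_max_bytes,
    # so clamp it to the cap.
    delay = (delay_scale_ns * (dirty_bytes - delay_min_bytes)
             // (dirty_max_bytes - dirty_bytes))
    return min(delay, delay_max_ns)


if __name__ == "__main__":
    limit = 4 * 1024**3  # hypothetical 4 GiB dirty-data limit
    for pct in (50, 60, 70, 80, 90, 95, 99):
        dirty = limit * pct // 100
        print(pct, write_delay_ns(dirty, limit))
```

The point of the curve is that the application is slowed down
gradually as the pool falls behind, rather than running at full speed
until the limit is hit and then stalling outright.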



-- 
Allan Jude


