Date:      Tue, 5 Jul 2016 20:40:30 +0200
From:      Lionel Cons <lionelcons1972@gmail.com>
To:        Karl Denninger <karl@denninger.net>
Cc:        Freebsd hackers list <freebsd-hackers@freebsd.org>
Subject:   Re: ZFS ARC and mmap/page cache coherency question
Message-ID:  <CAPJSo4VtJ1%2Btxt4s13nKSWrj9fDTv5VsLVyMsX%2BDarBUVYMbOQ@mail.gmail.com>
In-Reply-To: <31f4d30f-4170-0d04-bd23-1b998474a92e@denninger.net>
References:  <20160630140625.3b4aece3@splash.akips.com> <CALXu0UfxRMnaamh%2Bpo5zp=iXdNUNuyj%2B7e_N1z8j46MtJmvyVA@mail.gmail.com> <20160703123004.74a7385a@splash.akips.com> <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org> <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net> <155b84da0aa.ad3af0e6139335.8627172617037605875@nextbsd.org> <7e00af5a-86cd-25f8-a4c6-2d946b507409@denninger.net> <155bc1260e6.12001bf18198857.6272515207330027022@nextbsd.org> <31f4d30f-4170-0d04-bd23-1b998474a92e@denninger.net>

So what Oracle did (based on work done by Sun for OpenSolaris) was to:
1. Modify ZFS to prevent *ANY* double/multi caching [this is
considered a design defect]
2. Introduce a new VM subsystem which scales a lot better and provides
hooks for [1] so there are never two or more copies of the same data
in the system

Given that this was a huge, paid, multi-year effort, it's not likely
that the design defects in open-source ZFS will ever go away.
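
The two copies are easy to see on a stock FreeBSD box. Below is a
minimal sketch (illustration only; the file path is a placeholder, so
point it at any large file on a ZFS dataset): read(2) pulls the data
into the ARC, and a following mmap(2) scan instantiates the same bytes
a second time in the VM page cache, which ZFS then has to keep
coherent by copying between the two. kstat.zfs.misc.arcstats.size is
the ARC size counter the base system already exports.

/*
 * Illustration only: watch the ARC hold a copy of file data that a
 * later mmap(2) scan instantiates a second time in the VM page cache.
 * The path is a placeholder; the 4096 stride assumes a 4 KB page size.
 */
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/sysctl.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t arc_size(void) {
    uint64_t v = 0;
    size_t len = sizeof(v);

    sysctlbyname("kstat.zfs.misc.arcstats.size", &v, &len, NULL, 0);
    return v;
}

int main(void) {
    const char *path = "/tank/bigfile";   /* placeholder path */
    char buf[64 * 1024];
    struct stat st;
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd == -1 || fstat(fd, &st) == -1) {
        perror(path);
        return 1;
    }

    printf("ARC before read(2):     %ju\n", (uintmax_t)arc_size());

    /* read(2) goes through the DMU: the data lands in the ARC. */
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;

    printf("ARC after read(2):      %ju\n", (uintmax_t)arc_size());

    /* mmap(2) serves the same bytes out of the VM page cache, so a
     * second copy is now instantiated and must be kept coherent. */
    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    volatile char sink = 0;
    for (off_t off = 0; off < st.st_size; off += 4096)
        sink += p[off];
    (void)sink;

    printf("ARC after mmap(2) scan: %ju\n", (uintmax_t)arc_size());

    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}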

Lionel

On 5 July 2016 at 19:50, Karl Denninger <karl@denninger.net> wrote:
>
> On 7/5/2016 12:19, Matthew Macy wrote:
>>
>>
>>  ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger <karl@denninger.net> wrote ----
>>  >
>>  >
>>  > On 7/4/2016 18:45, Matthew Macy wrote:
>>  > >
>>  > >
>>  > >  ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger <karl@denninger.net> wrote ----
>>  > >  >
>>  > >  > On 7/3/2016 02:45, Matthew Macy wrote:
>>  > >  > >
>>  > >  > > Cedric greatly overstates the intractability of resolving it.
>>  > >  > > Nonetheless, since the initial import very little has been done to
>>  > >  > > improve integration, and I don't know of anyone who is up to the
>>  > >  > > task taking an interest in it. Consequently, mmap() performance is
>>  > >  > > likely "doomed" for the foreseeable future. -M ----
>>  > >  >
>>  > >  > Wellllll....
>>  > >  >
>>  > >  > I've done a fair bit of work here (see
>>  > >  > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
>>  > >  > political issues are at least as bad as the coding ones.
>>  > >  >
>>  > >
>>  > >
>>  > > Strictly speaking, the root of the problem is the ARC, not ZFS per
>>  > > se. Have you ever tried disabling MFU caching to see how much worse
>>  > > LRU-only is? I'm not really convinced the ARC's benefits justify its
>>  > > cost.
>>  > >
>>  > > -M
>>  > >
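
To make the LRU vs. LRU+MFU distinction concrete, here is a toy sketch
of the difference (this is nothing like the real ARC: no ghost lists,
no adaptive target size, and the two configurations are not
capacity-matched; the block numbers are made up). The point it shows is
that a one-shot scan flushes a pure-LRU cache, while blocks that have
been touched twice sit on the frequent list and survive the scan:

/*
 * Toy model only: a pure-LRU cache next to a crude two-list
 * (recent + frequent) cache.  This is not the real ARC (no ghost
 * lists, no adaptive sizing); it just shows what MFU buys you.
 */
#include <stdio.h>
#include <string.h>

#define CAP 4                   /* entries per list; tiny on purpose */

struct list { int blk[CAP]; int n; };

static int find(struct list *l, int blk) {
    for (int i = 0; i < l->n; i++)
        if (l->blk[i] == blk)
            return i;
    return -1;
}

static void del_at(struct list *l, int i) {
    memmove(&l->blk[i], &l->blk[i + 1], (l->n - i - 1) * sizeof(int));
    l->n--;
}

static void push_mru(struct list *l, int blk) {
    if (l->n == CAP)            /* full: the LRU tail falls off */
        l->n--;
    memmove(&l->blk[1], &l->blk[0], l->n * sizeof(int));
    l->blk[0] = blk;
    l->n++;
}

/* Pure LRU: every access, hit or miss, goes to the head of one list. */
static int lru_access(struct list *l, int blk) {
    int i = find(l, blk);
    if (i >= 0)
        del_at(l, i);
    push_mru(l, blk);
    return i >= 0;
}

/* Two lists: a first touch lands on "recent"; a second touch promotes
 * the block to "frequent", which one-shot scan traffic never reaches. */
static int mfu_access(struct list *rec, struct list *freq, int blk) {
    int i = find(freq, blk);
    if (i >= 0) { del_at(freq, i); push_mru(freq, blk); return 1; }
    i = find(rec, blk);
    if (i >= 0) { del_at(rec, i); push_mru(freq, blk); return 1; }
    push_mru(rec, blk);
    return 0;
}

int main(void) {
    struct list lru = { .n = 0 }, rec = { .n = 0 }, freq = { .n = 0 };
    int hot[] = { 1, 2, 3 };
    int lru_hits = 0, mfu_hits = 0;

    for (int pass = 0; pass < 2; pass++)        /* warm the hot blocks */
        for (int i = 0; i < 3; i++) {
            lru_access(&lru, hot[i]);
            mfu_access(&rec, &freq, hot[i]);
        }
    for (int blk = 100; blk < 120; blk++) {     /* one-shot scan */
        lru_access(&lru, blk);
        mfu_access(&rec, &freq, blk);
    }
    for (int i = 0; i < 3; i++) {               /* re-touch the hot set */
        lru_hits += lru_access(&lru, hot[i]);
        mfu_hits += mfu_access(&rec, &freq, hot[i]);
    }
    printf("after scan: LRU %d/3 hits, LRU+MFU %d/3 hits\n",
        lru_hits, mfu_hits);
    return 0;
}

Run it and the pure-LRU cache misses all three hot blocks after the
scan, while the two-list cache still hits all three.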
>>  >
>>  > The ARC is very useful when it gets a hit, as it avoids an I/O that
>>  > would otherwise take place.
>>  >
>>  > Where it sucks is when the system evicts working set to preserve ARC.
>>  > That's always wrong in that you're trading a speculative I/O (if the
>>  > cache is hit later) for a *guaranteed* one (to page out) and maybe
>>  > *two* (to page back in.)
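
Spelling out the arithmetic behind "always wrong" (p and q here are
hypothetical probabilities, not measured numbers): keeping an ARC page
saves at most one future read, with hit probability p <= 1; evicting a
dirty working-set page to hold onto it costs one pageout write now,
plus one read with refault probability q. The net benefit is

    p - (1 + q) <= 1 - (1 + q) = -q <= 0

so for dirty working set the trade can never come out ahead; only a
clean page (no pageout needed) can break even, and then only when
p >= q.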
>>
>> The question wasn't ARC vs. no caching. It was LRU-only vs. LRU + MFU.
>> There are a lot of issues stemming from the fact that ZFS is a
>> transactional object store with a POSIX FS on top. One is that it
>> caches disk blocks as opposed to file blocks. However, if one could
>> resolve that and have the page cache manage these blocks, life would
>> be much, much better. But then you'd lose MFU. Hence my question.
>>
>> -M
>>
> I suspect there's an argument to be made there, but the present
> problems make determining the impact of that difficult or impossible,
> as those effects are swamped by the other issues.
>
> I can fairly easily create workloads on the base code where simply
> typing "vi <some file>", making a change and hitting ":w" will result in
> a stall of tens of seconds or more while the cache flush that gets
> requested is run down.  I've resolved a good part (but not all
> instances) of this through my work.
>
> My understanding is that 11- has had additional work done to the base
> code, but three underlying issues are, from what I can see in the
> commit logs and discussions, still not addressed: the VM system will
> page out working set while leaving the ARC alone; UMA
> reserved-but-not-in-use space is not policed adequately when memory
> pressure exists, *before* the pager starts considering evicting
> working set; and the write-back cache is, for many machine
> configurations, grossly inappropriate and cannot be tuned adequately
> by hand (this is particularly true on a system with vdevs that have
> materially varying performance levels.)
>
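
For anyone who wants to see what their own machine is doing, the ARC
bounds and the write-back ("dirty data") limit Karl is referring to are
at least readable from userland. A small diagnostic sketch (it only
reads; the sysctl names are the stock FreeBSD 10/11 ones, and
vfs.zfs.arc_max is normally set at boot via /boot/loader.conf rather
than poked at runtime):

/*
 * Print the ARC sizing and write-back tunables discussed above.
 * Values are read only; any knob missing on a given system is
 * reported rather than assumed.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

static void show(const char *name) {
    uint64_t v64;
    uint32_t v32;
    size_t len = sizeof(v64);

    /* Most of these are 64-bit; fall back to 32-bit for int knobs. */
    if (sysctlbyname(name, &v64, &len, NULL, 0) == 0 && len == sizeof(v64)) {
        printf("%-32s %ju\n", name, (uintmax_t)v64);
        return;
    }
    len = sizeof(v32);
    if (sysctlbyname(name, &v32, &len, NULL, 0) == 0)
        printf("%-32s %u\n", name, (unsigned)v32);
    else
        printf("%-32s (not present on this system)\n", name);
}

int main(void) {
    show("vfs.zfs.arc_max");              /* upper bound on ARC size */
    show("vfs.zfs.arc_min");              /* floor the ARC won't shrink below */
    show("vfs.zfs.dirty_data_max");       /* write-back cache: dirty byte limit */
    show("vfs.zfs.txg.timeout");          /* seconds between forced txg syncs */
    show("kstat.zfs.misc.arcstats.size"); /* ARC size right now */
    return 0;
}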
> I have more-or-less stopped work on the tree on a forward basis since I
> got to a place with 10.2 that (1) works for my production requirements,
> resolving the problems and (2) ran into what I deemed to be intractable
> political issues within core on progress toward eradicating the root of
> the problem.
>
> I will probably revisit the situation with 11- at some point, as I'll
> want to roll my production systems forward.  However, I don't know when
> that will be -- right now 11- is stable enough for some of my embedded
> work (e.g. on the Raspberry Pi2) but is not on my server and
> client-class machines.  Indeed just yesterday I got a lock-order
> reversal panic while doing a shutdown after a kernel update on one of my
> lab boxes running a just-updated 11- codebase.
>
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/



--
Lionel


