Date: Tue, 5 Jul 2016 20:40:30 +0200
From: Lionel Cons <lionelcons1972@gmail.com>
To: Karl Denninger <karl@denninger.net>
Cc: Freebsd hackers list <freebsd-hackers@freebsd.org>
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <CAPJSo4VtJ1+txt4s13nKSWrj9fDTv5VsLVyMsX+DarBUVYMbOQ@mail.gmail.com>
In-Reply-To: <31f4d30f-4170-0d04-bd23-1b998474a92e@denninger.net>
References: <20160630140625.3b4aece3@splash.akips.com>
 <CALXu0UfxRMnaamh+po5zp=iXdNUNuyj+7e_N1z8j46MtJmvyVA@mail.gmail.com>
 <20160703123004.74a7385a@splash.akips.com>
 <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
 <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
 <155b84da0aa.ad3af0e6139335.8627172617037605875@nextbsd.org>
 <7e00af5a-86cd-25f8-a4c6-2d946b507409@denninger.net>
 <155bc1260e6.12001bf18198857.6272515207330027022@nextbsd.org>
 <31f4d30f-4170-0d04-bd23-1b998474a92e@denninger.net>
So what Oracle did (based on work done by Sun for OpenSolaris) was to:

1. Modify ZFS to prevent *ANY* double/multi caching [this is
   considered a design defect]
2. Introduce a new VM subsystem which scales a lot better and
   provides hooks for [1], so there are never two or more copies of
   the same data in the system

Given that this was a huge, paid, multi-year effort, it's not likely
that the design defects in open-source ZFS will ever go away.

Lionel

On 5 July 2016 at 19:50, Karl Denninger <karl@denninger.net> wrote:
>
> On 7/5/2016 12:19, Matthew Macy wrote:
>>
>> ---- On Mon, 04 Jul 2016 19:26:06 -0700 Karl Denninger
>> <karl@denninger.net> wrote ----
>> >
>> > On 7/4/2016 18:45, Matthew Macy wrote:
>> > >
>> > > ---- On Sun, 03 Jul 2016 08:43:19 -0700 Karl Denninger
>> > > <karl@denninger.net> wrote ----
>> > > >
>> > > > On 7/3/2016 02:45, Matthew Macy wrote:
>> > > > >
>> > > > > Cedric greatly overstates the intractability of resolving
>> > > > > it. Nonetheless, since the initial import very little has
>> > > > > been done to improve integration, and I don't know of
>> > > > > anyone who is up to the task taking an interest in it.
>> > > > > Consequently, mmap() performance is likely "doomed" for
>> > > > > the foreseeable future. -M
>> > > >
>> > > > Wellllll....
>> > > >
>> > > > I've done a fair bit of work here (see
>> > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594)
>> > > > and the political issues are at least as bad as the coding
>> > > > ones.
>> > > >
>> > >
>> > > Strictly speaking, the root of the problem is the ARC, not ZFS
>> > > per se. Have you ever tried disabling MFU caching to see how
>> > > much worse LRU-only is? I'm not really convinced the ARC's
>> > > benefits justify its cost.
>> > >
>> > > -M
>> > >
>> >
>> > The ARC is very useful when it gets a hit, as it avoids an I/O
>> > that would otherwise take place.
>> >
>> > Where it sucks is when the system evicts working set to preserve
>> > ARC. That's always wrong, in that you're trading a speculative
>> > I/O (if the cache is hit later) for a *guaranteed* one (to page
>> > out) and maybe *two* (to page back in).
>>
>> The question wasn't ARC vs. no caching. It was LRU-only vs. LRU +
>> MFU. There are a lot of issues stemming from the fact that ZFS is
>> a transactional object store with a POSIX FS on top. One is that
>> it caches disk blocks as opposed to file blocks. However, if one
>> could resolve that and have the page cache manage these blocks,
>> life would be much, much better. However, you'd lose MFU. Hence my
>> question.
>>
>> -M
>>
> I suspect there's an argument to be made there, but the present
> problems make determining the impact of that difficult or
> impossible, as those effects are swamped by the other issues.
>
> I can fairly easily create workloads on the base code where simply
> typing "vi <some file>", making a change and hitting ":w" will
> result in a stall of tens of seconds or more while the cache flush
> that gets requested is run down. I've resolved a good part (but not
> all instances) of this through my work.
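To make the double-caching point above concrete, here is a minimal
sketch. It is illustrative only: the path /tmp/arc-demo is assumed to
sit on a ZFS dataset. write(2) stages the data in the ARC, faulting
the mapping copies it into the VM page cache, and the kernel then has
to keep the two copies coherent by hand:

/*
 * Minimal sketch of the mmap()/write() coherency issue discussed in
 * this thread. Assumes /tmp/arc-demo is on a ZFS dataset.
 */
#include <sys/mman.h>

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	const char path[] = "/tmp/arc-demo";	/* assumed ZFS-backed */
	char first, *p;
	int fd;

	if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644)) == -1)
		return (1);
	if (write(fd, "old", 3) != 3)		/* lands in the ARC */
		return (1);
	p = mmap(NULL, 3, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return (1);
	first = p[0];	/* fault: data copied ARC -> page cache */
	if (pwrite(fd, "new", 3, 0) != 3)	/* updates the ARC copy */
		return (1);
	/*
	 * The kernel must propagate the pwrite() to the page-cache
	 * copy as well; if it did not, this would print the stale
	 * "old" through the mapping.
	 */
	printf("before: %c, after: %.3s\n", first, p);
	munmap(p, 3);
	close(fd);
	return (0);
}

FreeBSD's ZFS does propagate the update, and that synchronization
work on two copies of the same bytes is exactly the cost under
discussion; the Oracle approach above removes the second copy
instead.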
>
> My understanding is that 11- has had additional work done to the
> base code, but three underlying issues are not, from what I can see
> in the commit logs and discussions, addressed: the VM system will
> page out working set while leaving the ARC alone; UMA
> reserved-but-not-in-use space is not policed adequately when memory
> pressure exists, *before* the pager starts considering evicting
> working set; and the write-back cache is for many machine
> configurations grossly inappropriate and cannot be tuned adequately
> by hand (this is particularly true on a system with vdevs that have
> materially varying performance levels).
>
> I have more or less stopped work on the tree on a forward basis,
> since I got to a place with 10.2 that (1) works for my production
> requirements, resolving the problems, and (2) ran into what I
> deemed to be intractable political issues within core on progress
> toward eradicating the root of the problem.
>
> I will probably revisit the situation with 11- at some point, as
> I'll want to roll my production systems forward. However, I don't
> know when that will be -- right now 11- is stable enough for some
> of my embedded work (e.g. on the Raspberry Pi 2) but is not on my
> server- and client-class machines. Indeed, just yesterday I got a
> lock-order reversal panic while doing a shutdown after a kernel
> update on one of my lab boxes running a just-updated 11- codebase.
>
> --
> Karl Denninger
> karl@denninger.net <mailto:karl@denninger.net>
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/

-- 
Lionel
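P.S. For anyone who wants to watch the eviction pressure Karl
describes, below is a minimal userland sketch, illustrative only,
that prints the ARC size against its cap next to the free-page count.
It uses only stock sysctl OIDs (kstat.zfs.misc.arcstats.size,
vfs.zfs.arc_max, vm.stats.vm.v_free_count) that are present on
FreeBSD 10/11 with ZFS loaded:

/*
 * Minimal ARC-vs-free-memory monitor. The complaint in this thread
 * is that the pager evicts working set while the ARC sits near
 * vfs.zfs.arc_max; this just makes the two numbers visible together.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t
get64(const char *oid)
{
	uint64_t v = 0;
	size_t len = sizeof(v);

	if (sysctlbyname(oid, &v, &len, NULL, 0) == -1)
		return (0);
	return (v);
}

int
main(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	uint32_t freep;
	size_t len;

	for (;;) {
		uint64_t arc = get64("kstat.zfs.misc.arcstats.size");
		uint64_t cap = get64("vfs.zfs.arc_max");

		freep = 0;
		len = sizeof(freep);
		(void)sysctlbyname("vm.stats.vm.v_free_count", &freep,
		    &len, NULL, 0);
		printf("arc %6ju MB / cap %6ju MB | free %6ju MB\n",
		    (uintmax_t)(arc >> 20), (uintmax_t)(cap >> 20),
		    (uintmax_t)(((uint64_t)freep *
		    (uint64_t)pagesz) >> 20));
		sleep(5);
	}
}

The common stopgap is to cap the ARC by setting vfs.zfs.arc_max in
/boot/loader.conf; that bounds the first problem but does nothing for
the UMA policing or write-back tuning issues Karl raises.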