Date: Sun, 3 Jul 2016 10:43:19 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-hackers@freebsd.org
Subject: Re: ZFS ARC and mmap/page cache coherency question
Message-ID: <45865ae6-18c9-ce9a-4a1e-6b2a8e44a8b2@denninger.net>
In-Reply-To: <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
References: <20160630140625.3b4aece3@splash.akips.com>
 <CALXu0UfxRMnaamh%2Bpo5zp=iXdNUNuyj%2B7e_N1z8j46MtJmvyVA@mail.gmail.com>
 <20160703123004.74a7385a@splash.akips.com>
 <155afb8148f.c6f5294d33485.2952538647262141073@nextbsd.org>
On 7/3/2016 02:45, Matthew Macy wrote:

> Cedric greatly overstates the intractability of resolving it.
> Nonetheless, since the initial import very little has been done to
> improve integration, and I don't know of anyone who is up to the
> task taking an interest in it. Consequently, mmap() performance is
> likely "doomed" for the foreseeable future.
>
> -M----

Wellllll.... I've done a fair bit of work here (see
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594) and the
political issues are at least as bad as the coding ones.

In short, what Cedric says about the root of the issue is real. The VM
system is very well implemented for what it handles, but while the UFS
data cache is part of VM, and VM therefore "knows" about it, the ZFS
ARC is not, because ZFS is a "bolt-on." UMA leads to further (severe)
complications for certain workloads. Finally, the underlying ZFS dmu_tx
sizing code is just plain wrong, and this is in fact one of the biggest
issues: when the system runs into trouble it can take a bad situation
and make it a *lot* worse. Only one write-back cache is maintained,
instead of one per zvol, and that's flat-out broken.

Being able to re-order async writes to disk (where fsync() has not been
called) to minimize seek latency is excellent. Sadly, rotating media
these days sabotage much of this through opacity introduced at the
drive level (e.g. varying sector counts per track), but it can still
help.

Where things go dramatically wrong is on a system where the write-back
cache is large relative to the underlying zvol's I/O performance (this
occurs on moderately-large and bigger RAM systems) with a moderate
number of modest-performance rotating drives. In that case it is
entirely possible for a flush of the write buffers to take upwards of a
*minute* to complete, during which all other writes block. If this
happens during a period of high RAM demand and you manage to trigger a
page-out at the same time, system performance goes straight into the
toilet. I have seen instances where simply trying to edit a text file
with vi (or running a "select" against a database table) hangs for
upwards of a minute, leading you to believe the system has crashed when
in fact it has not.

The interaction of VM with the above can produce severely pathological
behavior, because the VM system has no way to tell the ZFS subsystem to
pare back the ARC (and, at least as important and perhaps more so,
unused-but-still-allocated UMA) when memory pressure exists *before* it
pages. ZFS tries to detect memory pressure and do this itself, but it
winds up competing with the VM system. That is demonstrably wrong
behavior, because you never want to hold disk cache in preference to
RSS: if you hold a block of disk data in cache, the best case is that
you avoid one I/O (the re-read); if you page, you are *guaranteed* to
take one I/O (writing the paged-out RSS to disk) and *might* take two
(if you must later read it back in). Trading the avoidance of one
*possible* I/O for one *guaranteed* I/O plus a second possible one is
*always* a net loss.

To "fix" all of this "correctly" (for all cases, instead of certain
cases) VM would have to "know" about the ARC and its use of UMA, and be
able to police both. ZFS would also need the dmu_tx write-back cache
sized per-zvol, with the size chosen from the actual I/O performance
characteristics of the disks in that zvol.
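As a concrete illustration of the first half: FreeBSD already gives the
kernel a pressure-notification mechanism, the vm_lowmem eventhandler,
and the ZFS port hooks it so the ARC hears about shortage. A minimal
sketch of that shape is below (the "example_" names are mine, not the
actual arc.c code). The problem described above is one of timing: this
handler fires when the VM system is *already* short, so ARC shrinking
races the pageout daemon instead of preceding it.

    /*
     * Sketch of a vm_lowmem hook (kernel-module context).  The
     * "example_" names are illustrative; the real hook lives in the
     * ZFS port's arc.c.
     */
    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/eventhandler.h>

    static eventhandler_tag example_lowmem_tag;

    static void
    example_arc_lowmem(void *arg __unused, int howto __unused)
    {
            /*
             * Kick ARC reclaim here.  By the time this runs the
             * pageout daemon may also be running, so RSS can be
             * written to swap while clean disk cache is still held:
             * the inversion described above.
             */
    }

    static void
    example_lowmem_register(void)
    {
            example_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
                example_arc_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
    }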
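And the second half is mostly arithmetic. A back-of-envelope sketch
(the names and the five-second target are mine, not actual ZFS
tunables): bound dirty data by what the disks can actually flush in a
tolerable time, rather than by a fraction of RAM.

    /*
     * Illustration only: size the write-back (dirty data) cap from
     * measured pool throughput so a full flush stays within a
     * tolerable stall.  Hypothetical names, not ZFS tunables.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_FLUSH_SECONDS    5ULL    /* longest tolerable stall */

    static uint64_t
    dirty_data_cap(uint64_t write_bytes_per_sec)
    {
            return (write_bytes_per_sec * TARGET_FLUSH_SECONDS);
    }

    int
    main(void)
    {
            uint64_t bps = 120ULL << 20;    /* ~120 MB/s of rotating media */

            printf("cap = %ju MB\n",
                (uintmax_t)(dirty_data_cap(bps) >> 20));
            return (0);
    }

At 120 MB/s that caps dirty data around 600 MB, which flushes in about
five seconds; a RAM-derived cache ten or more times that size on the
same disks is exactly where the minute-long stalls come from.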
I've looked into doing both, and it's fairly complex. What's worse is
that it would effectively "marry" VM and ZFS, removing the "bolt-on"
aspect of things. That leads to a lot of maintenance work over time,
because every time the ZFS code changes (and it does, quite a bit) you
have to go back through the same exercise to stay coherent with
Illumos.

The PR above resolved (completely) the issues I was having, along with
those of a number of other people, on 10.x and before (I've not yet
rolled it forward to 11.), but it's quite clearly a hack of sorts, in
that it detects and treats symptoms (e.g. dynamic TX cache size
modification, etc.) rather than integrating VM and ZFS cache
management.

-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
