Date: Fri, 19 Aug 2016 15:38:55 -0500
From: Karl Denninger <karl@denninger.net>
To: Slawa Olhovchenkov <slw@zxy.spb.ru>, freebsd-fs@freebsd.org
Subject: Re: ZFS ARC under memory pressure
Message-ID: <bcb14d0b-bd6d-cb93-ea71-3656cfce8b3b@denninger.net>
In-Reply-To: <20160819201840.GA12519@zxy.spb.ru>
References: <20160816193416.GM8192@zxy.spb.ru>
 <8dbf2a3a-da64-f7f8-5463-bfa23462446e@FreeBSD.org>
 <20160818202657.GS8192@zxy.spb.ru>
 <c3bc6c5a-961c-e3a4-2302-f0f7417bc34f@denninger.net>
 <20160819201840.GA12519@zxy.spb.ru>
On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>
>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>
>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>> I see issues with the ZFS ARC under memory pressure.
>>>>> The ARC size can be dramatically reduced, all the way down to
>>>>> arc_min.
>>>>>
>>>>> As I see it, a memory pressure event causes a call to arc_lowmem,
>>>>> which sets needfree:
>>>>>
>>>>> arc.c:arc_lowmem
>>>>>
>>>>>     needfree = btoc(arc_c >> arc_shrink_shift);
>>>>>
>>>>> After this, arc_available_memory returns negative values
>>>>> (PAGESIZE * (-needfree)) until needfree is zero, regardless of
>>>>> how much memory has already been freed.  needfree is only set
>>>>> back to 0 in arc_reclaim_thread(), and only once arc_size <=
>>>>> arc_c, while arc_c itself is decreased on every loop iteration
>>>>> until arc_size drops below it.
>>>>>
>>>>> If arc_size keeps pace with the falling target, arc_c is driven
>>>>> all the way down to its minimum value.
>>>>>
>>>>> There is no check of the current state against the initial
>>>>> memory demand.
>>>>>
>>>>> As a result, I can see needless ARC reclaim, 10x to 100x more
>>>>> than necessary.
>>>>>
>>>>> Can someone check me and comment on this?
>>>> You might have found a real problem here, but I am short of time
>>>> right now to properly analyze the issue.  I think that on illumos
>>>> 'needfree' is a variable that's managed by the virtual memory
>>>> system and it is akin to our vm_pageout_deficit.  But during the
>>>> porting it became an artificial value and its handling might be
>>>> sub-optimal.
>>> As I see it, totally not optimal.
>>> I have created a patch for the sub-optimal handling and am now
>>> testing it.
>> You might want to look at the code contained in here:
>>
>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
> In my case the arc.c issue is caused by revision r286625 in HEAD
> (and r288562 in STABLE) -- all in 2015, not touched in 2014.
>
>> There are some ugly interactions with the VM system you can run
>> into if you're not careful; I've chased this issue before and while
>> I haven't yet done the work to integrate it into 11.x (and the
>> underlying code *has* changed since the 10.x patches I developed),
>> if you wind up driving the VM system to evict pages to swap rather
>> than pare back the ARC you're probably making the wrong choice.
>>
>> In addition UMA can come into the picture too and (at least
>> previously) was a severe contributor to pathological behavior.
> I am only doing a less aggressive (and more controlled) shrink of
> the ARC size.  Right now the ARC simply collapses.
>
> The PR you point to is really BIG; I can't read and understand all
> of it.  r286625 changed the behavior of the interaction between the
> ARC and the VM.  Does your problem still exist?  Can you explain (on
> the list)?
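To make the loop Slawa describes concrete, here is a stand-alone toy
simulation of it.  This is my condensed reading of the logic, not the
actual arc.c source; the constants, and the assumption that eviction
lags the falling target slightly, are purely illustrative.

#include <stdint.h>
#include <stdio.h>

#define PAGESIZE	4096UL
#define btoc(b)		((b) / PAGESIZE)	/* bytes to pages */

static uint64_t arc_c = 16ULL << 30;		/* ARC target: 16 GiB */
static const uint64_t arc_c_min = 2ULL << 30;	/* floor: 2 GiB */
static uint64_t arc_size;			/* current ARC size */
static uint64_t needfree;			/* pages demanded by arc_lowmem */
static const int arc_shrink_shift = 7;		/* shrink step: arc_c/128 */

static int64_t
arc_available_memory(void)
{
	/*
	 * Stays negative until needfree clears, no matter how much
	 * memory has actually been freed in the meantime.
	 */
	if (needfree > 0)
		return (-(int64_t)(needfree * PAGESIZE));
	return (0);		/* pretend memory is otherwise fine */
}

int
main(void)
{
	int passes = 0;

	arc_size = arc_c;				/* cache is full */
	needfree = btoc(arc_c >> arc_shrink_shift);	/* arc_lowmem() fires once */

	while (arc_available_memory() < 0) {
		/* arc_reclaim_thread(): cut the target on every pass. */
		uint64_t step = arc_c >> arc_shrink_shift;

		arc_c = (arc_c - step > arc_c_min) ? arc_c - step : arc_c_min;

		/*
		 * Model eviction keeping pace but lagging slightly, so
		 * arc_size never quite catches the falling target until
		 * the target hits the floor.
		 */
		arc_size = arc_c + (arc_c >> 8);
		if (arc_c == arc_c_min)
			arc_size = arc_c;

		/* The only place needfree is ever cleared: */
		if (arc_size <= arc_c)
			needfree = 0;
		passes++;
	}
	printf("%d passes; arc_c collapsed from 16 GiB to %ju GiB\n",
	    passes, (uintmax_t)(arc_c >> 30));
	return (0);
}

Run as-is, that takes a few hundred passes and exits with arc_c pinned
at arc_c_min: a single low-memory event walks the target all the way to
the floor, which is exactly the collapse described above.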
Essentially ZFS is a "bolt-on": unlike UFS, which uses the unified
buffer cache (which the VM system manages), ZFS does not.  The ARC is
allocated out of kernel memory and (by default) also uses UMA; the VM
system is not involved in its management.  When the VM system gets
constrained (low memory) it thus cannot tell the ARC to pare back.

So when the VM system gets low on RAM it will start to page.  The
problem with this is that if the VM system is low on RAM *because the
ARC is consuming memory*, you do NOT want to page; you want to evict
some of the ARC.

Consider this: ARC data *at best* prevents one I/O.  That is, if there
is data in the cache when you go to read from disk, you avoid one I/O
per unit of data in the ARC that you didn't have to read.  Paging
*always* requires one I/O (to write the page(s) to swap) and MAY
involve two (to later page them back in).  It is never a "win" to
spend a *guaranteed* I/O when you can instead act in a way that only
*might* cause you to (later) need to execute one.

Unfortunately the VM system has another interaction that causes
trouble too.  The VM system will "demote" a page to inactive or cache
status but not actually free it.  It only starts to go through those
pages and free them when the VM system wakes up, and that only happens
when free space gets low enough to trigger it.

Finally, there's another problem that comes into play: UMA.  Kernel
memory allocation is fairly expensive.  UMA grabs memory from the
kernel allocation system in big chunks and manages it, and by doing so
gains a pretty significant performance boost.  But this means that you
can have large amounts of RAM that are allocated, not in use, and yet
the VM system cannot reclaim them on its own.  The ZFS code has to
reap those caches, but reaping them is a moderately expensive
operation too, so you don't want to do it unnecessarily.

I've not yet gone through the 11.x code to see what changed from 10.x.
What I do know is that 11.x is materially better-behaved than it used
to be; prior to 11.x I would by now have been forced into rolling that
patch forward and testing it, because the misbehavior on one of my
production systems was severe enough to render the machine basically
unusable without the patch in that PR applied, the most-serious
misbehavior being paging-induced stalls that could reach 10s of
seconds or more in duration.  11.x hasn't exhibited the severe
problems, unpatched, that 10.x was known to show on my production
systems -- but it is far less than great, in that it sure as heck does
have UMA coherence issues:

ARC Size:                               38.58%  8.61    GiB
        Target Size: (Adaptive)         70.33%  15.70   GiB
        Min Size (Hard Limit):          12.50%  2.79    GiB
        Max Size (High Water):          8:1     22.32   GiB

I have 20GB out in kernel memory on this machine right now but only
8.6GB of it is in the ARC; the rest is (mostly) sitting in UMA,
allocated but unused.  So despite the belief expressed by some that
the 11.x code is "better" at reaping UMA, I'm sure not seeing it here.

I'll get around to rolling forward and modifying that PR, since that
particular bit of jackassery with UMA is a definite performance
problem.  I suspect a big part of what you're seeing lies there as
well, and when I do get that code done and tested I suspect it may
solve your problems too.
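For what it's worth, the shape of the gating I have in mind is
sketched below.  The names, the numbers, and the 1/8 threshold are all
invented for illustration; this is not the code from the PR.  The
point is only that the expensive kernel-side reap (uma_reclaim() in
that era's kernel) should be conditional on measured slack rather than
done on every reclaim pass.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustrative decision helper: reap only when the slack (memory UMA
 * holds but is not actually using) exceeds a chosen fraction of the
 * ARC.  In the kernel the reap itself is the costly part, so the aim
 * is to make it conditional, not periodic.
 */
static bool
should_reap_uma(uint64_t uma_allocated, uint64_t uma_used,
    uint64_t arc_size)
{
	uint64_t slack = uma_allocated - uma_used;

	/* 1/8 of ARC is an arbitrary illustrative threshold. */
	return (slack > arc_size / 8);
}

int
main(void)
{
	/*
	 * Numbers shaped like the machine above: ~20 GiB wired in
	 * kernel memory, call it 8 GiB of that actually in the ARC.
	 */
	uint64_t uma_allocated = 20ULL << 30;
	uint64_t uma_used = 9ULL << 30;
	uint64_t arc_size = 8ULL << 30;

	printf("reap UMA now? %s\n",
	    should_reap_uma(uma_allocated, uma_used, arc_size) ?
	    "yes" : "no");
	return (0);
}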
-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
