Date: Fri, 19 Aug 2016 16:52:00 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: ZFS ARC under memory pressure
Message-ID: <05ba785a-c86f-1ec8-fcf3-71d22551f4f3@denninger.net>
In-Reply-To: <20160819213446.GT8192@zxy.spb.ru>
References: <20160816193416.GM8192@zxy.spb.ru>
 <8dbf2a3a-da64-f7f8-5463-bfa23462446e@FreeBSD.org>
 <20160818202657.GS8192@zxy.spb.ru>
 <c3bc6c5a-961c-e3a4-2302-f0f7417bc34f@denninger.net>
 <20160819201840.GA12519@zxy.spb.ru>
 <bcb14d0b-bd6d-cb93-ea71-3656cfce8b3b@denninger.net>
 <20160819213446.GT8192@zxy.spb.ru>
On 8/19/2016 16:34, Slawa Olhovchenkov wrote:
> On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:
>
>> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
>>> On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
>>>
>>>> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
>>>>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
>>>>>
>>>>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
>>>>>>> I see issues with ZFS ARC under memory pressure.
>>>>>>> ZFS ARC size can be dramatically reduced, up to arc_min.
>>>>>>>
>>>>>>> As I see memory pressure event cause call arc_lowmem and set needfree:
>>>>>>>
>>>>>>> arc.c:arc_lowmem
>>>>>>>
>>>>>>> needfree = btoc(arc_c >> arc_shrink_shift);
>>>>>>>
>>>>>>> After this, arc_available_memory return negative values (PAGESIZE *
>>>>>>> (-needfree)) until needfree is zero. Independent how too much memory
>>>>>>> freed. needfree set to 0 in arc_reclaim_thread(), when arc_size <=
>>>>>>> arc_c. Until arc_size don't drop below arc_c (arc_c decreased at every
>>>>>>> loop iteration).
>>>>>>>
>>>>>>> arc_c dropped to minimum value if arc_size fast enough dropped.
>>>>>>>
>>>>>>> No control current to initial memory allocation.
>>>>>>>
>>>>>>> As result, I can see needless arc reclaim, from 10x to 100x times.
>>>>>>>
>>>>>>> Can some one check me and comment this?
>>>>>> You might have found a real problem here, but I am short of time right now to
>>>>>> properly analyze the issue. I think that on illumos 'needfree' is a variable
>>>>>> that's managed by the virtual memory system and it is akin to our
>>>>>> vm_pageout_deficit. But during the porting it became an artificial value and
>>>>>> its handling might be sub-optimal.
>>>>> As I see, totally not optimal.
>>>>> I am create some patch for sub-optimal handling and now test it.
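
The needfree arithmetic quoted above can be checked with a short Python sketch. It is not the arc.c code. The 4 KiB page size, the round-up behaviour of btoc, and the helper names are assumptions made for illustration. The shift of 7 matches the "1/128 of ARC" figure given elsewhere in this thread, and 98 GiB roughly matches the ARC target size Slawa reports.

```python
# Illustrative only: models the arithmetic described in the thread,
# not the FreeBSD kernel implementation.
PAGE_SIZE = 4096        # assumption: 4 KiB pages
ARC_SHRINK_SHIFT = 7    # 1 / 128, per "evict 1/128 of ARC" in the thread
GIB = 1 << 30


def btoc(nbytes, page_size=PAGE_SIZE):
    """Bytes to pages, rounding up (assumed behaviour of the btoc macro)."""
    return (nbytes + page_size - 1) // page_size


def needfree_pages(arc_c):
    """Pages requested on a lowmem event: btoc(arc_c >> arc_shrink_shift)."""
    return btoc(arc_c >> ARC_SHRINK_SHIFT)


def available_memory(needfree):
    """Value reported while needfree is non-zero: PAGESIZE * (-needfree)."""
    return PAGE_SIZE * (-needfree)


if __name__ == "__main__":
    pages = needfree_pages(98 * GIB)
    print("needfree pages:", pages)
    print("reported available bytes:", available_memory(pages))
```

With a 98 GiB target, a single lowmem event asks for 784 MiB, and the reported deficit stays at that value no matter how much has already been freed, which is the behaviour Slawa is objecting to.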
>>>> You might want to look at the code contained in here:
>>>>
>>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
>>> In my case arc.c issue caused by revision r286625 in HEAD (and
>>> r288562 in STABLE) -- all in 2015, not touch in 2014.
>>>
>>>> There are some ugly interactions with the VM system you can run into if
>>>> you're not careful; I've chased this issue before and while I haven't
>>>> yet done the work to integrate it into 11.x (and the underlying code
>>>> *has* changed since the 10.x patches I developed) if you wind up driving
>>>> the VM system to evict pages to swap rather than pare back ARC you're
>>>> probably making the wrong choice.
>>>>
>>>> In addition UMA can come into the picture too and (at least previously)
>>>> was a severe contributor to pathological behavior.
>>> I am only do less aggressive (and more controlled) shrink of ARC size.
>>> Now ARC just collapsed.
>>>
>>> Pointed PR is really BIG. I am can't read and understand all of this.
>>> r286625 change behavior of interaction between ARC and VM.
>>> You problem still exist? Can you explain (in list)?
>>>
>> Essentially ZFS is a "bolt-on" and unlike UFS which uses the unified
>> buffer cache (which the VM system manages) ZFS does not. ARC is
>> allocated out of kernel memory and (by default) also uses UMA; the VM
>> system is not involved in its management.
>>
>> When the VM system gets constrained (low memory) it thus cannot tell the
>> ARC to pare back. So when the VM system gets low on RAM it will start
> Currently VM generate event and ARC listen for this event, handle it
> by arc.c:arc_lowmem().
>
>> to page.
>> The problem with this is that if the VM system is low on RAM
>> because the ARC is consuming memory you do NOT want to page, you want to
>> evict some of the ARC.
> Now by event `lowmem` ARC try to evict 1/128 of ARC.
>
>> Unfortunately the VM system has another interaction that causes trouble
>> too. The VM system will "demote" a page to inactive or cache status but
>> not actually free it. It only starts to go through those pages and free
>> them when the vm system wakes up, and that only happens when free space
>> gets low enough to trigger it.
>
>> Finally, there's another problem that comes into play; UMA. Kernel
>> memory allocation is fairly expensive. UMA grabs memory from the kernel
>> allocation system in big chunks and manages it, and by doing so gains a
>> pretty-significant performance boost. But this means that you can have
>> large amounts of RAM that are allocated, not in use, and yet the VM
>> system cannot reclaim them on its own. The ZFS code has to reap those
>> caches, but reaping them is a moderately expensive operation too, thus
>> you don't want to do it unnecessarily.
> Not sure, but some code in ZFS may be handle this.
> arc.c:arc_kmem_reap_now().
> Not sure.
>
>> I've not yet gone through the 11.x code to see what changed from 10.x;
>> what I do know is that it is materially better-behaved than it used to
>> be, in that prior to 11.x I would have (by now) pretty much been forced
>> into rolling that forward and testing it because the misbehavior in one
>> of my production systems was severe enough to render it basically
>> unusable without the patch in that PR inline, with the most-serious
>> misbehavior being paging-induced stalls that could reach 10s of seconds
>> or more in duration.
>>
>> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
>> known to do on my production systems -- but it is far less than great in
>> that it sure as heck does have UMA coherence issues.....
>>
>> ARC Size:                   38.58%    8.61 GiB
>>   Target Size: (Adaptive)   70.33%   15.70 GiB
>>   Min Size (Hard Limit):    12.50%    2.79 GiB
>>   Max Size (High Water):    8:1      22.32 GiB
>>
>> I have 20GB out in kernel memory on this machine right now but only 8.6
>> of it in ARC; the rest is (mostly) sitting in UMA allocated-but-unused
>> -- so despite the belief expressed by some that the 11.x code is
>> "better" at reaping UMA I'm sure not seeing it here.
> I see.
> In my case:
>
> ARC Size:                   79.65%   98.48 GiB
>   Target Size: (Adaptive)   79.60%   98.42 GiB
>   Min Size (Hard Limit):    12.50%   15.46 GiB
>   Max Size (High Water):    8:1     123.64 GiB
>
> System Memory:
>
>    2.27%    2.83 GiB Active,    9.58%   11.94 GiB Inact
>   86.34%  107.62 GiB Wired,     0.00%   0 Cache
>    1.80%    2.25 GiB Free,      0.00%   0 Gap
>
>   Real Installed:                    128.00 GiB
>   Real Available:           99.96%   127.95 GiB
>   Real Managed:             97.41%   124.64 GiB
>
>   Logical Total:                     128.00 GiB
>   Logical Used:             88.92%   113.81 GiB
>   Logical Free:             11.08%    14.19 GiB
>
> Kernel Memory:                       758.25 MiB
>   Data:                     97.81%   741.61 MiB
>   Text:                      2.19%    16.64 MiB
>
> Kernel Memory Map:                   124.64 GiB
>   Size:                     81.84%   102.01 GiB
>   Free:                     18.16%    22.63 GiB
>
> Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
> ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
> Swap: 32G Total, 394M Used, 32G Free, 1% Inuse
>
> Is this 12G Inactive as 'UMA allocated-but-unused'?
> This is also may be freed but not reclaimed network bufs.
>
>> I'll get around to rolling forward and modifying that PR since that
>> particular bit of jackassery with UMA is a definite performance
>> problem. I suspect a big part of what you're seeing lies there as
>> well. When I do get that code done and tested I suspect it may solve
>> your problems as well.
> No. My problem is completely different: under memory pressure, after arc_lowmem()
> set needfree to non-zero arc_reclaim_thread() start to shrink ARC.
> But arc_reclaim_thread (in FreeBSD case) don't correctly control this process
> and shrink stopped in random time (when after next iteration arc_size <= arc_c),
> mostly after drop to Min Size (Hard Limit).
>
> I am just restore control of shrink process.

Not quite, because of the UMA issue, among other things. There's also a
potential "stall" issue, related to dirty_max sizing, that can arise,
especially if you are using rotating media. The PR patch also scaled
that back dynamically under memory pressure, which eliminated the issue.

I won't have time to look at this on my test machine for at least
another week, as I'm unfortunately buried with unrelated work at
present, but I should be able to put some effort into this within the
next couple of weeks and see if I can quickly roll forward the important
parts of the previous PR patch.

I think you'll find that it stops the behavior you're seeing. I'm just
pointing out that this was more complex internally than it first
appeared in the 10.x branch, and given the symptoms you're describing I
have no reason to believe the interactions that lead to bad behavior are
not still in play.
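
To make the shrink behavior under discussion concrete, here is a toy Python model of the loop as Slawa describes it: on every pass the target arc_c is lowered by arc_c >> arc_shrink_shift (never below arc_min), some data is evicted, and the loop ends only once arc_size <= arc_c. This is a sketch of one reading of the thread, not the arc_reclaim_thread() code. The function name, the fixed per-pass eviction rate, and the sample sizes are all assumptions.

```python
# Toy model only: one interpretation of the loop described in the thread,
# not the FreeBSD arc_reclaim_thread() implementation.
GIB = 1 << 30
MIB = 1 << 20


def simulate_reclaim(arc_size, arc_c, arc_min, evict_per_pass, shift=7):
    """Return (arc_size, arc_c, passes) once arc_size <= arc_c.

    evict_per_pass is a hypothetical cap on how many bytes one pass
    can evict; it stands in for however fast eviction really runs.
    """
    if evict_per_pass <= 0:
        raise ValueError("evict_per_pass must be positive")
    passes = 0
    while True:
        passes += 1
        # The target shrinks on every pass, clamped at arc_min.
        arc_c = max(arc_min, arc_c - (arc_c >> shift))
        # Evict toward the target, but no more than one pass allows.
        arc_size -= min(evict_per_pass, max(0, arc_size - arc_c))
        if arc_size <= arc_c:
            return arc_size, arc_c, passes


if __name__ == "__main__":
    start = 98 * GIB
    floor = 15 * GIB
    for rate in (2 * GIB, 256 * MIB):
        size, target, passes = simulate_reclaim(start, start, floor, rate)
        print(f"rate={rate // MIB} MiB/pass: arc_size={size / GIB:.2f} GiB, "
              f"arc_c={target / GIB:.2f} GiB after {passes} passes")
```

In this model, when eviction keeps up with the target the loop ends after a single pass and the ARC loses about 1/128 of its size. When eviction lags, the target keeps falling ahead of arc_size until it hits arc_min, and the ARC ends up at the hard limit, which matches the collapse Slawa reports. The real code has more inputs (UMA reaping, dirty_max, the VM pager), so this only illustrates the termination condition.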
-- 
Karl Denninger
karl@denninger.net <mailto:karl@denninger.net>
/The Market Ticker/
/[S/MIME encrypted email preferred]/
