Date: Sat, 20 Aug 2016 00:34:46 +0300
From: Slawa Olhovchenkov <slw@zxy.spb.ru>
To: Karl Denninger
Cc: freebsd-fs@freebsd.org
Subject: Re: ZFS ARC under memory pressure
Message-ID: <20160819213446.GT8192@zxy.spb.ru>
References: <20160816193416.GM8192@zxy.spb.ru>
 <8dbf2a3a-da64-f7f8-5463-bfa23462446e@FreeBSD.org>
 <20160818202657.GS8192@zxy.spb.ru>
 <20160819201840.GA12519@zxy.spb.ru>

On Fri, Aug 19, 2016 at 03:38:55PM -0500, Karl Denninger wrote:

> On 8/19/2016 15:18, Slawa Olhovchenkov wrote:
> > On Thu, Aug 18, 2016 at 03:31:26PM -0500, Karl Denninger wrote:
> >
> >> On 8/18/2016 15:26, Slawa Olhovchenkov wrote:
> >>> On Thu, Aug 18, 2016 at 11:00:28PM +0300, Andriy Gapon wrote:
> >>>
> >>>> On 16/08/2016 22:34, Slawa Olhovchenkov wrote:
> >>>>> I see issues with the ZFS ARC under memory pressure.
> >>>>> The ZFS ARC size can be dramatically reduced, down to arc_min.
> >>>>>
> >>>>> As I see it, a memory pressure event causes a call to arc_lowmem,
> >>>>> which sets needfree:
> >>>>>
> >>>>> arc.c:arc_lowmem
> >>>>>
> >>>>> needfree = btoc(arc_c >> arc_shrink_shift);
> >>>>>
> >>>>> After this, arc_available_memory returns negative values
> >>>>> (PAGESIZE * (-needfree)) until needfree is zero, no matter how
> >>>>> much memory has already been freed.  needfree is set back to 0 in
> >>>>> arc_reclaim_thread() only when arc_size <= arc_c, i.e. not until
> >>>>> arc_size drops below arc_c (and arc_c is decreased on every loop
> >>>>> iteration).
> >>>>>
> >>>>> arc_c is dropped to its minimum value if arc_size drops fast
> >>>>> enough.
> >>>>>
> >>>>> The current code has no control tied to the initial memory
> >>>>> request.
> >>>>>
> >>>>> As a result, I see needless ARC reclaim, 10x to 100x more than
> >>>>> necessary.
> >>>>>
> >>>>> Can someone check me and comment on this?
> >>>> You might have found a real problem here, but I am short of time
> >>>> right now to properly analyze the issue.  I think that on illumos
> >>>> 'needfree' is a variable that's managed by the virtual memory
> >>>> system and it is akin to our vm_pageout_deficit.  But during the
> >>>> porting it became an artificial value and its handling might be
> >>>> sub-optimal.
> >>> As I see it, the handling is not optimal at all.
> >>> I have created a patch for this sub-optimal handling and am now
> >>> testing it.
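
For reference, the interaction described above looks roughly like this.
This is only a simplified sketch of the relevant arc.c pieces, not a
verbatim excerpt: the function bodies are heavily trimmed, and names
such as arc_shrink()/arc_adjust()/arc_reclaim_thread_exit come from my
reading of the stable code.

    /* vm_lowmem eventhandler: ask for a shrink of arc_c/128 (in pages) */
    static void
    arc_lowmem(void *arg __unused, int howto __unused)
    {
            needfree = btoc(arc_c >> arc_shrink_shift);
    }

    static int64_t
    arc_available_memory(void)
    {
            /*
             * While needfree is non-zero this stays negative, no matter
             * how much memory has been freed since the lowmem event.
             */
            if (needfree > 0)
                    return (PAGESIZE * (-(int64_t)needfree));
            /* ... other checks (free target, kmem fragmentation, ...) */
    }

    static void
    arc_reclaim_thread(void *dummy __unused)
    {
            while (!arc_reclaim_thread_exit) {
                    int64_t free_memory = arc_available_memory();
                    uint64_t evicted;

                    if (free_memory < 0) {
                            arc_kmem_reap_now();
                            /* lowers arc_c and pulls it toward arc_size */
                            arc_shrink((arc_c >> arc_shrink_shift) -
                                free_memory);
                    }
                    evicted = arc_adjust();

                    /* the only place where needfree is cleared */
                    if (arc_size <= arc_c || evicted == 0) {
                            needfree = 0;
                            /* sleep until signalled or for one second */
                    }
                    /* otherwise: loop again immediately */
            }
    }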
> >> You might want to look at the code contained in here:
> >>
> >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594
> > In my case the arc.c issue is caused by revision r286625 in HEAD
> > (and r288562 in STABLE) -- all in 2015, not touched in 2014.
> >
> >> There are some ugly interactions with the VM system you can run
> >> into if you're not careful; I've chased this issue before and while
> >> I haven't yet done the work to integrate it into 11.x (and the
> >> underlying code *has* changed since the 10.x patches I developed)
> >> if you wind up driving the VM system to evict pages to swap rather
> >> than pare back ARC you're probably making the wrong choice.
> >>
> >> In addition UMA can come into the picture too and (at least
> >> previously) was a severe contributor to pathological behavior.
> > I only do a less aggressive (and more controlled) shrink of the ARC
> > size.  Right now the ARC just collapses.
> >
> > The PR you point to is really BIG.  I can't read and understand all
> > of it.  r286625 changed the behavior of the interaction between the
> > ARC and the VM.  Does your problem still exist?  Can you explain (on
> > the list)?
>
> Essentially ZFS is a "bolt-on" and unlike UFS which uses the unified
> buffer cache (which the VM system manages) ZFS does not.  ARC is
> allocated out of kernel memory and (by default) also uses UMA; the VM
> system is not involved in its management.
>
> When the VM system gets constrained (low memory) it thus cannot tell
> the ARC to pare back.  So when the VM system gets low on RAM it will
> start

Currently the VM generates an event, and the ARC listens for this event
and handles it in arc.c:arc_lowmem().

> to page.  The problem with this is that if the VM system is low on
> RAM because the ARC is consuming memory you do NOT want to page, you
> want to evict some of the ARC.

Right now, on a `lowmem' event the ARC tries to evict 1/128 of the ARC.

> Unfortunately the VM system has another interaction that causes
> trouble too.  The VM system will "demote" a page to inactive or cache
> status but not actually free it.  It only starts to go through those
> pages and free them when the vm system wakes up, and that only happens
> when free space gets low enough to trigger it.
>
> Finally, there's another problem that comes into play; UMA.  Kernel
> memory allocation is fairly expensive.  UMA grabs memory from the
> kernel allocation system in big chunks and manages it, and by doing so
> gains a pretty-significant performance boost.  But this means that you
> can have large amounts of RAM that are allocated, not in use, and yet
> the VM system cannot reclaim them on its own.  The ZFS code has to
> reap those caches, but reaping them is a moderately expensive
> operation too, thus you don't want to do it unnecessarily.

Not sure, but some code in ZFS may already handle this:
arc.c:arc_kmem_reap_now().
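
As far as I can see, arc_kmem_reap_now() walks the ZFS kmem caches and
asks each one to give back its cached-but-unused items.  Roughly (cache
names and details differ between versions, this is only the general
shape, not the real function):

    static void
    arc_kmem_reap_now(void)
    {
            size_t i;
            extern kmem_cache_t *zio_buf_cache[];
            extern kmem_cache_t *zio_data_buf_cache[];

            /*
             * Drain the per-size zio buffer caches (the real code skips
             * duplicate cache pointers, since adjacent sizes can share
             * a cache).
             */
            for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
                    kmem_cache_reap_now(zio_buf_cache[i]);
                    kmem_cache_reap_now(zio_data_buf_cache[i]);
            }
            /* ... plus the ARC header/buffer caches and others */
    }

If I read the compat shims right, kmem_cache_reap_now() on FreeBSD ends
up draining the backing UMA zone, which is why reaping is not free: it
hands the cached slabs back to the VM page allocator.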
> I've not yet gone through the 11.x code to see what changed from
> 10.x; what I do know is that it is materially better-behaved than it
> used to be, in that prior to 11.x I would have (by now) pretty much
> been forced into rolling that forward and testing it because the
> misbehavior in one of my production systems was severe enough to
> render it basically unusable without the patch in that PR inline, with
> the most-serious misbehavior being paging-induced stalls that could
> reach 10s of seconds or more in duration.
>
> 11.x hasn't exhibited the severe problems, unpatched, that 10.x was
> known to do on my production systems -- but it is far less than great
> in that it sure as heck does have UMA coherence issues.....
>
> ARC Size:                               38.58%  8.61    GiB
>         Target Size: (Adaptive)         70.33%  15.70   GiB
>         Min Size (Hard Limit):          12.50%  2.79    GiB
>         Max Size (High Water):          8:1     22.32   GiB
>
> I have 20GB out in kernel memory on this machine right now but only
> 8.6 of it in ARC; the rest is (mostly) sitting in UMA
> allocated-but-unused -- so despite the belief expressed by some that
> the 11.x code is "better" at reaping UMA I'm sure not seeing it here.

I see. In my case:

ARC Size:                               79.65%  98.48   GiB
        Target Size: (Adaptive)         79.60%  98.42   GiB
        Min Size (Hard Limit):          12.50%  15.46   GiB
        Max Size (High Water):          8:1     123.64  GiB

System Memory:

        2.27%   2.83    GiB Active,     9.58%   11.94   GiB Inact
        86.34%  107.62  GiB Wired,      0.00%   0 Cache
        1.80%   2.25    GiB Free,       0.00%   0 Gap

        Real Installed:                         128.00  GiB
        Real Available:                 99.96%  127.95  GiB
        Real Managed:                   97.41%  124.64  GiB

        Logical Total:                          128.00  GiB
        Logical Used:                   88.92%  113.81  GiB
        Logical Free:                   11.08%  14.19   GiB

Kernel Memory:                                  758.25  MiB
        Data:                           97.81%  741.61  MiB
        Text:                           2.19%   16.64   MiB

Kernel Memory Map:                              124.64  GiB
        Size:                           81.84%  102.01  GiB
        Free:                           18.16%  22.63   GiB

Mem: 2895M Active, 12G Inact, 108G Wired, 528K Buf, 2303M Free
ARC: 98G Total, 89G MFU, 9535M MRU, 35M Anon, 126M Header, 404M Other
Swap: 32G Total, 394M Used, 32G Free, 1% Inuse

Is this 12G Inact the 'UMA allocated-but-unused' you describe?  It may
also be freed-but-not-reclaimed network bufs.

> I'll get around to rolling forward and modifying that PR since that
> particular bit of jackassery with UMA is a definite performance
> problem.  I suspect a big part of what you're seeing lies there as
> well.  When I do get that code done and tested I suspect it may solve
> your problems as well.

No.  My problem is completely different: under memory pressure, after
arc_lowmem() sets needfree to a non-zero value, arc_reclaim_thread()
starts to shrink the ARC.  But arc_reclaim_thread() (in the FreeBSD
case) does not correctly control this process, and the shrink stops at
a random point (whenever arc_size <= arc_c after the next iteration),
mostly after dropping to Min Size (Hard Limit).  I just restore control
over the shrink process.
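
For illustration only (this is not the exact diff, just the general
shape of such a change): credit what has already been evicted against
the lowmem request and clear needfree once that request is satisfied,
instead of waiting for arc_size to fall under an ever-decreasing arc_c.

    /* in arc_reclaim_thread(), after arc_adjust() -- illustrative only */
    evicted = arc_adjust();

    if (evicted > 0) {
            /* count evicted bytes against the pages asked for by lowmem */
            if (btoc(evicted) >= needfree)
                    needfree = 0;
            else
                    needfree -= btoc(evicted);
    }

    if (needfree == 0 || evicted == 0) {
            needfree = 0;
            /* done: wake any waiters and go back to sleep */
    }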