Date: Tue, 18 Mar 2014 12:19:32 -0500
From: Karl Denninger <karl@denninger.net>
To: avg@FreeBSD.org
Cc: freebsd-fs@freebsd.org
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID: <53288024.2060005@denninger.net>
In-Reply-To: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
References: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
On 3/18/2014 10:20 AM, Andriy Gapon wrote:
> The following reply was made to PR kern/187594; it has been noted by GNATS.
>
> From: Andriy Gapon <avg@FreeBSD.org>
> To: bug-followup@FreeBSD.org, karl@fs.denninger.net
> Cc:
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> Date: Tue, 18 Mar 2014 17:15:05 +0200
>
> Karl Denninger <karl@fs.denninger.net> wrote:
> > ZFS can be convinced to engage in pathological behavior due to a bad
> > low-memory test in arc.c
> >
> > The offending file is at
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
> > checks for 25% free memory, and if there is less, asks for the cache to shrink.
> >
> > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
> >
> >    #else  /* !sun */
> >    if (kmem_used() > (kmem_size() * 3) / 4)
> >            return (1);
> >    #endif  /* sun */
> >
> > Unfortunately these two functions do not return what the authors thought
> > they did.  It's clear what they're trying to do from the Solaris-specific
> > code above this test.
>
> No, these functions do return what the authors think they do.
> The check is for KVA usage (kernel virtual address space), not for
> physical memory.

I understand, but that's nonsensical in the context of the Solaris code.
"lotsfree" is *not* a declaration of free KVA space; it's a declaration of
when the system has "lots" of free *physical* memory.

Further, it makes no sense at all to allow the ARC cache to force things
into virtual (e.g. swap-space-backed) memory.  But that's the behavior that
has been observed, and it fits with the code as originally written.

> > The result is that the cache only shrinks when vm_paging_needed() tests
> > true, but by that time the system is in serious memory trouble and by
>
> No, it is not.
> The description and numbers here are a little bit outdated but they should
> give an idea of how paging works in general:
> https://wiki.freebsd.org/AvgPageoutAlgorithm
>
> > triggering only there it actually drives the system further into paging,
>
> How does ARC eviction drive the system further into paging?

1. The system gets low on physical memory, but the ARC cache is looking at
   available KVA (of which there is plenty).  The ARC cache continues to
   expand.

2. vm_paging_needed() returns true and the system begins to page off to
   swap.  At the same time the ARC cache is pared down, because
   arc_reclaim_needed has returned "1".

3. As the ARC cache shrinks and paging occurs, vm_paging_needed() returns
   false.  Paging out ceases, but the inactive pages remain on the swap.
   They are not recalled until and unless they are scheduled to execute.
   arc_reclaim_needed again returns "0".

4. The hold-down timer expires in the ARC cache code ("arc_grow_retry",
   declared as 60 seconds) and the ARC cache begins to expand again.

Go back to #2 and repeat, until the system's performance deteriorates badly
enough due to the paging that you notice it, which occurs when something
that is actually consuming CPU time has to be called in from swap.

This is consistent with what I and others have observed on both 9.2 and
10.0: the ARC will expand until it hits the configured maximum, even at the
expense of forcing pages onto the swap.  In this specific machine's case,
left at the defaults it will grab nearly all physical memory (over 20GB of
24) and wire it down.

Limiting arc_max to 16GB sorta fixes it.
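As an aside, if anyone wants to watch this cycle happen in real time,
something along the following lines is enough.  This is an illustration
only, not part of the patch; the sysctl names are the stock FreeBSD/ZFS
ones as far as I know, and the one-second sample interval is arbitrary.

    #include <sys/types.h>
    #include <sys/sysctl.h>

    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Fetch a fixed-size sysctl value or bail out. */
    static void
    fetch(const char *name, void *buf, size_t len)
    {
            if (sysctlbyname(name, buf, &len, NULL, 0) != 0)
                    err(1, "sysctlbyname(%s)", name);
    }

    int
    main(void)
    {
            uint64_t arc;
            u_int freepg, swappedout;

            for (;;) {
                    fetch("kstat.zfs.misc.arcstats.size", &arc, sizeof(arc));
                    fetch("vm.stats.vm.v_free_count", &freepg, sizeof(freepg));
                    fetch("vm.stats.vm.v_swappgsout", &swappedout,
                        sizeof(swappedout));

                    /* ARC size in MB, free pages, cumulative swap page-outs. */
                    printf("arc=%juMB free=%u pages swapped-out(total)=%u pages\n",
                        (uintmax_t)(arc >> 20), freepg, swappedout);
                    sleep(1);
            }
    }

Run it while the box is busy and you should be able to see the ARC climb,
the free page count collapse, and the swap page-out counter start moving,
in the order described above.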
I say "sorta" because it turns out that 16GB is still too much for the workload; it prevents the pathological behavior where system "stalls" happen but only in the extreme. It turns out with the patch in my ARC cache stabilizes at about 13.5GB during the busiest part of the day, growing to about 16 off-hours. One of the problems with just limiting it in /boot/loader.conf is that you have to guess and the system doesn't reasonably adapt to changing memory loads. The code is clearly intended to do that but it doesn't end up working that way in practice. > > > because the pager will not recall pages from the swap until they are next > > executed. This leads the ARC to try to fill in all the available RAM even > > though pages have been pushed off onto swap. Not good. > > Unused physical memory is a waste. It is true that ARC tries to use as much of > memory as it is allowed. The same applies to the page cache (Active, Inactive). > Memory management is a dynamic system and there are a few competing agents. > That's true. However, what the stock code does is force working set out of memory and into the swap. The ideal situation is one in which there is no free memory because cache has sized itself to consume everything *not* necessary for the working set of the processes that are running. Unfortunately we cannot determine this presciently because a new process may come along and we do not necessarily know for how long a process that is blocked on an event will remain blocked (e.g. something waiting on network I/O, etc.) However, it is my contention that you do not want to evict a process that is scheduled to run (or is going to be) in favor of disk cache because you're defeating yourself by doing so. The point of the disk cache is to avoid going to the physical disk for I/O, but if you page something you have ditched a physical I/O for data in favor of having to go to physical disk *twice* -- first to write the paged-out data to swap, and then to retrieve it when it is to be executed. This also appears to be consistent with what is present for Solaris machines. From the Sun code: #ifdef sun /* * take 'desfree' extra pages, so we reclaim sooner, rather than later */ extra = desfree; /* * check that we're out of range of the pageout scanner. It starts to * schedule paging if freemem is less than lotsfree and needfree. * lotsfree is the high-water mark for pageout, and needfree is the * number of needed free pages. We add extra pages here to make sure * the scanner doesn't start up while we're freeing memory. */ if (freemem < lotsfree + needfree + extra) return (1); /* * check to make sure that swapfs has enough space so that anon * reservations can still succeed. anon_resvmem() checks that the * availrmem is greater than swapfs_minfree, and the number of reserved * swap pages. We also add a bit of extra here just to prevent * circumstances from getting really dire. */ if (availrmem < swapfs_minfree + swapfs_reserve + extra) return (1); "freemem" is not virtual memory, it's actual memory. "Lotsfree" is the point where the system considers free RAM to be "ample"; "needfree" is the "desperation" point and "extra" is the margin (presumably for image activation.) The base code on FreeBSD doesn't look at physical memory at all; it looks at kvm space instead. > It is hard to correctly tune that system using a large hummer such as your > patch. I believe that with your patch ARC will get shrunk to its minimum size > in due time. 
> It is hard to correctly tune that system using a large hammer such as your
> patch.  I believe that with your patch ARC will get shrunk to its minimum
> size in due time.  Active + Inactive will grow to use the memory that you
> are denying to ARC, driving Free below a threshold, which will reduce ARC.
> Repeated enough times this will drive ARC to its minimum.

I disagree, both in design theory and based on the empirical evidence of
actual operation.

First, I don't (ever) want to give memory to the ARC cache that otherwise
would go to "active", because any time I do that I'm going to force two
page events, which is double the amount of I/O I would take on a cache
*miss*, and even with the ARC at minimum I get a reasonable hit percentage.
If I therefore prefer ARC over "active" pages I am going to take *at least*
a 200% penalty on physical I/O, and if I get an 80% hit ratio with the ARC
at a minimum the penalty is closer to 800%!

For inactive pages it's a bit more complicated, as those may not be
reactivated.  However, I am trusting FreeBSD's VM subsystem to demote those
that are unlikely to be reactivated to the cache bucket and then to "free",
where they are able to be re-used.  This is consistent with what I actually
see on a running system -- the "inact" bucket is typically fairly large
(often, on a busy machine, close to that of "active"), but pages demoted to
"cache" don't stay there long; they either get re-promoted back up or they
are freed and go on the free list.

The only time I see "inact" get out of control is when there's a kernel
memory leak somewhere (such as what I ran into the other day with the
in-kernel NAT subsystem on 10-STABLE.)  But that's a bug, and if it happens
you're going to get bit anyway.

For example, right now on one of my very busy systems with 24GB of
installed RAM and many terabytes of storage across three ZFS pools, I'm
seeing 17GB wired, of which 13.5GB is ARC cache.  That's the adaptive
figure it is currently running at, with a maximum of 22.3 and a minimum of
2.79 (an 8:1 ratio.)  The remainder is wired down for other reasons
(there's a fairly large Postgres server running on that box, among other
things, and it has a big shared-buffer declaration -- that's most of the
difference.)  Cache hit efficiency is currently 97.8%.

Active is 2.26G right now and inactive is 2.09G.  Both are stable.
Overnight, inactive will drop to about 1.1GB while active will not change
all that much, since most of it is Postgres and the middleware that talks
to it, along with Apache, which leaves most of its processes present even
when they go idle.  Peak load times are about right now (mid-day), and
again when the system is running backups nightly.

Cache is 7448 -- in other words, insignificant.  Free memory is 2.6G.

The tunable is set to 10%, which is almost exactly what free memory is.  I
find that when the system gets under 1G free, transient image activation
can drive it into paging and performance starts to suffer for my particular
workload.

> Also, there are a few technical problems with the patch:
> - you don't need to use sysctl interface in kernel, the values you need
>   are available directly, just take a look at e.g. implementation of
>   vm_paging_needed()

That's easily fixed.  I will look at it.

> - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly

I did not know whether, during setup of the OID, the value was copied (and
thus you had to reference it later on) or the entry simply took the pointer
and stashed that.  Easily corrected.

> - you don't need to sum various page counters to get a total count, there
>   is v_page_count

Fair enough as well.
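So that we are talking about the same direction for the revised patch, here
is a rough kernel-flavoured sketch of my understanding of the suggestions --
not the revised patch itself: read the vmmeter fields directly instead of
going through the sysctl machinery inside the kernel, use v_page_count
rather than summing the individual queues, and keep the existing
vm_paging_needed() test.  It assumes the 10.x "cnt" vmmeter global; the 25%
default and the inclusion of the cache queue are illustrative choices only.

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/vmmeter.h>

    /* The percentage tunable discussed above; 25% mirrors the old check. */
    static int arc_freepage_percent_target = 25;

    static int
    arc_reclaim_needed_sketch(void)
    {
            u_int free_target;

            /* The pageout daemon is already needed: definitely shrink. */
            if (vm_paging_needed())
                    return (1);

            /*
             * Shrink before the pager has to get involved: when free (plus
             * cache-queue) physical pages fall below the requested
             * percentage of all pages.
             */
            free_target = cnt.v_page_count / 100 * arc_freepage_percent_target;
            if (cnt.v_free_count + cnt.v_cache_count < free_target)
                    return (1);

            return (0);
    }

Whether the cache queue should be counted toward "free", and what the
default percentage ought to be, are exactly the knobs I expect the revised
patch to expose.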
> Lastly, can you try to test reverting your patch and instead setting
> vm.lowmem_period=0 ?

Yes.  By default it's 10; I have not tampered with that default.

Let me do a bit of work and I'll post back with a revised patch.  Perhaps a
tunable for percentage free, plus a free reserve that acts as a "floor"?
The problem with that is where to put the defaults.  One option would be to
grab the total size at init time and compute something similar to what
"lotsfree" is for Solaris, allowing that to be tuned with the percentage if
desired.  I selected 25% because that's what the original test was
expressing, and it should be reasonable for modest RAM configurations.
It's clearly too high for moderately large (or huge) memory machines unless
they have a lot of RAM-hungry processes running on them.

The percentage test, however, is an easy knob to twist that is unlikely to
severely harm you if you dial it too far in either direction; anyone
setting it to zero obviously knows what they're getting into, and if you
crank it too high all you end up doing is limiting the ARC to the minimum
value.

-- 
-- Karl
karl@denninger.net
