Date: Tue, 18 Mar 2014 19:45:13 +0200
From: Andriy Gapon <avg@FreeBSD.org>
To: Karl Denninger <karl@denninger.net>
Cc: freebsd-fs@FreeBSD.org
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID: <53288629.60309@FreeBSD.org>
In-Reply-To: <53288024.2060005@denninger.net>
References: <201403181520.s2IFK1M3069036@freefall.freebsd.org> <53288024.2060005@denninger.net>
on 18/03/2014 19:19 Karl Denninger said the following:
>
> On 3/18/2014 10:20 AM, Andriy Gapon wrote:
>> The following reply was made to PR kern/187594; it has been noted by GNATS.
>>
>> From: Andriy Gapon <avg@FreeBSD.org>
>> To: bug-followup@FreeBSD.org, karl@fs.denninger.net
>> Cc:
>> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
>> Date: Tue, 18 Mar 2014 17:15:05 +0200
>>
>> Karl Denninger <karl@fs.denninger.net> wrote:
>> > ZFS can be convinced to engage in pathological behavior due to a bad low-memory test in arc.c.
>> >
>> > The offending file is /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly checks for 25% free memory and, if there is less, asks for the cache to shrink.
>> >
>> > (snippet from around line 2494 of arc.c in 10-STABLE; path /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>> >
>> > #else	/* !sun */
>> > 	if (kmem_used() > (kmem_size() * 3) / 4)
>> > 		return (1);
>> > #endif	/* sun */
>> >
>> > Unfortunately these two functions do not return what the authors thought they did.  It's clear what they're trying to do from the Solaris-specific code above this test.
>>
>> No, these functions do return what the authors think they do.  The check is for KVA usage (kernel virtual address space), not for physical memory.
>
> I understand, but that's nonsensical in the context of the Solaris code.  "lotsfree" is *not* a declaration of free kvm space, it's a declaration of when the system has "lots" of free *physical* memory.

No, it's not nonsensical.  The replacement for the lotsfree stuff is vm_paging_needed(); the kmem_* stuff is the replacement for the vmem_* stuff in the Solaris code.

> Further, it makes no sense at all to allow the ARC cache to force things into virtual (e.g. swap-space backed) memory.

It seems like you don't have a proper understanding of what kernel virtual memory is.  That makes the conversation harder.

> But that's the behavior that has been observed, and it fits with the code as originally written.
>
>> > The result is that the cache only shrinks when vm_paging_needed() tests true, but by that time the system is in serious memory trouble and by
>>
>> No, it is not.  The description and numbers here are a little bit outdated but they should give an idea of how paging works in general:
>> https://wiki.freebsd.org/AvgPageoutAlgorithm
>>
>> > triggering only there it actually drives the system further into paging,
>>
>> How does ARC eviction drive the system further into paging?
>
> 1. The system gets low on physical memory, but the ARC cache is looking at available kvm (of which there is plenty).  The ARC cache continues to expand.
>
> 2. vm_paging_needed() returns true and the system begins to page off to the swap.  At the same time the ARC cache is pared down because arc_reclaim_needed has returned "1".

Except that the ARC is supposed to be evicted before the page daemon does anything.

> 3. As the ARC cache shrinks and paging occurs, vm_paging_needed() returns false.  Paging out ceases but inactive pages remain on the swap.  They are not recalled until and unless they are scheduled to execute.  arc_reclaim_needed again returns "0".
>
> 4. The hold-down timer expires in the ARC cache code ("arc_grow_retry", declared as 60 seconds) and the ARC cache begins to expand again.
>
> Go back to #2 until the system's performance deteriorates badly enough due to the paging that you notice it, which occurs when something that is actually consuming CPU time has to be called in from swap.
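[Editorial aside: the disagreement above is about which resource the FreeBSD check actually measures -- kernel virtual address space (kmem) versus physical RAM.  A small userland sketch like the one below can watch both at once, alongside the ARC size, while the machine is under load.  The sysctl names are the ones I believe FreeBSD 10.x exposes (vm.stats.vm.*, vm.kmem_size, kstat.zfs.misc.arcstats.size, hw.pagesize); treat them as assumptions to verify on your branch.]

/*
 * Minimal observation sketch (not part of the patch under discussion):
 * prints physical-memory and kernel-virtual-memory figures side by side so
 * the difference between "free RAM" and the kmem/KVA budget can be watched
 * while the ARC grows.  Build: cc -o memwatch memwatch.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdint.h>
#include <stdio.h>

/* Read a numeric sysctl that may be 32- or 64-bit wide; 0 if unavailable. */
static uint64_t
get64(const char *name)
{
	uint64_t v64 = 0;
	uint32_t v32 = 0;
	size_t len = sizeof(v64);

	if (sysctlbyname(name, &v64, &len, NULL, 0) == 0 && len == sizeof(v64))
		return (v64);
	len = sizeof(v32);
	if (sysctlbyname(name, &v32, &len, NULL, 0) == 0)
		return (v32);
	return (0);
}

int
main(void)
{
	uint64_t pagesize = get64("hw.pagesize");
	uint64_t pages    = get64("vm.stats.vm.v_page_count");
	uint64_t freep    = get64("vm.stats.vm.v_free_count");
	uint64_t kmem     = get64("vm.kmem_size");	/* KVA sized for kmem */
	uint64_t arc      = get64("kstat.zfs.misc.arcstats.size");

	printf("physical: %ju MB total, %ju MB free\n",
	    (uintmax_t)(pages * pagesize >> 20),
	    (uintmax_t)(freep * pagesize >> 20));
	printf("kmem_size (KVA budget): %ju MB, ARC size: %ju MB\n",
	    (uintmax_t)(kmem >> 20), (uintmax_t)(arc >> 20));
	return (0);
}

[Run it periodically, e.g. from a shell loop, while the ARC expands: it shows whether free RAM or free KVA is the quantity that is actually shrinking when the behavior described in steps 1-4 occurs.]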
>
> This is consistent with what I and others have observed on both 9.2 and 10.0; the ARC will expand until it hits the configured maximum, even at the expense of forcing pages onto the swap.  In this specific machine's case, left to the defaults it will grab nearly all physical memory (over 20GB of 24) and wire it down.

Well, this does not match my experience from before 10.x times.

> Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it turns out that 16GB is still too much for the workload; it prevents the pathological behavior where system "stalls" happen, but only in the extreme.  It turns out that with the patch in, my ARC cache stabilizes at about 13.5GB during the busiest part of the day, growing to about 16 off-hours.
>
> One of the problems with just limiting it in /boot/loader.conf is that you have to guess, and the system doesn't reasonably adapt to changing memory loads.  The code is clearly intended to do that, but it doesn't end up working that way in practice.
>
>> > because the pager will not recall pages from the swap until they are next executed.  This leads the ARC to try to fill in all the available RAM even though pages have been pushed off onto swap.  Not good.
>>
>> Unused physical memory is a waste.  It is true that ARC tries to use as much memory as it is allowed.  The same applies to the page cache (Active, Inactive).  Memory management is a dynamic system and there are a few competing agents.
>>
> That's true.  However, what the stock code does is force the working set out of memory and into the swap.  The ideal situation is one in which there is no free memory because the cache has sized itself to consume everything *not* necessary for the working set of the processes that are running.  Unfortunately we cannot determine this presciently, because a new process may come along and we do not necessarily know for how long a process that is blocked on an event will remain blocked (e.g. something waiting on network I/O, etc.)
>
> However, it is my contention that you do not want to evict a process that is scheduled to run (or is going to be) in favor of disk cache, because you're defeating yourself by doing so.  The point of the disk cache is to avoid going to the physical disk for I/O, but if you page something out you have ditched a physical I/O for data in favor of having to go to physical disk *twice* -- first to write the paged-out data to swap, and then to retrieve it when it is to be executed.  This also appears to be consistent with what is present for Solaris machines.
>
> From the Sun code:
>
> #ifdef sun
> 	/*
> 	 * take 'desfree' extra pages, so we reclaim sooner, rather than later
> 	 */
> 	extra = desfree;
>
> 	/*
> 	 * check that we're out of range of the pageout scanner.  It starts to
> 	 * schedule paging if freemem is less than lotsfree and needfree.
> 	 * lotsfree is the high-water mark for pageout, and needfree is the
> 	 * number of needed free pages.  We add extra pages here to make sure
> 	 * the scanner doesn't start up while we're freeing memory.
> 	 */
> 	if (freemem < lotsfree + needfree + extra)
> 		return (1);
>
> 	/*
> 	 * check to make sure that swapfs has enough space so that anon
> 	 * reservations can still succeed. anon_resvmem() checks that the
> 	 * availrmem is greater than swapfs_minfree, and the number of reserved
> 	 * swap pages.  We also add a bit of extra here just to prevent
> 	 * circumstances from getting really dire.
> 	 */
> 	if (availrmem < swapfs_minfree + swapfs_reserve + extra)
> 		return (1);
>
> "freemem" is not virtual memory, it's actual memory.  "lotsfree" is the point where the system considers free RAM to be "ample"; "needfree" is the "desperation" point, and "extra" is the margin (presumably for image activation).
>
> The base code on FreeBSD doesn't look at physical memory at all; it looks at kvm space instead.

This is an incorrect statement, as I explained above.  vm_paging_needed() looks at physical memory.

>> It is hard to correctly tune that system using a large hammer such as your patch.  I believe that with your patch ARC will get shrunk to its minimum size in due time.  Active + Inactive will grow to use the memory that you are denying to ARC, driving Free below a threshold, which will reduce ARC.  Repeated enough times this will drive ARC to its minimum.

> I disagree, both in design theory and based on the empirical evidence of actual operation.
>
> First, I don't (ever) want to give memory to the ARC cache that otherwise would go to "active", because any time I do that I'm going to force two page events, which is double the amount of I/O I would take on a cache *miss*, and even with the ARC at minimum I get a reasonable hit percentage.  If I therefore prefer ARC over "active" pages I am going to take *at least* a 200% penalty on physical I/O, and if I get an 80% hit ratio with the ARC at a minimum the penalty is closer to 800%!
>
> For inactive pages it's a bit more complicated, as those may not be reactivated.  However, I am trusting FreeBSD's VM subsystem to demote those that are unlikely to be reactivated to the cache bucket and then to "free", where they are able to be re-used.  This is consistent with what I actually see on a running system -- the "inact" bucket is typically fairly large (often, on a busy machine, close to that of "active") but pages demoted to "cache" don't stay there long -- they either get re-promoted back up or they are freed and go on the free list.
>
> The only time I see "inact" get out of control is when there's a kernel memory leak somewhere (such as what I ran into the other day with the in-kernel NAT subsystem on 10-STABLE).  But that's a bug, and if it happens you're going to get bit anyway.
>
> For example, right now on one of my very busy systems with 24GB of installed RAM and many terabytes of storage across three ZFS pools, I'm seeing 17GB wired, of which 13.5 is ARC cache.  That's the adaptive figure it is currently running at, with a maximum of 22.3 and a minimum of 2.79 (an 8:1 ratio).  The remainder is wired down for other reasons (there's a fairly large Postgres server running on that box, among other things, and it has a big shared buffer declaration -- that's most of the difference).  Cache hit efficiency is currently 97.8%.
>
> Active is 2.26G right now, and inactive is 2.09G.  Both are stable.  Overnight, inactive will drop to about 1.1GB while active will not change all that much, since most of it is postgres and the middleware that talks to it, along with apache, which leaves most of its processes present even when they go idle.  Peak load times are about right now (mid-day), and again when the system is running backups nightly.
>
> Cache is 7448 -- in other words, insignificant.  Free memory is 2.6G.
>
> The tunable is set to 10%, which is almost exactly what free memory is.
> I find that when the system gets under 1G free, transient image activation can drive it into paging, and performance starts to suffer for my particular workload.
>
>> Also, there are a few technical problems with the patch:
>> - you don't need to use the sysctl interface in the kernel; the values you need are available directly, just take a look at e.g. the implementation of vm_paging_needed()
>
> That's easily fixed.  I will look at it.
>
>> - similarly, querying the vfs.zfs.arc_freepage_percent_target value via kernel_sysctlbyname is just bogus; you can use percent_target directly
>
> I did not know whether, during setup of the OID, the value was copied (and thus you had to reference it later on) or the entry simply took the pointer and stashed that.  Easily corrected.
>
>> - you don't need to sum various page counters to get a total count; there is v_page_count
>>
> Fair enough as well.
>
>> Lastly, can you try to test reverting your patch and instead setting vm.lowmem_period=0 ?
>>
> Yes.  By default it's 10; I have not tampered with that default.
>
> Let me do a bit of work and I'll post back with a revised patch.  Perhaps a tunable for percentage free plus a free reserve that acts as a "floor"?  The problem with that is where to put the defaults.  One option would be to grab the total size at init time and compute something similar to what "lotsfree" is for Solaris, allowing that to be tuned with the percentage if desired.  I selected 25% because that's what the original test was expressing, and it should be reasonable for modest RAM configurations.  It's clearly too high for moderately large (or huge) memory machines unless they have a lot of RAM-hungry processes running on them.
>
> The percentage test, however, is an easy knob to twist that is unlikely to severely harm you if you dial it too far in either direction; anyone setting it to zero obviously knows what they're getting into, and if you crank it too high all you end up doing is limiting the ARC to the minimum value.

-- 
Andriy Gapon
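[Editorial aside: to make the proposal at the end of the message concrete, the sketch below shows the shape of the check being discussed -- shrink the ARC when free pages fall below either a percentage target or an absolute floor computed from total RAM at init time, in the spirit of Solaris's "lotsfree".  This is not the actual patch: the function names, the 1/64 floor, and the example numbers are all hypothetical, and only the tunable name vfs.zfs.arc_freepage_percent_target comes from the thread.]

/*
 * Illustrative sketch of a physical-memory-based reclaim test.
 * Build and run: cc -o arc_check_sketch arc_check_sketch.c && ./arc_check_sketch
 */
#include <stdint.h>
#include <stdio.h>

/* Hypothetical tunables, analogous to vfs.zfs.arc_freepage_percent_target. */
static unsigned int freepage_percent_target = 10;	/* reclaim below 10% free */
static uint64_t     freepage_floor;			/* absolute floor, in pages */

/* Would run once at ARC init: derive a "lotsfree"-like floor from RAM size. */
static void
arc_free_target_init(uint64_t page_count)
{
	/* e.g. 1/64th of RAM as a hard floor, purely as an example */
	freepage_floor = page_count / 64;
}

/* Return 1 if the ARC should shrink, based on physical free pages only. */
static int
arc_reclaim_needed_sketch(uint64_t page_count, uint64_t free_count)
{
	uint64_t pct_target = page_count * freepage_percent_target / 100;

	if (free_count < pct_target)
		return (1);
	if (free_count < freepage_floor)
		return (1);
	return (0);
}

int
main(void)
{
	/* 24 GB of 4 KB pages with about 2.6 GB free: roughly the machine described above. */
	uint64_t page_count = 24ULL * 1024 * 1024 * 1024 / 4096;
	uint64_t free_count = 26ULL * 1024 * 1024 * 1024 / 10 / 4096;

	arc_free_target_init(page_count);
	printf("reclaim needed: %d\n",
	    arc_reclaim_needed_sketch(page_count, free_count));
	return (0);
}

[With these numbers free memory sits just above the 10% target, so no reclaim is signalled; dropping free_count below roughly 2.4 GB would flip the result to 1.]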