Date: Tue, 18 Mar 2014 12:19:32 -0500
From: Karl Denninger <karl@denninger.net>
To: avg@FreeBSD.org
Cc: freebsd-fs@freebsd.org
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID: <53288024.2060005@denninger.net>
In-Reply-To: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
References: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
On 3/18/2014 10:20 AM, Andriy Gapon wrote:
> The following reply was made to PR kern/187594; it has been noted by GNATS.
>
> From: Andriy Gapon <avg@FreeBSD.org>
> To: bug-followup@FreeBSD.org, karl@fs.denninger.net
> Cc:
> Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
> Date: Tue, 18 Mar 2014 17:15:05 +0200
>
> Karl Denninger <karl@fs.denninger.net> wrote:
> > ZFS can be convinced to engage in pathological behavior due to a bad
> > low-memory test in arc.c
> >
> > The offending file is at
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
> > checks for 25% free memory, and if it is less asks for the cache to shrink.
> >
> > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
> >
> > #else   /* !sun */
> >         if (kmem_used() > (kmem_size() * 3) / 4)
> >                 return (1);
> > #endif  /* sun */
> >
> > Unfortunately these two functions do not return what the authors thought
> > they did.  It's clear what they're trying to do from the Solaris-specific
> > code up above this test.
>
> No, these functions do return what the authors think they do.
> The check is for KVA usage (kernel virtual address space), not for physical memory.

I understand, but that's nonsensical in the context of the Solaris code.  "lotsfree" is *not* a declaration of free kvm space; it's a declaration of when the system has "lots" of free *physical* memory.

Further, it makes no sense at all to allow the ARC cache to force things into virtual (e.g. swap-space backed) memory.  But that's the behavior that has been observed, and it fits with the code as originally written.

> > The result is that the cache only shrinks when vm_paging_needed() tests
> > true, but by that time the system is in serious memory trouble and by
>
> No, it is not.
> The description and numbers here are a little bit outdated but they should give
> an idea of how paging works in general:
> https://wiki.freebsd.org/AvgPageoutAlgorithm
>
> > triggering only there it actually drives the system further into paging,
>
> How does ARC eviction drive the system further into paging?

1. The system gets low on physical memory, but the ARC cache is looking at available kvm (of which there is plenty).  The ARC cache continues to expand.

2. vm_paging_needed() returns true and the system begins to page off to the swap.  At the same time the ARC cache is pared down because arc_reclaim_needed has returned "1".

3. As the ARC cache shrinks and paging occurs, vm_paging_needed() returns false.  Paging out ceases, but inactive pages remain on the swap.  They are not recalled until and unless they are scheduled to execute.  arc_reclaim_needed again returns "0".

4. The hold-down timer expires in the ARC cache code ("arc_grow_retry", declared as 60 seconds) and the ARC cache begins to expand again.

Go back to #2 until the system's performance deteriorates badly enough due to the paging that you notice it, which occurs when something that is actually consuming CPU time has to be called in from swap.
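To make the distinction concrete, here is a toy user-space model of the two tests -- it is not the patch, and every number in it is hypothetical, merely picked to resemble a 24GB box -- showing how a "3/4 of kmem" test can stay quiet while a Solaris-style free-physical-memory test would already be asking the ARC to shrink:

#include <stdio.h>
#include <stdint.h>

#define PAGE_BYTES      4096ULL

/* What the FreeBSD block effectively tests: KVA consumption. */
static int
reclaim_kva(uint64_t kmem_used, uint64_t kmem_size)
{
        return (kmem_used > (kmem_size * 3) / 4);
}

/* What the Solaris block intends: free physical pages vs. thresholds. */
static int
reclaim_phys(uint64_t freemem, uint64_t lotsfree, uint64_t needfree,
    uint64_t extra)
{
        return (freemem < lotsfree + needfree + extra);
}

int
main(void)
{
        /* Hypothetical KVA arena with plenty of headroom, so the test is quiet. */
        uint64_t kmem_size = 48ULL << 30;
        uint64_t kmem_used = 20ULL << 30;       /* mostly ARC, wired */

        /* Meanwhile physical memory on the same box is nearly exhausted. */
        uint64_t freemem  = (300ULL << 20) / PAGE_BYTES;    /* ~300MB free */
        uint64_t lotsfree = (24ULL << 30) / 64 / PAGE_BYTES; /* roughly physmem/64 */
        uint64_t needfree = 0;
        uint64_t extra    = lotsfree / 2;       /* stand-in for desfree */

        printf("KVA-based test wants reclaim:       %d\n",
            reclaim_kva(kmem_used, kmem_size));
        printf("physical-memory test wants reclaim: %d\n",
            reclaim_phys(freemem, lotsfree, needfree, extra));
        return (0);
}

The first test prints 0 and the second prints 1, and that gap is exactly the window in which the pager, rather than the ARC, ends up giving the memory back.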
This is consistent with what I and others have observed on both 9.2 and 10.0; the ARC will expand until it hits the maximum configured, even at the expense of forcing pages onto the swap.  In this specific machine's case, left to defaults it will grab nearly all physical memory (over 20GB of 24) and wire it down.

Limiting arc_max to 16GB sorta fixes it.  I say "sorta" because it turns out that 16GB is still too much for the workload; it prevents the pathological behavior where system "stalls" happen, but only in the extreme.  It turns out that with the patch in place my ARC cache stabilizes at about 13.5GB during the busiest part of the day, growing to about 16 off-hours.

One of the problems with just limiting it in /boot/loader.conf is that you have to guess, and the system doesn't reasonably adapt to changing memory loads.  The code is clearly intended to do that, but it doesn't end up working that way in practice.

> > because the pager will not recall pages from the swap until they are next
> > executed.  This leads the ARC to try to fill in all the available RAM even
> > though pages have been pushed off onto swap.  Not good.
>
> Unused physical memory is a waste.  It is true that ARC tries to use as much of
> memory as it is allowed.  The same applies to the page cache (Active, Inactive).
> Memory management is a dynamic system and there are a few competing agents.

That's true.  However, what the stock code does is force working set out of memory and into the swap.  The ideal situation is one in which there is no free memory because cache has sized itself to consume everything *not* necessary for the working set of the processes that are running.  Unfortunately we cannot determine this presciently, because a new process may come along and we do not necessarily know for how long a process that is blocked on an event will remain blocked (e.g. something waiting on network I/O, etc.)

However, it is my contention that you do not want to evict a process that is scheduled to run (or is going to be) in favor of disk cache, because you're defeating yourself by doing so.  The point of the disk cache is to avoid going to the physical disk for I/O, but if you page something out you have traded avoiding one physical I/O in favor of having to go to physical disk *twice* -- first to write the paged-out data to swap, and then to retrieve it when it is to be executed.  This also appears to be consistent with what is present for Solaris machines.

From the Sun code:

#ifdef sun
        /*
         * take 'desfree' extra pages, so we reclaim sooner, rather than later
         */
        extra = desfree;

        /*
         * check that we're out of range of the pageout scanner.  It starts to
         * schedule paging if freemem is less than lotsfree and needfree.
         * lotsfree is the high-water mark for pageout, and needfree is the
         * number of needed free pages.  We add extra pages here to make sure
         * the scanner doesn't start up while we're freeing memory.
         */
        if (freemem < lotsfree + needfree + extra)
                return (1);

        /*
         * check to make sure that swapfs has enough space so that anon
         * reservations can still succeed. anon_resvmem() checks that the
         * availrmem is greater than swapfs_minfree, and the number of reserved
         * swap pages.  We also add a bit of extra here just to prevent
         * circumstances from getting really dire.
         */
        if (availrmem < swapfs_minfree + swapfs_reserve + extra)
                return (1);

"freemem" is not virtual memory, it's actual memory.  "lotsfree" is the point where the system considers free RAM to be "ample"; "needfree" is the "desperation" point, and "extra" is the margin (presumably for image activation.)
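Reduced to a standalone predicate -- this is just a paraphrase of the Sun logic above with every quantity expressed as a page count, not anything taken from the FreeBSD tree -- the decision the Solaris side makes looks like this:

#include <stdint.h>

/*
 * Paraphrase of the Solaris-side test quoted above.  All arguments are
 * page counts; a return of 1 means "ask the ARC to shrink".  This is an
 * illustration of the shape of the test, not proposed kernel code.
 */
static int
solaris_style_reclaim_needed(uint64_t freemem, uint64_t lotsfree,
    uint64_t needfree, uint64_t desfree, uint64_t availrmem,
    uint64_t swapfs_minfree, uint64_t swapfs_reserve)
{
        uint64_t extra = desfree;       /* reclaim sooner rather than later */

        /* Stay out of the pageout scanner's range: physical pages, not KVA. */
        if (freemem < lotsfree + needfree + extra)
                return (1);

        /* Leave headroom for anonymous (swap-backed) reservations. */
        if (availrmem < swapfs_minfree + swapfs_reserve + extra)
                return (1);

        return (0);
}

Every input here is a physical-memory quantity; nothing in it cares how much kernel virtual address space happens to be in use.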
"Lotsfree" is the=20 point where the system considers free RAM to be "ample"; "needfree" is=20 the "desperation" point and "extra" is the margin (presumably for image=20 activation.) The base code on FreeBSD doesn't look at physical memory at all; it=20 looks at kvm space instead. > It is hard to correctly tune that system using a large hummer such as= your > patch. I believe that with your patch ARC will get shrunk to its min= imum size > in due time. Active + Inactive will grow to use the memory that you = are denying > to ARC driving Free below a threshold, which will reduce ARC. Repeat= ed enough > times this will drive ARC to its minimum. I disagree both in design theory and based on the empirical evidence of=20 actual operation. First, I don't (ever) want to give memory to the ARC cache that=20 otherwise would go to "active", because any time I do that I'm going to=20 force two page events, which is double the amount of I/O I would take on = a cache *miss*, and even with the ARC at minimum I get a reasonable hit=20 percentage. If I therefore prefer ARC over "active" pages I am going to = take *at least* a 200% penalty on physical I/O and if I get an 80% hit=20 ratio with the ARC at a minimum the penalty is closer to 800%! For inactive pages it's a bit more complicated as those may not be=20 reactivated. However, I am trusting FreeBSD's VM subsystem to demote=20 those that are unlikely to be reactivated to the cache bucket and then=20 to "free", where they are able to be re-used. This is consistent with=20 what I actually see on a running system -- the "inact" bucket is=20 typically fairly large (often on a busy machine close to that of=20 "active") but pages demoted to "cache" don't stay there long - they=20 either get re-promoted back up or they are freed and go on the free list.= The only time I see "inact" get out of control is when there's a kernel=20 memory leak somewhere (such as what I ran into the other day with the=20 in-kernel NAT subsystem on 10-STABLE.) But that's a bug and if it=20 happens you're going to get bit anyway. For example right now on one of my very busy systems with 24GB of=20 installed RAM and many terabytes of storage across three ZFS pools I'm=20 seeing 17GB wired of which 13.5 is ARC cache. That's the adaptive=20 figure it currently is running at, with a maximum of 22.3 and a minimum=20 of 2.79 (8:1 ratio.) The remainder is wired down for other reasons=20 (there's a fairly large Postgres server running on that box, among other = things, and it has a big shared buffer declaration -- that's most of the = difference.) Cache hit efficiency is currently 97.8%. Active is 2.26G right now, and inactive is 2.09G. Both are stable.=20 Overnight inactive will drop to about 1.1GB while active will not change = all that much since most of it postgres and the middleware that talks to = it along with apache, which leaves most of its processes present even=20 when they go idle. Peak load times are about right now (mid-day), and=20 again when the system is running backups nightly. Cache is 7448, in other words, insignificant. Free memory is 2.6G. The tunable is set to 10%, which is almost exactly what free memory is. = I find that when the system gets under 1G free transient image=20 activation can drive it into paging and performance starts to suffer for = my particular workload. > =20 > Also, there are a few technical problems with the patch: > - you don't need to use sysctl interface in kernel, the values you ne= ed are > available directly, just take a look at e.g. 
That's easily fixed.  I will look at it.

> - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly

I did not know whether, during setup of the OID, the value was copied (and thus you had to reference it later on) or the entry simply took the pointer and stashed that.  Easily corrected.

> - you don't need to sum various page counters to get a total count, there is
>   v_page_count

Fair enough as well.

> Lastly, can you try to test reverting your patch and instead setting
> vm.lowmem_period=0 ?

Yes.  By default it's 10; I have not tampered with that default.

Let me do a bit of work and I'll post back with a revised patch.  Perhaps a tunable for percentage free plus a free reserve that acts as a "floor"?  The problem with that is where to put the defaults.  One option would be to grab total size at init time and compute something similar to what "lotsfree" is for Solaris, allowing that to be tuned with the percentage if desired.  I selected 25% because that's what the original test was expressing, and it should be reasonable for modest RAM configurations.  It's clearly too high for moderately large (or huge) memory machines unless they have a lot of RAM-hungry processes running on them.

The percentage test, however, is an easy knob to twist that is unlikely to severely harm you if you dial it too far in either direction; anyone setting it to zero obviously knows what they're getting into, and if you crank it too high all you end up doing is limiting the ARC to the minimum value.
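Roughly what I have in mind, as a sketch rather than the revised patch itself -- the names below are placeholders, and in the kernel the two inputs would come straight from the VM counters Andriy points at (the total and free page counts) rather than through sysctl:

#include <stdint.h>

/*
 * Sketch of the proposed test; not the revised patch.  All quantities
 * are page counts.  "percent_target" is the tunable percentage of total
 * pages that should remain free; "free_reserve" is an absolute floor
 * computed once at init time, in the spirit of Solaris' "lotsfree".
 */
static int
arc_freemem_reclaim_needed(uint64_t total_pages, uint64_t free_pages,
    uint64_t percent_target, uint64_t free_reserve)
{
        /* Shrink the ARC if free memory drops below the target percentage... */
        if (free_pages < (total_pages * percent_target) / 100)
                return (1);

        /* ...or below the absolute floor, regardless of the percentage. */
        if (free_pages < free_reserve)
                return (1);

        return (0);
}

Setting percent_target to zero leaves only the floor in play, and cranking it very high just pins the ARC at its minimum -- which matches the "easy knob" behavior described above.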
--
-- Karl
karl@denninger.net