Date:        Fri, 14 Mar 2014 06:21:50 -0500
From:        Karl Denninger <karl@denninger.net>
To:          freebsd-fs@freebsd.org
Subject:     Re: Reoccurring ZFS performance problems [RESOLVED]
Message-ID:  <5322E64E.8020009@denninger.net>
In-Reply-To: <5320A0E8.2070406@denninger.net>
References:  <531E2406.8010301@denninger.net> <5320A0E8.2070406@denninger.net>
On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like finally my MySQL process finished and now the system
>>> is back to completely fine:
>> ok, it doesn't look like it's only MySQL; I stopped the process a
>> while ago and, while it got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is
> pathological behavior, and I've seen no fix for it when it happens --
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache should shrink as
> necessary to prevent paging, subject to vfs.zfs.arc_min.  To prevent
> pathological problems with segments that were paged out hours (or
> more!) ago and never get paged back in because that particular piece
> of code never executes again (but the process is still alive, so the
> system cannot reclaim the space; it shows as "committed" in pstat -s
> but, unless it is paged back in, has no impact on system performance),
> the policing on this would have to apply a "reasonableness" filter to
> those pages (e.g. if a page has been out on the page file for longer
> than "X", ignore that particular allocation unit for this purpose.)
>
> This would cause the ARC cache to flush itself down automatically as
> executable and data segment RAM commitments increase.
>
> The documentation says that this is the case and how it should work,
> but it doesn't appear to actually behave this way in practice for many
> workloads.  I have seen "wired" RAM pinned at 20GB on one of my
> servers here with a fairly large DBMS running -- with pieces of its
> working set and even a user's shell (!) getting paged off, yet the
> ARC cache is not pared down to release memory.  Indeed you can let
> the system run for hours under these conditions and the ARC wired
> memory will not decrease.  Cutting back the DBMS's internal buffering
> does not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it
> appears to (sort of) work.  Unfortunately you cannot tune this while
> the system is running (otherwise a user daemon could conceivably
> slash away at the arc_max sysctl and force the deallocation of wired
> memory if it detected paging -- or near-paging, such as free memory
> below some user-configured threshold); it can only be set at boot
> time in /boot/loader.conf.
>
> This is something that, should I get myself a nice hunk of free time,
> I may dive into and attempt to fix.  It would likely take me quite a
> while to get up to speed on this as I've not gotten into the ZFS code
> at all -- and mistakes in there could easily corrupt files.... (in
> other words definitely NOT something to play with on a production
> system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system on some other implementations (e.g. Solaris.)  It is
> marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
> run-time changes, and the only place I see it referenced with a quick
> look is in the arc_init code.
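(As an aside, the boot-time cap referred to above is just the loader
tunable.  For example, in /boot/loader.conf -- the size below is purely
illustrative, pick a value that fits your own machine and workload:

    # Cap the ZFS ARC at boot; example value only
    vfs.zfs.arc_max="16G"

Because the sysctl is read-only/tunable, a reboot is required for the
change to take effect.)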
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be
> pretty basic -- essentially the system will not aggressively try to
> reclaim memory unless used kmem > 3/4 of its size.
>
> (snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else   /* !sun */
>         if (kmem_used() > (kmem_size() * 3) / 4)
>                 return (1);
> #endif  /* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would
> (theoretically) appear to trigger first in these situations, but it
> doesn't in many cases.
>
> IMHO this is too basic a test and leads to pathological situations in
> that the system may wind up paging things out instead of paring back
> the ARC cache.  As soon as the working set of something that's
> actually getting cycles gets paged out, system performance in most
> cases goes straight into the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> any time the "lotsfree" (plus "needfree" + "extra") amount of free
> memory is invaded.
>
> As an example, this is what a server I own that is exhibiting this
> behavior now shows:
>
>  20202500 wire
>   1414052 act
>   2323280 inact
>    110340 cache
>    414484 free
>   1694896 buf
>
> Of that "wired" memory, 15.7G is ARC cache (with a target of 15.81G,
> so it's essentially right up against it.)
>
> That "free" number would be ok if it didn't result in the system
> having trashy performance -- but it does on occasion.  Incidentally
> the allocated swap is about 195k blocks (~200 megabytes), which isn't
> much all-in, but it's enough to force actual fetches of recently-used
> programs (e.g. your shell!) from paged-off space.  The thing is that
> if the test in the code (75% of available kmem consumed) were looking
> only at "free", the system should be aggressively trying to free up
> ARC cache.  It clearly is not; the included code calls this:
>
> uint64_t
> kmem_used(void)
> {
>
>         return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that's measuring, because
> what's quite clear is that the system _*thinks*_ it has plenty of
> free memory when it very clearly is essentially out!  In fact free
> memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.
> From this I surmise that the "vmem_size" call is not returning the
> sum of all the above "in use" sizes (except perhaps "inact"); were it
> to do so, that would be essentially 100% of installed RAM and the ARC
> cache should be actively under shrinkage, but it clearly is not.
>
> I'll keep this one on my "to-do" list somewhere and, if I get the
> chance, see if I can come up with a better test.  What might be
> interesting is to change the test to be "pare if (free space - (page
> file space in use + some modest margin)) < 0".
>
> Fixing this tidbit of code could potentially be pretty significant in
> terms of resolving the occasional but very annoying "freeze" problems
> that people sometimes run into, along with some mildly pathological
> but very significant behavior in terms of how the ARC cache
> auto-scales and its impact on performance.  I'm nowhere near
> up-to-speed enough on the internals of the kernel when it comes to
> figuring out what it has committed (e.g. how much swap is out, etc.),
> so there's going to be a lot of code-reading involved before I can
> attempt something useful.
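Purely as an illustration of that last idea, the altered test might
look roughly like the sketch below.  The helper names and the margin
value are made-up placeholders for the sake of the example (they are
not existing kernel interfaces); the real change is the one referenced
in the PR below.

    /*
     * Illustration only -- NOT the actual patch.  free_page_count()
     * and swap_used_page_count() are hypothetical stand-ins for
     * whatever counters the VM layer actually exposes.
     */
    static int
    arc_reclaim_needed_sketch(void)
    {
            int64_t margin = 32768;         /* modest headroom, in pages */
            int64_t headroom;

            /* pare if (free space - (page file in use + margin)) < 0 */
            headroom = (int64_t)free_page_count() -
                ((int64_t)swap_used_page_count() + margin);

            return (headroom < 0);
    }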
In the context of the above, here's a fix.

Enjoy.

http://www.freebsd.org/cgi/query-pr.cgi?pr=187572

> Category:       kern
> Responsible:    freebsd-bugs
> Synopsis:       ZFS ARC cache code does not properly handle low memory
> Arrival-Date:   Fri Mar 14 11:20:00 UTC 2014

--
-- Karl
karl@denninger.net