From: Karl Denninger <karl@denninger.net>
Date: Wed, 12 Mar 2014 13:01:12 -0500
To: freebsd-fs@freebsd.org
Subject: Re: Reoccurring ZFS performance problems [[Possible Analysis]]
Message-ID: <5320A0E8.2070406@denninger.net>
In-Reply-To: <531E2406.8010301@denninger.net>

On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
> On 10.03.14 18:40, Adrian Gschwend wrote:
>
>> It looks like finally my MySQL process finished and now the system is
>> back to completely fine:
>
> ok, it doesn't look like it's only MySQL -- I stopped the process a
> while ago and while it got calmer, I still have the issue.

ZFS can be convinced to engage in what I can only surmise is pathological behavior, and I've seen no fix for it when it happens -- but there are things you can do to mitigate it.

What IMHO _*should*_ happen is that the ARC cache shrinks as necessary to prevent paging, subject to vfs.zfs.arc_min. There is one complication: segments that were paged out hours (or more!) ago and never get paged back in, because that particular piece of code never executes again. The owning process is still alive, so the system cannot reclaim the memory, and it still shows as "committed" in pstat -s, but unless it is paged back in it has no impact on system performance. The policing would therefore have to apply a "reasonableness" filter to those pages: e.g. if a segment has been out on the page file for longer than some threshold "X", ignore that allocation unit for this purpose.

This would cause the ARC cache to flush itself down automatically as executable and data segment RAM commitments increase.
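To make the filter idea concrete, here is a minimal sketch of the sort of test I have in mind. It is purely illustrative -- none of these names exist in arc.c, and the threshold is arbitrary:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative threshold: a segment paged out for more than an
     * hour is presumed dormant. */
    #define STALE_SECONDS   (60 * 60)

    /*
     * Should a paged-out allocation unit count as memory pressure when
     * deciding whether the ARC must shrink?  A segment that has sat on
     * the page file for ages is still "committed" (pstat -s) but has
     * no effect on current performance, so it is ignored here.
     */
    static bool
    counts_as_pressure(uint64_t seconds_paged_out)
    {
            return (seconds_paged_out < STALE_SECONDS);
    }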
The documentation says that this is the case and describes how it should work, but it doesn't appear to actually behave this way in practice for many workloads. I have seen "wired" RAM pinned at 20GB on one of my servers here with a fairly large DBMS running -- with pieces of its working set and even a user's shell (!) getting paged off, yet the ARC cache is not pared down to release memory. Indeed, you can let the system run for hours under these conditions and the ARC wired memory will not decrease. Cutting back the DBMS's internal buffering does not help.

What I've done here is restrict the ARC cache size in an attempt to prevent this particular bit of bogosity from biting me, and it appears to (sort of) work. Unfortunately, you cannot tune this while the system is running (otherwise a user daemon could conceivably slash away at the arc_max sysctl and force the deallocation of wired memory if it detected paging -- or near-paging, such as free memory below some user-configured threshold); it can only be set at boot time in /boot/loader.conf.

This is something that, should I get myself a nice hunk of free time, I may dive into and attempt to fix. It would likely take me quite a while to get up to speed, as I've not gotten into the ZFS code at all -- and mistakes in there could easily corrupt files... (in other words, definitely NOT something to play with on a production system!)

I have to assume there's a pretty good reason why you can't change arc_max while the system is running; it _*can*_ be changed on a running system in some other implementations (e.g. Solaris). It is marked with CTLFLAG_RDTUN in the ARC management file, which prohibits run-time changes, and the only place I see it referenced on a quick look is in the arc_init code.

Note that the test in arc.c for arc_reclaim_needed() appears to be pretty basic -- essentially, the system will not aggressively try to reclaim memory unless used kmem exceeds 3/4 of its size. Snippet from around line 2494 of arc.c in 10-STABLE (path /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs):

    #else   /* !sun */
            if (kmem_used() > (kmem_size() * 3) / 4)
                    return (1);
    #endif  /* sun */

Up above that there is a test for vm_paging_needed() that would (theoretically) appear to trigger first in these situations, but in many cases it doesn't.

IMHO this is too basic a test, and it leads to pathological situations in which the system winds up paging things out rather than paring back the ARC cache. As soon as the working set of something that's actually getting cycles gets paged out, in most cases system performance goes straight into the trash.

On Sun machines (from reading the code) it will allegedly try to pare any time the "lotsfree" (plus "needfree" plus "extra") amount of free memory is invaded.

As an example, this is what a server I own that is exhibiting this behavior shows right now:

    20202500 wire
     1414052 act
     2323280 inact
      110340 cache
      414484 free
     1694896 buf

Of that "wired" memory, 15.7GB is ARC cache (with a target of 15.81GB, so it's essentially right up against it).

That "free" number would be OK if it didn't result in the system having trashy performance -- but on occasion it does. Incidentally, the allocated swap is about 195k blocks (~200 megabytes), which isn't much all-in, but it's enough to force actual fetches of recently-used programs (e.g. your shell!) from paged-off space.
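For what it's worth, these figures are easy to watch programmatically, which is exactly what the hypothetical "user daemon" above would need to do. Below is a minimal userland sketch -- my own illustration, not anything in the base system -- that samples the standard sysctl OIDs for the ARC's current size, its target maximum, and free memory via sysctlbyname(3):

    #include <sys/types.h>
    #include <sys/sysctl.h>

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Read a sysctl into a caller-supplied buffer, exiting on failure. */
    static void
    get(const char *name, void *buf, size_t len)
    {
            if (sysctlbyname(name, buf, &len, NULL, 0) == -1) {
                    perror(name);
                    exit(1);
            }
    }

    int
    main(void)
    {
            uint64_t arc_size, arc_cmax;
            uint32_t free_pages, page_size;

            /* Current ARC size and its target maximum, in bytes. */
            get("kstat.zfs.misc.arcstats.size", &arc_size, sizeof(arc_size));
            get("kstat.zfs.misc.arcstats.c_max", &arc_cmax, sizeof(arc_cmax));

            /* Free page count and page size, to compute free bytes. */
            get("vm.stats.vm.v_free_count", &free_pages, sizeof(free_pages));
            get("vm.stats.vm.v_page_size", &page_size, sizeof(page_size));

            printf("ARC: %ju of %ju bytes; free: %ju bytes\n",
                (uintmax_t)arc_size, (uintmax_t)arc_cmax,
                (uintmax_t)free_pages * page_size);
            return (0);
    }

A daemon along these lines could poll those values and react when "free" gets invaded -- if only arc_max were tunable at run time.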
The thing is that if the test in the code (75% of kmem consumed) were looking only at "free", the system should be aggressively trying to free up ARC cache. It clearly is not; the included code calls this:

    uint64_t
    kmem_used(void)
    {
            return (vmem_size(kmem_arena, VMEM_ALLOC));
    }

I need to dig around and see exactly what that's measuring, because what's quite clear is that the system _*thinks*_ it has plenty of free memory when it very clearly is essentially out! In fact, free memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%. From this I surmise that the vmem_size() call is not returning the sum of all the above "in use" sizes (except perhaps "inact"); were it to do so, that would be essentially 100% of installed RAM, and the ARC cache should be actively under shrinkage, but it clearly is not.

I'll keep this one on my "to-do" list somewhere, and if I get the chance I'll see if I can come up with a better test. What might be interesting is to change the test to "pare if free space, less the page file space in use plus some modest margin, is < 0".

Fixing this tidbit of code could potentially be pretty significant in terms of resolving the occasional but very annoying "freeze" problems that people sometimes run into, along with some mildly pathological but very significant behavior in how the ARC cache auto-scales and its impact on performance. I'm nowhere near up to speed on the internals of the kernel when it comes to figuring out what it has committed (e.g. how much swap is out, etc.), so there's going to be a lot of code-reading involved before I can attempt something useful.
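As a starting point, the proposed test above would amount to something like the following sketch. The inputs are hypothetical -- the kernel-side plumbing to obtain them (and the right margin value) is exactly the part that needs the code-reading mentioned above:

    /* Illustrative margin: headroom to keep before paring the ARC. */
    #define RECLAIM_MARGIN  (32ULL * 1024 * 1024)  /* 32MB, arbitrary */

    /*
     * Pare the ARC whenever free space, less the page file space
     * already in use plus a modest margin, would go negative.
     * Both inputs are byte counts supplied by the caller.
     */
    static int
    arc_should_reclaim(uint64_t free_bytes, uint64_t pagefile_used_bytes)
    {
            return (free_bytes < pagefile_used_bytes + RECLAIM_MARGIN);
    }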
-- 
-- Karl
karl@denninger.net