Date:      Fri, 14 Mar 2014 06:21:50 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: Reoccurring ZFS performance problems  [RESOLVED]
Message-ID:  <5322E64E.8020009@denninger.net>
In-Reply-To: <5320A0E8.2070406@denninger.net>
References:  <531E2406.8010301@denninger.net> <5320A0E8.2070406@denninger.net>


On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like my MySQL process finally finished and now the system
>>> is back to completely fine:
>> OK, it doesn't look like it's only MySQL; I stopped the process a
>> while ago and while it got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is
> pathological behavior, and I've seen no fix for it when it happens --
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache should shrink as
> necessary to prevent paging, subject to vfs.zfs.arc_min.  There is a
> wrinkle: segments that were paged out hours (or more!) ago may never
> get paged back in, because that particular piece of code never
> executes again; the process is still alive, so the system cannot
> reclaim the memory and it shows as "committed" in pstat -s, yet
> unless it is paged back in it has no impact on system performance.
> To avoid pathological behavior from such segments, the policing here
> would have to apply a "reasonableness" filter to those pages (e.g. if
> a page has been out on the page file for longer than "X", ignore that
> allocation unit for this purpose.)
>
> This would cause the ARC cache to flush itself down automatically as
> executable and data segment RAM commitments increase.
>
> The documentation says that this is the case and how it should work,
> but it doesn't appear to actually behave this way in practice for
> many workloads.  I have seen "wired" RAM pinned at 20GB on one of my
> servers here with a fairly large DBMS running -- with pieces of its
> working set and even a user's shell (!) getting paged off, yet the
> ARC cache is not pared down to release memory.  Indeed you can let
> the system run for hours under these conditions and the ARC wired
> memory will not decrease.  Cutting back the DBMS's internal buffering
> does not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it
> appears to (sort of) work.  Unfortunately you cannot tune this while
> the system is running (otherwise a user daemon could conceivably
> slash away at the arc_max sysctl and force the deallocation of wired
> memory if it detected paging -- or near-paging, such as free memory
> below some user-configured threshold); it can only be set at boot
> time in /boot/loader.conf.
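>
> For illustration, capping the ARC at 8GB looks like this (the value
> is an example, not a recommendation -- size it for your own
> workload):
>
> # /boot/loader.conf
> # vfs.zfs.arc_max is in bytes; read-only tunable, takes effect at boot
> vfs.zfs.arc_max="8589934592"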
>
> This is something that, should I get myself a nice hunk of free time,
> I may dive into and attempt to fix.  It would likely take me quite a
> while to get up to speed on this, as I've not gotten into the ZFS
> code at all -- and mistakes in there could easily corrupt files....
> (In other words, definitely NOT something to play with on a
> production system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system on some other implementations (e.g. Solaris.)  It is
> marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
> run-time changes, and the only place I see it referenced with a quick
> look is in the arc_init code.
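>
> For reference, the declaration is essentially this (paraphrased from
> the FreeBSD-specific portion of arc.c):
>
> SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN,
>     &zfs_arc_max, 0, "Maximum ARC size");
>
> Note that simply making the flag writable would not be sufficient;
> given that arc_init appears to be the only consumer, something would
> also have to re-evaluate the limit after a run-time change.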
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be
> pretty basic -- essentially the system will not aggressively try to
> reclaim memory unless used kmem > 3/4 of its size.
>
> (Snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else   /* !sun */
>         if (kmem_used() > (kmem_size() * 3) / 4)
>                 return (1);
> #endif  /* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would
> (theoretically) appear to trigger first in these situations, but it
> doesn't in many cases.
>
> IMHO this is too basic a test and leads to pathological situations,
> in that the system may wind up paging things out as opposed to paring
> back the ARC cache.  As soon as the working set of something that's
> actually getting cycles gets paged out, in most cases system
> performance goes straight in the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> any time the "lotsfree" (plus "needfree" + "extra") amount of free
> memory is invaded.
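>
> That branch of arc_reclaim_needed reads roughly as follows
> (paraphrased from the #ifdef sun section of the same arc.c):
>
> #ifdef sun
>         /*
>          * Stay out of range of the pageout scanner, which starts
>          * scheduling paging when freemem drops below lotsfree +
>          * needfree.
>          */
>         if (freemem < lotsfree + needfree + extra)
>                 return (1);
> #endif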
>
> As an example, this is what a server I own that is exhibiting this
> behavior shows right now:
>
> 20202500 wire
>  1414052 act
>  2323280 inact
>   110340 cache
>   414484 free
>  1694896 buf
>
> Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81,=20
> so it's essentially right up against it.)
>
> That "free" number would be ok if it didn't result in the system=20
> having trashy performance -- but it does on occasion. Incidentally the =

> allocated swap is about 195k blocks (~200 Megabytes) which isn't much=20
> all-in, but it's enough to force actual fetches of recently-used=20
> programs (e.g. your shell!) from paged-off space. The thing is that if =

> the test in the code (75% of kmem available consumed) was looking only =

> at "free" the system should be aggressively trying to free up ARC=20
> cache.  It clearly is not; the included code calls this:
>
> uint64_t
> kmem_used(void)
> {
>         /* Space allocated from the kernel's kmem arena (address
>            space, not a direct measure of free physical RAM) */
>         return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that's measuring, because
> what's quite clear is that the system _*thinks*_ it has plenty of
> free memory when it very clearly is essentially out!  In fact free
> memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.
> From this I surmise that the "vmem_size" call is not returning the
> sum of all the above "in use" sizes (except perhaps "inact"); were it
> to do so, that would be essentially 100% of installed RAM and the ARC
> cache should be actively under shrinkage, but it clearly is not.
>
> I'll keep this one on my "to-do" list somewhere, and if I get the
> chance see if I can come up with a better test.  What might be
> interesting is to change the test to be "pare if free space less
> (pagefile space in use plus some modest margin) < 0".
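>
> Conceptually, something like this (untested; swap_space_in_use and
> reclaim_margin are placeholders, not existing kernel symbols):
>
> /*
>  * Hypothetical test: pare the ARC whenever free memory will not
>  * cover what has already been pushed out to swap, plus a cushion.
>  */
> if ((uint64_t)ptoa(cnt.v_free_count) <
>     swap_space_in_use + reclaim_margin)
>         return (1);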
>
> Fixing this tidbit of code could potentially be pretty significant in
> terms of resolving the occasional but very annoying "freeze" problems
> that people sometimes run into, along with some mildly-pathological
> but very-significant behavior in how the ARC cache auto-scales and
> its impact on performance.  I'm nowhere near up-to-speed enough on
> the internals of the kernel when it comes to figuring out what it has
> committed (e.g. how much swap is out, etc.), and thus there's going
> to be a lot of code-reading involved before I can attempt something
> useful.
>

In the context of the above, here's a fix.  Enjoy.

http://www.freebsd.org/cgi/query-pr.cgi?pr=187572

> Category:       kern
> Responsible:    freebsd-bugs
> Synopsis:       ZFS ARC cache code does not properly handle low memory
> Arrival-Date:   Fri Mar 14 11:20:00 UTC 2014

-- 
-- Karl
karl@denninger.net


