Date:      Wed, 12 Mar 2014 13:01:12 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: Reoccurring ZFS performance problems  [[Possible Analysis]]
Message-ID:  <5320A0E8.2070406@denninger.net>
In-Reply-To: <531E2406.8010301@denninger.net>
References:  <531E2406.8010301@denninger.net>

On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
> On 10.03.14 18:40, Adrian Gschwend wrote:
>
>> It looks like finally my MySQL process finished and now the system is
>> back to completely fine:
> ok, it doesn't look like it's only MySQL; I stopped the process a while
> ago and while it got calmer, I still have the issue.
ZFS can be convinced to engage in what I can only surmise is
pathological behavior, and I've seen no fix for it when it happens --
but there are things you can do to mitigate it.

What IMHO _*should*_ happen is that the ARC cache should shrink as
necessary to prevent paging, subject to vfs.zfs.arc_min.  To prevent
pathological problems with segments that were paged out hours (or
more!) ago and never get paged back in because that particular piece of
code never executes again (the process is still alive, so the system
cannot reclaim the memory and it shows as "committed" in pstat -s, but
unless it is paged back in it has no impact on system performance), the
policing here would have to apply a "reasonableness" filter to those
pages (e.g. if a page has been out on the page file for longer than "X",
ignore that particular allocation unit for this purpose.)
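
A rough sketch of that filter idea, purely hypothetical -- none of these
structures or names exist in the kernel; it only illustrates "count just
the recently paged-out segments as memory pressure":

#include <stdint.h>
#include <time.h>

struct paged_out_seg {
        uint64_t bytes;          /* size of the paged-out allocation unit */
        time_t   paged_out_at;   /* when it was written to the page file */
};

static uint64_t
recent_swap_pressure(const struct paged_out_seg *segs, int nsegs,
    time_t now, time_t max_age)
{
        uint64_t pressure = 0;
        int i;

        for (i = 0; i < nsegs; i++) {
                /* Ignore anything idle on the page file longer than "X". */
                if (now - segs[i].paged_out_at <= max_age)
                        pressure += segs[i].bytes;
        }
        return (pressure);
}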

This would cause the ARC cache to flush itself down automatically as
executable and data segment RAM commitments increase.

The documentation says that this is the case and how it should work, but
it doesn't appear to actually be this way in practice for many
workloads.  I have seen "wired" RAM pinned at 20GB on one of my servers
here with a fairly large DBMS running -- with pieces of its working set
and even a user's shell (!) getting paged off, yet the ARC cache is
not pared down to release memory.  Indeed you can let the system run for
hours under these conditions and the ARC wired memory will not
decrease.  Cutting back the DBMS's internal buffering does not help.

What I've done here is restrict the ARC cache size in an attempt to
prevent this particular bit of bogosity from biting me, and it appears
to (sort of) work.  Unfortunately you cannot tune this while the system
is running (otherwise a user daemon could conceivably slash away at the
arc_max sysctl and force the deallocation of wired memory if it detected
paging -- or near-paging, such as free memory below some user-configured
threshold); it can only be set at boot time in /boot/loader.conf.
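
For example, something like this in /boot/loader.conf (the 8GB figure is
purely illustrative; the value is a byte count and takes effect at the
next boot):

# Cap the ARC at 8GB (value in bytes) -- pick a number that fits the machine.
vfs.zfs.arc_max="8589934592"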

This is something that, should I get myself a nice hunk of free time, I
may dive into and attempt to fix.  It would likely take me quite a while
to get up to speed on this as I've not gotten into the ZFS code at all --
and mistakes in there could easily corrupt files...  (in other words,
definitely NOT something to play with on a production system!)

I have to assume there's a pretty good reason why you can't change
arc_max while the system is running; it _*can*_ be changed on a running
system in some other implementations (e.g. Solaris).  It is marked with
CTLFLAG_RDTUN in the ARC management file, which prohibits run-time
changes, and the only place I see it referenced with a quick look is in
the arc_init code.

Note that the test in arc.c for arc_reclaim_needed() appears to be
pretty basic -- essentially the system will not aggressively try to
reclaim memory unless used kmem > 3/4 of its size.

(snippet from around line 2494 of arc.c in 10-STABLE; path
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)

#else   /* !sun */
         if (kmem_used() > (kmem_size() * 3) / 4)
                 return (1);
#endif  /* sun */

Up above that there's a test for vm_paging_needed() that would
(theoretically) appear to trigger first in these situations, but it
doesn't in many cases.
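
Paraphrased, the decision path described above amounts to something like
this (a sketch in kernel context, not the verbatim arc.c source):

static int
arc_reclaim_needed_sketch(void)
{

        /* Pager-pressure check -- in the cases described it rarely fires. */
        if (vm_paging_needed())
                return (1);

        /* FreeBSD-specific fallback: only 75%+ kmem usage triggers it. */
        if (kmem_used() > (kmem_size() * 3) / 4)
                return (1);

        return (0);
}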

IMHO this is too basic a test and leads to pathological situations in
that the system may wind up paging things off as opposed to paring back
the ARC cache.  As soon as the working set of something that's actually
getting cycles gets paged out, in most cases system performance goes
straight into the trash.

On Sun machines (from reading the code) it will allegedly try to pare
any time the "lotsfree" (plus "needfree" + "extra") amount of free
memory is invaded.

As an example this is what a server I own that is exhibiting this
behavior now shows:

  20202500 wire
   1414052 act
   2323280 inact
    110340 cache
    414484 free
   1694896 buf

Of that "wired" memory, 15.7G is ARC cache (with a target of 15.81G, so
it's essentially right up against it.)

That "free" number would be okay if it didn't result in the system having
trashy performance -- but it does on occasion.  Incidentally the
allocated swap is about 195k blocks (~200 megabytes), which isn't much
all-in, but it's enough to force actual fetches of recently-used
programs (e.g. your shell!) from paged-off space.  The thing is, if
the test in the code (75% of kmem consumed) were looking only
at "free", the system should be aggressively trying to free up ARC
cache.  It clearly is not; the included code calls this:

uint64_t
kmem_used(void)
{

         return (vmem_size(kmem_arena, VMEM_ALLOC));
}

I need to dig around and see exactly what that's measuring, because
what's quite clear is that the system _*thinks*_ it has plenty of free
memory when it very clearly is essentially out!  In fact free memory at
the moment (~400MB) is 1.7% of the total, _*not*_ 25%.  From this I
surmise that the "vmem_size" call is not returning the sum of all the
above "in use" sizes (except perhaps "inact"); were it to do so, that
would be essentially 100% of installed RAM and the ARC cache should be
actively under shrinkage, but it clearly is not.

I'll keep this one on my "to-do" list somewhere and if I get the chance
see if I can come up with a better test.  What might be interesting is
to change the test to "pare if free space, less (pagefile space in use
plus some modest margin), is below zero."
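
A minimal sketch of that alternative test -- entirely hypothetical, and
free_memory_bytes() and swap_in_use_bytes() are placeholder names for
whatever counters the kernel actually exposes (finding those is exactly
the code-reading mentioned below):

/* Placeholder prototypes -- these helpers do not exist in the kernel. */
uint64_t free_memory_bytes(void);
uint64_t swap_in_use_bytes(void);

static int
arc_reclaim_needed_proposed(void)
{
        uint64_t margin = 256ULL * 1024 * 1024;  /* a "modest margin", e.g. 256MB */

        /* Pare the ARC if free memory less (swap in use + margin) < 0. */
        if (free_memory_bytes() < swap_in_use_bytes() + margin)
                return (1);

        return (0);
}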

Fixing this tidbit of code could potentially be pretty significant in
terms of resolving the occasional but very annoying "freeze" problems
that people sometimes run into, along with some mildly pathological but
very significant behavior in terms of how the ARC cache auto-scales and
its impact on performance.  I'm nowhere near up to speed on the
internals of the kernel when it comes to figuring out what it has
committed (e.g. how much swap is out, etc.) and thus there's going to be
a lot of code-reading involved before I can attempt something useful.

--
-- Karl
karl@denninger.net

