Date: Fri, 14 Mar 2014 06:21:50 -0500
From: Karl Denninger <karl@denninger.net>
To: freebsd-fs@freebsd.org
Subject: Re: Reoccurring ZFS performance problems [RESOLVED]
Message-ID: <5322E64E.8020009@denninger.net>
In-Reply-To: <5320A0E8.2070406@denninger.net>
References: <531E2406.8010301@denninger.net> <5320A0E8.2070406@denninger.net>

On 3/12/2014 1:01 PM, Karl Denninger wrote:
>
> On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
>> On 10.03.14 18:40, Adrian Gschwend wrote:
>>
>>> It looks like finally my MySQL process finished and now the system is
>>> back to completely fine:
>> ok it doesn't look it's only MySQL, stopped the process a while ago and
>> while it got calmer, I still have the issue.
> ZFS can be convinced to engage in what I can only surmise is
> pathological behavior, and I've seen no fix for it when it happens --
> but there are things you can do to mitigate it.
>
> What IMHO _*should*_ happen is that the ARC cache should shrink as
> necessary to prevent paging, subject to vfs.zfs.arc_min. One
> complication: segments that were paged off hours (or more!) ago and
> never get paged back in, because that particular piece of code never
> executes again. The owning process is still alive, so the system
> cannot reclaim those pages; they show as "committed" in pstat -s,
> but unless they are paged back in they have no impact on system
> performance. The policing here would therefore have to apply a
> "reasonableness" filter to those pages (e.g. if a page has been out
> on the page file for longer than "X", ignore that particular
> allocation unit for this purpose.)
>
> This would cause the ARC cache to flush itself down automatically as
> executable and data segment RAM commitments increase.
>
> The documentation says this is how it should work, but it doesn't
> appear to actually behave that way in practice for many workloads. I
> have seen "wired" RAM pinned at 20GB on one of my
> servers here with a fairly large DBMS running -- with pieces of its
> working set and even a user's shell (!) getting paged off, yet the
> ARC cache is not pared down to release memory. Indeed you can let the
> system run for hours under these conditions and the ARC wired memory
> will not decrease. Cutting back the DBMS's internal buffering does
> not help.
>
> What I've done here is restrict the ARC cache size in an attempt to
> prevent this particular bit of bogosity from biting me, and it appears
> to (sort of) work. Unfortunately you cannot tune this while the
> system is running -- only at boot time, in /boot/loader.conf.
> (Otherwise a user daemon could conceivably slash away at the arc_max
> sysctl and force the deallocation of wired memory whenever it detected
> paging -- or near-paging, such as free memory falling below some
> user-configured threshold.)
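>
> For reference, the cap goes in /boot/loader.conf like so (the 16 GB
> figure is only an example -- size it for your own workload; the value
> is in bytes):
>
> 	# Cap the ZFS ARC at 16 GB (value in bytes)
> 	vfs.zfs.arc_max="17179869184"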
>
> This is something that, should I get myself a nice hunk of free time,
> I may dive into and attempt to fix. It would likely take me quite a
> while to get up to speed on this as I've not gotten into the zfs code
> at all -- and mistakes in there could easily corrupt files.... (in
> other words definitely NOT something to play with on a production
> system!)
>
> I have to assume there's a pretty good reason why you can't change
> arc_max while the system is running; it _*can*_ be changed on a
> running system in some other implementations (e.g. Solaris.) It is
> marked with CTLFLAG_RDTUN in the ARC management file, which prohibits
> run-time changes, and the only place I see it referenced with a quick
> look is in the arc_init code.
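>
> The declaration in question looks roughly like this (paraphrased from
> memory of arc.c; check the source for the exact form):
>
> 	/*
> 	 * CTLFLAG_RDTUN: settable as a loader tunable, but read-only
> 	 * via sysctl(8) once the system is up.
> 	 */
> 	SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN,
> 	    &zfs_arc_max, 0, "Maximum ARC size");
>
> Making it writable would presumably mean switching to a read-write
> flag and adding a handler that validates the new value and kicks off
> reclamation.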
>
> Note that the test in arc.c for "arc_reclaim_needed" appears to be
> pretty basic -- essentially the system will not aggressively try to
> reclaim memory unless used kmem > 3/4 of its size.
>
> (snippet from around line 2494 of arc.c in 10-STABLE; path
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
>
> #else	/* !sun */
> 	if (kmem_used() > (kmem_size() * 3) / 4)
> 		return (1);
> #endif	/* sun */
>
> Up above that there's a test for "vm_paging_needed()" that would
> (theoretically) appear to trigger first in these situations, but it
> doesn't in many cases.
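>
> That earlier test is, from memory, essentially just this (again, see
> the actual source for the exact context):
>
> 	if (vm_paging_needed())
> 		return (1);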
>
> IMHO this is too basic a test and leads to pathological situations
> in that the system may wind up paging things off as opposed to paring
> back the ARC cache. As soon as the working set of something that's
> actually getting cycles gets paged out, system performance in most
> cases goes straight into the trash.
>
> On Sun machines (from reading the code) it will allegedly try to pare
> any time the "lotsfree" (plus "needfree" + "extra") amount of free
> memory is invaded.
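>
> That Sun-side branch of arc_reclaim_needed boils down to something
> like this (paraphrased; the variables are the Solaris kernel's page
> accounting globals):
>
> 	if (freemem < lotsfree + needfree + extra)
> 		return (1);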
>
> As an example, this is what a server I own that is exhibiting this
> behavior now shows (values in KB):
>
> 20202500 wire
>  1414052 act
>  2323280 inact
>   110340 cache
>   414484 free
>  1694896 buf
>
> Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81,
> so it's essentially right up against it.)
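>
> (Those two figures can be read off a running system with, for
> example:
>
> 	sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
>
> the former being the current ARC size in bytes and the latter the
> cap.)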
>
> That "free" number would be ok if it didn't result in the system
> having trashy performance -- but it does on occasion. Incidentally the
> allocated swap is about 195k blocks (~200 Megabytes) which isn't much
> all-in, but it's enough to force actual fetches of recently-used
> programs (e.g. your shell!) from paged-off space. The thing is that if
> the test in the code (75% of kmem available consumed) was looking only
> at "free" the system should be aggressively trying to free up ARC
> cache. It clearly is not; the included code calls this:
>
> uint64_t
> kmem_used(void)
> {
> 	/* Bytes currently allocated out of the kernel's kmem arena. */
> 	return (vmem_size(kmem_arena, VMEM_ALLOC));
> }
>
> I need to dig around and see exactly what that's measuring, because
> what's quite clear is that the system _*thinks*_ it has plenty of
> free memory when it very clearly is essentially out! In fact free
> memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.
> From this I surmise that the "vmem_size" call is not returning the
> sum of all the above "in use" sizes (except perhaps "inact"); were it
> to do so, that would be essentially 100% of installed RAM and the ARC
> cache should be actively shrinking, but it clearly is not.
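>
> (Arithmetic check, using the numbers above: 414484 free / (20202500 +
> 1414052 + 2323280 + 110340 + 414484) total = ~1.7%.)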
>
> I'll keep this one on my "to-do" list somewhere and, if I get the
> chance, see if I can come up with a better test. What might be
> interesting is to change the test to "pare if free space minus
> (pagefile space in use plus some modest margin) is less than zero."
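>
> A minimal sketch of that alternative test (freemem_bytes,
> swap_in_use_bytes and RECLAIM_MARGIN are hypothetical stand-ins, not
> existing kernel interfaces):
>
> 	/*
> 	 * Hypothetical: reclaim ARC when free RAM no longer covers
> 	 * what has already been pushed out to swap, plus a margin.
> 	 */
> 	static int
> 	arc_reclaim_needed_alt(void)
> 	{
> 		int64_t slack;
>
> 		slack = (int64_t)freemem_bytes() -
> 		    ((int64_t)swap_in_use_bytes() + RECLAIM_MARGIN);
> 		return (slack < 0);
> 	}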
>
> Fixing this tidbit of code could potentially be pretty significant in
> terms of resolving the occasional but very annoying "freeze" problems
> that people sometimes run into, along with some mildly pathological
> but very significant behavior in terms of how the ARC cache
> auto-scales and its impact on performance. I'm nowhere near
> up-to-speed enough on the internals of the kernel when it comes to
> figuring out what it has committed (e.g. how much swap is out, etc)
> and thus there's going to be a lot of code-reading involved before I
> can attempt something useful.
>
In the context of the above, here's a fix. Enjoy.
http://www.freebsd.org/cgi/query-pr.cgi?pr=187572
> Category: kern
> Responsible: freebsd-bugs
> Synopsis: ZFS ARC cache code does not properly handle low memory
> Arrival-Date: Fri Mar 14 11:20:00 UTC 2014
--
-- Karl
karl@denninger.net