Date:      Wed, 12 Mar 2014 13:01:12 -0500
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@freebsd.org
Subject:   Re: Reoccurring ZFS performance problems  [[Possible Analysis]]
Message-ID:  <5320A0E8.2070406@denninger.net>
In-Reply-To: <531E2406.8010301@denninger.net>
References:  <531E2406.8010301@denninger.net>



On 3/10/2014 2:38 PM, Adrian Gschwend wrote:
> On 10.03.14 18:40, Adrian Gschwend wrote:
>
>> It looks like finally my MySQL process finished and now the system is
>> back to completely fine:
> ok it doesn't look it's only MySQL, stopped the process a while ago and
> while it got calmer, I still have the issue.
ZFS can be convinced to engage in what I can only surmise is 
pathological behavior, and I've seen no fix for it when it happens -- 
but there are things you can do to mitigate it.

What IMHO _*should*_ happen is that the ARC cache shrinks as necessary 
to prevent paging, subject to vfs.zfs.arc_min.  To prevent pathological 
problems with segments that were paged out hours (or more!) ago and 
never get paged back in -- because that particular piece of code never 
executes again, yet the process is still alive, so the system cannot 
reclaim the space (it shows as "committed" in pstat -s but has no impact 
on performance unless it is paged back in) -- the policing here would 
have to apply a "reasonableness" filter to those pages (e.g. if a page 
has been out on the page file for longer than some threshold "X", 
ignore that allocation unit for this purpose.)

This would cause the ARC cache to flush itself down automatically as 
executable and data segment RAM commitments increase.

The documentation says that this is the case and how it should work but 
it doesn't appear to actually be this way in practice for many 
workloads.  I have seen "wired" RAM pinned at 20GB on one of my servers 
here with a fairly large DBMS running -- with pieces of its working set 
and even a user's shell (!) getting paged off, yet the ARC cache is 
not pared down to release memory.  Indeed you can let the system run for 
hours under these conditions and the ARC wired memory will not 
decrease.  Cutting back the DBMS's internal buffering does not help.

What I've done here is restrict the ARC cache size in an attempt to 
prevent this particular bit of bogosity from biting me, and it appears 
to (sort of) work.  Unfortunately you cannot tune this while the system 
is running (otherwise a user daemon could conceivably slash away at the 
arc_max sysctl and force the deallocation of wired memory if it detected 
paging -- or near-paging, such as free memory below some user-configured 
threshold), only at boot time in /boot/loader.conf.
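For reference, the boot-time cap lives in /boot/loader.conf and looks 
like this (the sizes below are examples only -- pick values that leave 
headroom for your own working set):

```
# /boot/loader.conf -- example values only; size these for your workload
vfs.zfs.arc_max="8G"
vfs.zfs.arc_min="1G"
```

After a reboot you can confirm the cap with "sysctl vfs.zfs.arc_max".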

This is something that, should I get myself a nice hunk of free time, I 
may dive into and attempt to fix.  It would likely take me quite a while 
to get up to speed on this as I've not gotten into the zfs code at all 
-- and mistakes in there could easily corrupt files....  (in other words 
definitely NOT something to play with on a production system!)

I have to assume there's a pretty-good reason why you can't change 
arc_max while the system is running; it _*can*_ be changed on a running 
system on some other implementations (e.g. Solaris.)  It is marked with 
CTLFLAG_RDTUN in the arc management file which prohibits run-time 
changes and the only place I see it referenced with a quick look is in 
the arc_init code.

Note that the test in arc.c for "arc_reclaim_needed" appears to be 
pretty basic -- essentially the system will not aggressively try to 
reclaim memory unless used kmem > 3/4 of its size.

(snippet from arc.c around line 2494 of arc.c in 10-STABLE; path 
/usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)

#else   /* !sun */
         if (kmem_used() > (kmem_size() * 3) / 4)
                 return (1);
#endif  /* sun */

Up above that there's a test for "vm_paging_needed()" that would 
(theoretically) appear to trigger first in these situations, but it 
doesn't in many cases.

IMHO this is too basic a test, and it leads to pathological situations 
in which the system winds up paging things out rather than paring back 
the ARC cache.  As soon as the working set of something that's actually 
getting cycles is paged out, system performance in most cases goes 
straight into the trash.

On sun machines (from reading the code) it will allegedly try to pare 
any time the "lotsfree" (plus "needfree" + "extra") amount of free 
memory is invaded.

As an example this is what a server I own that is exhibiting this 
behavior now shows:
20202500 wire
 1414052 act
 2323280 inact
  110340 cache
  414484 free
 1694896 buf

Of that "wired" mem 15.7G of it is ARC cache (with a target of 15.81, so 
it's essentially right up against it.)

That "free" number would be ok if it didn't result in the system having 
trashy performance -- but it does on occasion. Incidentally the 
allocated swap is about 195k blocks (~200 Megabytes) which isn't much 
all-in, but it's enough to force actual fetches of recently-used 
programs (e.g. your shell!) from paged-off space.  The thing is, if the 
test in the code (75% of available kmem consumed) were driven by "free" 
memory, the system should be aggressively trying to free up ARC cache.  
It clearly is not; the included code calls this:

uint64_t
kmem_used(void)
{

         return (vmem_size(kmem_arena, VMEM_ALLOC));
}

I need to dig around and see exactly what that's measuring, because 
what's quite clear is that the system _*thinks*_ it has plenty of free 
memory when it very-clearly is essentially out!  In fact free memory at 
the moment (~400MB) is 1.7% of the total, _*not*_ 25%.  From this I 
surmise that the "vmem_size" call is not returning the sum of all the 
above "in use" sizes (except perhaps "inact"); were it to do so that 
would be essentially 100% of installed RAM and the ARC cache should be 
actively under shrinkage, but it clearly is not.

I'll keep this one on my "to-do" list somewhere and, if I get the 
chance, see if I can come up with a better test.  What might be 
interesting is to change the test to "pare if (free space - (pagefile 
space in use + some modest margin)) < 0".

Fixing this tidbit of code could potentially be pretty significant in 
terms of resolving the occasional but very annoying "freeze" problems 
that people sometimes run into, along with some mildly-pathological but 
very-significant behavior in terms of how the ARC cache auto-scales and 
its impact on performance.  I'm nowhere near up-to-speed enough on the 
internals of the kernel when it comes to figuring out what it has 
committed (e.g. how much swap is out, etc) and thus there's going to be 
a lot of code-reading involved before I can attempt something useful.

-- 
-- Karl
karl@denninger.net


