Date:      Wed, 26 Mar 2014 12:30:03 GMT
From:      Karl Denninger <karl@denninger.net>
To:        freebsd-fs@FreeBSD.org
Subject:   Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID:  <201403261230.s2QCU3vI095105@freefall.freebsd.org>

The following reply was made to PR kern/187594; it has been noted by GNATS.

From: Karl Denninger <karl@denninger.net>
To: bug-followup@FreeBSD.org
Cc:  
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Date: Wed, 26 Mar 2014 07:20:25 -0500

 Updated to handle the change in <sys/vmmeter.h> that was recently
 committed to HEAD, and slightly tweaked the default reservation to be
 equal to the VM system's "wakeup" level.
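
A minimal userland sketch of what that default means for the low-memory
test. The function name arc_shrink_wanted and its parameters are
hypothetical, chosen only for illustration; in the patch the inputs come
from vm_cnt and the two tunables. The ARC is asked to shrink once free
pages drop below the base reservation plus the optional percentage of
total pages.

```c
#include <stdbool.h>

/*
 * Hypothetical userland mirror of the patch's low-memory test.
 * free_count and page_count stand in for vm_cnt.v_free_count and
 * vm_cnt.v_page_count; freepages and percent_target stand in for the
 * vfs.zfs.arc_freepages and vfs.zfs.arc_freepage_percent tunables.
 */
static bool
arc_shrink_wanted(unsigned free_count, unsigned page_count,
    int freepages, int percent_target)
{
	/* Base reservation plus an optional percentage of all pages. */
	unsigned required = (unsigned)freepages +
	    (page_count / 100) * (unsigned)percent_target;

	return (free_count < required);
}
```

With percent_target left at 0, the test reduces to comparing free pages
against the base reservation alone.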
 
 This appears, after lots of use in multiple environments, to be the
 ideal default setting.  The knobs remain if you wish to twist them, and
 I have also exposed the flag that reports whether shrinking is needed,
 should you want to monitor it for some reason.
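
For monitoring, a userland sketch along these lines could poll the new
status flag. It assumes a FreeBSD kernel built with this patch;
read_int_sysctl is a hypothetical helper, and on any other system (or
when the OID does not exist) it simply reports failure.

```c
#include <stddef.h>
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/sysctl.h>
#endif

/*
 * Hypothetical helper: read an int-valued sysctl by name.
 * Returns 0 on success, -1 if the OID is missing or this is not FreeBSD.
 */
static int
read_int_sysctl(const char *name, int *out)
{
#ifdef __FreeBSD__
	size_t len = sizeof(*out);

	if (sysctlbyname(name, out, &len, NULL, 0) != 0 ||
	    len != sizeof(*out))
		return (-1);
	return (0);
#else
	(void)name;
	(void)out;
	return (-1);
#endif
}
```

Calling read_int_sysctl("vfs.zfs.arc_shrink_needed", &flag) periodically
and logging each change in the flag would show how often reclaim trips.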
 
 This change to arc.c has made a tremendous (and positive) difference in
 system behavior, and others who are running it have made similar comments.
 
 For those having problems with the PR system mangling these patches, you
 can get the patch below via direct fetch at
 http://www.denninger.net/FreeBSD-Patches/arc-patch
 
 *** arc.c.original	Sun Mar 23 14:56:01 2014
 --- arc.c	Tue Mar 25 09:24:14 2014
 ***************
 *** 18,23 ****
 --- 18,95 ----
     *
     * CDDL HEADER END
     */
 +
 + /* Karl Denninger (karl@denninger.net), 3/25/2014, FreeBSD-specific
 +  *
 +  * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
 +  * the ARC cache to be pared down.  The reason for the change is that the
 +  * apparent attempted algorithm is to start evicting ARC cache when free
 +  * pages fall below 25% of installed RAM.  This maps reasonably well to how
 +  * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
 +  * to pare down.
 +  *
 +  * The problem is that on FreeBSD machines the system doesn't appear to be
 +  * getting what the authors of the original code thought they were looking at
 +  * with its test -- or at least not what Solaris did -- and as a result that
 +  * test never triggers.  That leaves the only reclaim trigger as the "paging
 +  * needed" status flag, and by the time that trips the system is already
 +  * in low-memory trouble.  This can lead to severe pathological behavior
 +  * under the following scenario:
 +  * - The system starts to page and ARC is evicted.
 +  * - The system stops paging as ARC's eviction drops wired RAM a bit.
 +  * - ARC starts increasing its allocation again, and wired memory grows.
 +  * - A new image is activated, and the system once again attempts to page.
 +  * - ARC starts to be evicted again.
 +  * - Back to the second step.
 +  *
 +  * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
 +  * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
 +  * else needs it.  That would be ok if we evicted cache when required.
 +  *
 +  * Unfortunately the system can get into a state where it never
 +  * manages to page anything of materiality back in, as if there is active
 +  * I/O the ARC will start grabbing space once again as soon as the memory
 +  * contention state drops.  For this reason the "paging is occurring" flag
 +  * should be the **last resort** condition for ARC eviction; you want to
 +  * (as Solaris does) start when there is material free RAM left BUT the
 +  * vm system thinks it needs to be active to steal pages back in the attempt
 +  * to never get into the condition where you're potentially paging off
 +  * executables in favor of leaving disk cache allocated.
 +  *
 +  * To fix this we change how we look at low memory, declaring two new
 +  * runtime tunables and one status.
 +  *
 +  * The new sysctls are:
 +  * vfs.zfs.arc_freepages (free pages required to call RAM "sufficient")
 +  * vfs.zfs.arc_freepage_percent (additional reservation percentage, default 0)
 +  * vfs.zfs.arc_shrink_needed (shows "1" if we're asking for shrinking the ARC)
 +  *
 +  * vfs.zfs.arc_freepages is initialized from vm.v_free_target.
 +  * This should ensure that we allow the VM system to steal pages,
 +  * but pare the cache before we suspend processes attempting to get more
 +  * memory, thereby avoiding "stalls."  You can set this higher if you wish,
 +  * or force a specific percentage reservation as well, but doing so may
 +  * cause the cache to pare back while the VM system remains willing to
 +  * allow "inactive" pages to accumulate.  The challenge is that image
 +  * activation can force things into the page space on a repeated basis
 +  * if you allow this level to be too small (the above pathological
 +  * behavior); the defaults should avoid that behavior but the sysctls
 +  * are exposed should your workload require adjustment.
 +  *
 +  * If we're using this check for low memory we are replacing the previous
 +  * ones, including the oddball "random" reclaim that appears to fire far
 +  * more often than it should.  We still trigger if the system pages.
 +  *
 +  * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
 +  * status messages when the reclaim status trips on and off, along with the
 +  * page count aggregate that triggered it (and the free space) for each
 +  * event.
 +  */
 +
 + #define	NEWRECLAIM
 + #undef	NEWRECLAIM_DEBUG
 +
 +
    /*
    * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
     * Copyright (c) 2013 by Delphix. All rights reserved.
 ***************
 *** 139,144 ****
 --- 211,230 ----
    
   #include <vm/vm_pageout.h>
    
 + #ifdef	NEWRECLAIM
 + #ifdef	__FreeBSD__
 + #include <sys/sysctl.h>
 + #include <sys/vmmeter.h>
 + /*
 +  * struct cnt became vm_cnt in -head (11-current) at version 1100016; check for it
 +  */
 + #if __FreeBSD_version < 1100016
 + #define	vm_cnt	cnt
 + #endif	/* __FreeBSD_version */
 +
 + #endif	/* __FreeBSD__ */
 + #endif	/* NEWRECLAIM */
 +
    #ifdef illumos
    #ifndef _KERNEL
   /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
 ***************
 *** 203,218 ****
 --- 289,327 ----
   int zfs_arc_shrink_shift = 0;
   int zfs_arc_p_min_shift = 0;
   int zfs_disable_dup_eviction = 0;
 + #ifdef	NEWRECLAIM
 + #ifdef  __FreeBSD__
 + static	int freepages = 0;	/* This much memory is considered critical */
 + static	int percent_target = 0;	/* Additionally reserve "X" percent free RAM */
 + static	int shrink_needed = 0;	/* Shrinkage of ARC cache needed?	*/
 + #endif	/* __FreeBSD__ */
 + #endif	/* NEWRECLAIM */
    
   TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
    TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
    TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
 + #ifdef	NEWRECLAIM
 + #ifdef  __FreeBSD__
 + TUNABLE_INT("vfs.zfs.arc_freepages", &freepages);
 + TUNABLE_INT("vfs.zfs.arc_freepage_percent", &percent_target);
 + TUNABLE_INT("vfs.zfs.arc_shrink_needed", &shrink_needed);
 + #endif	/* __FreeBSD__ */
 + #endif	/* NEWRECLAIM */
 +
    SYSCTL_DECL(_vfs_zfs);
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
       "Maximum ARC size");
   SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
       "Minimum ARC size");
    
 + #ifdef	NEWRECLAIM
 + #ifdef  __FreeBSD__
 + SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepages, CTLFLAG_RWTUN, &freepages, 0, "ARC Free RAM Pages Required");
 + SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
 + SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_shrink_needed, CTLFLAG_RD, &shrink_needed, 0, "ARC Memory Constrained (0 = no, 1 = yes)");
 + #endif	/* __FreeBSD__ */
 + #endif	/* NEWRECLAIM */
 +
    /*
     * Note that buffers can be in one of 6 states:
     *	ARC_anon	- anonymous (discussed below)
 ***************
 *** 2438,2443 ****
 --- 2547,2557 ----
   {
    
   #ifdef _KERNEL
 + #ifdef	NEWRECLAIM_DEBUG
 + 	static	int	xval = -1;
 + 	static	int	oldpercent = 0;
 + 	static	int	oldfreepages = 0;
 + #endif	/* NEWRECLAIM_DEBUG */
    
   	if (needfree)
    		return (1);
 ***************
 *** 2476,2481 ****
 --- 2590,2596 ----
   		return (1);
    
   #if defined(__i386)
 +
   	/*
   	 * If we're on an i386 platform, it's possible that we'll exhaust the
    	 * kernel heap space before we ever run out of available physical
 ***************
 *** 2492,2502 ****
    		return (1);
    #endif
    #else	/* !sun */
    	if (kmem_used() > (kmem_size() * 3) / 4)
    		return (1);
   #endif	/* sun */
    
 - #else
   	if (spa_get_random(100) == 0)
    		return (1);
    #endif
 --- 2607,2671 ----
    		return (1);
    #endif
    #else	/* !sun */
 +
 + #ifdef	NEWRECLAIM
 + #ifdef  __FreeBSD__
 + /*
 +  * Implement the new tunable free RAM algorithm.  We check the free pages
 +  * against the minimum specified target and the percentage that should be
 +  * free.  If we're low we ask for ARC cache shrinkage.  If this is defined
 +  * on a FreeBSD system the older checks are not performed.
 +  *
 +  * Check first to see if we need to init freepages, then test.
 +  */
 + 	if (!freepages) {		/* If zero then (re)init */
 + 		freepages = vm_cnt.v_free_target;
 + #ifdef	NEWRECLAIM_DEBUG
 + 		printf("ZFS ARC: Default vfs.zfs.arc_freepages to [%u]\n", freepages);
 + #endif	/* NEWRECLAIM_DEBUG */
 + 	}
 + #ifdef	NEWRECLAIM_DEBUG
 + 	if (percent_target != oldpercent) {
 + 		printf("ZFS ARC: Reservation percent change to [%d], [%d] pages, [%d] free\n", percent_target, vm_cnt.v_page_count, vm_cnt.v_free_count);
 + 		oldpercent = percent_target;
 + 	}
 + 	if (freepages != oldfreepages) {
 + 		printf("ZFS ARC: Low RAM page change to [%d], [%d] pages, [%d] free\n", freepages, vm_cnt.v_page_count, vm_cnt.v_free_count);
 + 		oldfreepages = freepages;
 + 	}
 + #endif	/* NEWRECLAIM_DEBUG */
 + /*
 +  * Now figure out how much free RAM we require to call the ARC cache status
 +  * "ok".  Add the percentage specified of the total to the base requirement.
 +  */
 +
 + 	if (vm_cnt.v_free_count < (freepages + ((vm_cnt.v_page_count / 100) * percent_target))) {
 + #ifdef	NEWRECLAIM_DEBUG
 + 		if (xval != 1) {
 + 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vm_cnt.v_page_count, vm_cnt.v_free_count, ((vm_cnt.v_free_count * 100) / vm_cnt.v_page_count), freepages, percent_target);
 + 			xval = 1;
 + 		}
 + #endif	/* NEWRECLAIM_DEBUG */
 + 		shrink_needed = 1;
 + 		return(1);
 + 	} else {
 + #ifdef	NEWRECLAIM_DEBUG
 + 		if (xval != 0) {
 + 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), reserved (%u), target pct (%u)\n", vm_cnt.v_page_count, vm_cnt.v_free_count, ((vm_cnt.v_free_count * 100) / vm_cnt.v_page_count), freepages, percent_target);
 + 			xval = 0;
 + 		}
 + #endif	/* NEWRECLAIM_DEBUG */
 + 		shrink_needed = 0;
 + 		return(0);
 + 	}
 +
 + #endif	/* __FreeBSD__ */
 + #endif	/* NEWRECLAIM */
 +
    	if (kmem_used() > (kmem_size() * 3) / 4)
    		return (1);
   #endif	/* sun */
    
   	if (spa_get_random(100) == 0)
    		return (1);
    #endif
 
 -- 
 -- Karl
 karl@denninger.net
 
 
 
 
 


