Date: Fri, 14 Mar 2014 06:12:17 -0500 (CDT)
From: Karl Denninger <karl@fs.denninger.net>
To: FreeBSD-gnats-submit@freebsd.org
Subject: kern/187572: ZFS ARC cache code does not properly handle low memory
Message-ID: <201403141112.s2EBCHEN080610@fs.denninger.net>
Resent-Message-ID: <201403141120.s2EBK0P0048464@freefall.freebsd.org>
>Number:         187572
>Category:       kern
>Synopsis:       ZFS ARC cache code does not properly handle low memory
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:
>Keywords:
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Mar 14 11:20:00 UTC 2014
>Closed-Date:
>Last-Modified:
>Originator:     Karl Denninger
>Release:        FreeBSD 10.0-STABLE amd64
>Organization:
Karls Sushi and Packet Smashers
>Environment:
System: FreeBSD NewFS.denninger.net 10.0-STABLE FreeBSD 10.0-STABLE #11 r263037M: Thu Mar 13 15:47:15 CDT 2014 karl@NewFS.denninger.net:/usr/obj/usr/src/sys/KSD-SMP amd64

Note: Also applies to previous releases

>Description:
ZFS can be convinced to engage in what I can only surmise is pathological behavior.  I've seen no fix for it when it happens, but there are things you can do to mitigate it.

What IMHO _*should*_ happen is that the ARC cache shrinks as necessary to prevent paging, subject to vfs.zfs.arc_min.  To avoid pathological interactions with segments that were paged out hours (or more!) ago and never get paged back in -- because that particular piece of code never executes again, yet the process is still alive, so the system cannot reclaim the space and it shows as "committed" in pstat -s while having no impact on system performance -- the policing here would have to apply a "reasonableness" filter to those pages (e.g. if a page has been out on the page file longer than some interval "X", ignore that allocation unit for this purpose.)

This would cause the ARC cache to flush itself down automatically as executable and data segment RAM commitments increase.  The documentation says that this is how it should work, but in practice it does not behave this way for many workloads.  I have seen "wired" RAM pinned at 20GB on one of my servers here with a fairly large DBMS running -- with pieces of its working set and even a user's shell (!) getting paged off -- yet the ARC cache is not pared down to release memory.  Indeed you can let the system run for hours under these conditions and the ARC wired memory will not decrease.  Cutting back the DBMS's internal buffering does not help.

What I've done here is restrict the ARC cache size in an attempt to prevent this particular bit of bogosity from biting me, and it appears to (sort of) work.  Unfortunately you cannot tune this while the system is running (otherwise a user daemon could conceivably slash away at the arc_max sysctl and force the deallocation of wired memory if it detected paging -- or near-paging, such as free memory below some user-configured threshold); it can only be set at boot time in /boot/loader.conf.

This is something that, should I get myself a nice hunk of free time, I may dive into and attempt to fix.  It would likely take me quite a while to get up to speed, as I've not gotten into the ZFS code at all -- and mistakes in there could easily corrupt files.... (in other words, definitely NOT something to play with on a production system!)

I have to assume there's a pretty-good reason why you can't change arc_max while the system is running; it _*can*_ be changed on a running system on some other implementations (e.g. Solaris.)  It is marked with CTLFLAG_RDTUN in the ARC management file, which prohibits run-time changes, and the only place I see it referenced with a quick look is in the arc_init code.
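
To make the distinction concrete, here is the flag difference in sysctl terms.  Both declarations below appear in the patch at the end of this report; this is just a side-by-side sketch of a read-only tunable versus a runtime-tunable OID:

/* CTLFLAG_RDTUN: readable at runtime, settable only via /boot/loader.conf. */
SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0, "Maximum ARC size");

/* CTLFLAG_RWTUN: also accepts sysctl(8) writes on a running system. */
SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent_target, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");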
Note that the test in arc.c for "arc_reclaim_needed" appears to be pretty basic -- essentially the system will not aggressively try to reclaim memory unless used kmem exceeds 3/4 of its size.  (Snippet from around line 2494 of arc.c in 10-STABLE; path /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs:)

#else	/* !sun */
	if (kmem_used() > (kmem_size() * 3) / 4)
		return (1);
#endif	/* sun */

Up above that there's a test for "vm_paging_needed()" that would (theoretically) appear to trigger first in these situations, but in many cases it doesn't.  IMHO this is too basic a test and leads to pathological situations in that the system may wind up paging things out instead of paring back the ARC cache.  As soon as the working set of something that's actually getting cycles is paged out, system performance in most cases goes straight into the trash.  On Sun machines (from reading the code) it will allegedly try to pare any time the "lotsfree" (plus "needfree" + "extra") amount of free memory is invaded.

As an example, this is what a server I own that is exhibiting this behavior now shows:

20202500 wire
 1414052 act
 2323280 inact
  110340 cache
  414484 free
 1694896 buf

Of that "wired" memory, 15.7G is ARC cache (with a target of 15.81G, so it's essentially right up against the cap.)  That "free" number would be ok if it didn't result in the system having trashy performance -- but it does on occasion.  Incidentally, the allocated swap is about 195k blocks (~200 megabytes), which isn't much all-in, but it's enough to force actual fetches of recently-used programs (e.g. your shell!) from paged-off space.

The thing is that if the test in the code (trigger when 75% of available kmem is consumed) reflected the actual "free" figure above, the system should be aggressively trying to free up ARC cache.  It clearly is not; the included code calls this:

uint64_t
kmem_used(void)
{
	return (vmem_size(kmem_arena, VMEM_ALLOC));
}

What's quite clear is that the system _*thinks*_ it has plenty of free memory when it very clearly is essentially out!  In fact, free memory at the moment (~400MB) is 1.7% of the total, _*not*_ 25%.  From this I surmise that the "vmem_size" call is not returning the sum of all the above "in use" sizes (except perhaps "inact"); were it to do so, that would be essentially 100% of installed RAM and the ARC cache should be actively under shrinkage, but it clearly is not.

>How-To-Repeat:
Set up a cache-heavy workload on large (~terabyte sized or bigger) ZFS filesystems and note that free RAM drops to the point that starvation occurs, while "wired" memory pins at the maximum ARC cache size, even though other demands for RAM should cause the ARC memory congestion control algorithm to evict some of the cache as demand rises.

>Fix:
The context diff below resolves the problem.  We now add up wired, active, inactive, cache and free memory and compute the percentage of that total which is free.  If the percentage free drops below the selected value, the flag is set that asks the ARC cache to free RAM.

This also introduces a runtime tunable that lets you select the free RAM target for the ARC cache in real time, rather than forcing you to reboot to set the ARC's maximum size in /boot/loader.conf.  The target is exported via sysctl as:

vfs.zfs.arc_freepage_percent_target: 25

Changes to this value are effective immediately, allowing runtime configuration to suit your workload.  The default is set to 25% to match the original code's intent, but for large RAM sizes this is probably more conservative than required.
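
As a cross-check of that arithmetic (414484 free out of 20202500 + 1414052 + 2323280 + 110340 + 414484 = 24464656 is about 1.7%), the same computation the patched arc_reclaim_needed() performs can be done from userland.  The following throwaway program is a sketch, not part of the patch; it assumes nothing beyond the standard sysctlbyname(3) interface and the vm.stats.vm page counters used below:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

static u_int
page_count(const char *oid)
{
	u_int val;
	size_t len = sizeof(val);

	if (sysctlbyname(oid, &val, &len, NULL, 0) == -1)
		err(1, "sysctlbyname(%s)", oid);
	return (val);
}

int
main(void)
{
	/* The same five counters the patched arc_reclaim_needed() sums. */
	u_int wire = page_count("vm.stats.vm.v_wire_count");
	u_int active = page_count("vm.stats.vm.v_active_count");
	u_int inactive = page_count("vm.stats.vm.v_inactive_count");
	u_int cache = page_count("vm.stats.vm.v_cache_count");
	u_int freecnt = page_count("vm.stats.vm.v_free_count");
	u_int total = wire + active + inactive + cache + freecnt;

	/* Integer percentage of pages free, as the patch computes it. */
	printf("total %u pages, free %u pages, free pct %u\n",
	    total, freecnt, total ? (freecnt * 100) / total : 0);
	return (0);
}

If the printed free percentage sits below vfs.zfs.arc_freepage_percent_target, the patched code will be returning 1 and asking the ARC to shrink.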
Defining NEWRECLAIM_DEBUG will cause the code to print (on the console) state-change status messages, along with picked-up changes in the reservation percentage.  Note that on a busy system that is actively trying to invade the free-space reservation these notices can get rather "busy"; as such, the option is turned off by default.

*** arc.c.original	Thu Mar 13 09:18:48 2014
--- /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c	Thu Mar 13 15:43:38 2014
***************
*** 18,23 ****
--- 18,84 ----
   *
   * CDDL HEADER END
   */
+
+ /* Karl Denninger (karl@denninger.net), 3/13/2014, FreeBSD-specific
+ *
+ * If "NEWRECLAIM" is defined, change the "low memory" warning that causes
+ * the ARC cache to be pared down.  The reason for the change is that the
+ * apparent attempted algorithm is to start evicting ARC cache when free
+ * pages fall below 25% of installed RAM.  This maps reasonably well to how
+ * Solaris is documented to behave; when "lotsfree" is invaded ZFS is told
+ * to pare down.
+ *
+ * The problem is that on FreeBSD machines the system doesn't appear to be
+ * getting what the authors of the original code thought they were looking at
+ * with its test and as a result that test never triggers.  That leaves the
+ * only reclaim trigger as the "paging needed" status flag, and by the time
+ * that trips the system is already in low-memory trouble.  This can lead to
+ * severe pathological behavior under the following scenario:
+ * - The system starts to page and ARC is evicted.
+ * - The system stops paging as ARC's eviction drops wired RAM a bit.
+ * - ARC starts increasing its allocation again, and wired memory grows.
+ * - A new image is activated, and the system once again attempts to page.
+ * - ARC starts to be evicted again.
+ * - Back to #2
+ *
+ * Note that ZFS's ARC default (unless you override it in /boot/loader.conf)
+ * is to allow the ARC cache to grab nearly all of free RAM, provided nobody
+ * else needs it.  That would be ok if we evicted cache when required.
+ *
+ * Unfortunately the system can get into a state where it never
+ * manages to page anything of materiality back in, as if there is active
+ * I/O the ARC will start grabbing space once again as soon as the memory
+ * contention state drops.  For this reason the "paging is occurring" flag
+ * should be the **last resort** condition for ARC eviction; you want to
+ * (as Solaris does) start when there is material free RAM left in the hope
+ * of never getting into the condition where you're potentially paging off
+ * executables in favor of leaving disk cache allocated.  That's a recipe
+ * for terrible overall system performance.
+ *
+ * To fix this we instead grab four OIDs out of the sysctl status
+ * messages -- wired pages, active pages, inactive pages and cache (vnodes?)
+ * pages, sum those and compare against the free page count from the
+ * VM sysctl status OID, giving us a percentage of pages free.  This
+ * is checked against a new tunable "vfs.zfs.arc_freepage_percent_target"
+ * and if less, we declare the system low on memory.
+ *
+ * Note that this sysctl variable is runtime tunable if you have reason
+ * to change it (e.g. you want more or less RAM free to be the "clean up"
+ * threshold.)
+ *
+ * If this test is enabled the previous algorithm is still checked in the
+ * event this test fails, although that previous test should be a no-op.
+ *
+ * If you turn on NEWRECLAIM_DEBUG then the kernel will print on the console
+ * status messages when the reclaim status trips on and off, along with the
+ * page count aggregate that triggered it (and the free space) for each
+ * event.
+ */
+
+ #define NEWRECLAIM
+ #undef NEWRECLAIM_DEBUG
+
+
  /*
   * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
   * Copyright (c) 2013 by Delphix. All rights reserved.
***************
*** 139,144 ****
--- 200,211 ----
  #include <vm/vm_pageout.h>

+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ #include <sys/sysctl.h>
+ #endif
+ #endif /* NEWRECLAIM */
+
  #ifdef illumos
  #ifndef _KERNEL
  /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */
***************
*** 203,218 ****
--- 270,302 ----
  int zfs_arc_shrink_shift = 0;
  int zfs_arc_p_min_shift = 0;
  int zfs_disable_dup_eviction = 0;
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ static int percent_target = 25;
+ #endif
+ #endif

  TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
  TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
  TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ TUNABLE_INT("vfs.zfs.arc_freepage_percent_target", &percent_target);
+ #endif
+ #endif
+
  SYSCTL_DECL(_vfs_zfs);
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0, "Maximum ARC size");
  SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0, "Minimum ARC size");
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ SYSCTL_INT(_vfs_zfs, OID_AUTO, arc_freepage_percent_target, CTLFLAG_RWTUN, &percent_target, 0, "ARC Free RAM Target percentage");
+ #endif
+ #endif
+
  /*
   * Note that buffers can be in one of 6 states:
   *	ARC_anon - anonymous (discussed below)
***************
*** 2438,2443 ****
--- 2522,2543 ----
  {
  #ifdef _KERNEL
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ 	u_int vmwire = 0;
+ 	u_int vmactive = 0;
+ 	u_int vminactive = 0;
+ 	u_int vmcache = 0;
+ 	u_int vmfree = 0;
+ 	u_int vmtotal = 0;
+ 	int percent = 25;
+ 	size_t vmsize;
+ #ifdef NEWRECLAIM_DEBUG
+ 	static int xval = -1;
+ 	static int oldpercent = 0;
+ #endif /* NEWRECLAIM_DEBUG */
+ #endif /* NEWRECLAIM */
+ #endif

  	if (needfree)
  		return (1);
***************
*** 2492,2502 ****
  		return (1);
  #endif
  #else	/* !sun */
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */

- #else
  	if (spa_get_random(100) == 0)
  		return (1);
  #endif
--- 2592,2656 ----
  		return (1);
  #endif
  #else	/* !sun */
+
+ #ifdef NEWRECLAIM
+ #ifdef __FreeBSD__
+ /*
+ * Implement the new tunable free RAM algorithm.  We check the various page
+ * VM stats and add them up, then check the free count percentage against
+ * the specified target.  If we're under the target we are memory constrained
+ * and ask for ARC cache shrinkage.
+ */
+ 	vmsize = sizeof(vmwire);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_wire_count", &vmwire, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmactive);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_active_count", &vmactive, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vminactive);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_inactive_count", &vminactive, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmcache);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_cache_count", &vmcache, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(vmfree);
+ 	kernel_sysctlbyname(curthread, "vm.stats.vm.v_free_count", &vmfree, &vmsize, NULL, 0, NULL, 0);
+ 	vmsize = sizeof(percent);
+ 	kernel_sysctlbyname(curthread, "vfs.zfs.arc_freepage_percent_target", &percent, &vmsize, NULL, 0, NULL, 0);
+ 	vmtotal = vmwire + vmactive + vminactive + vmcache + vmfree;
+ #ifdef NEWRECLAIM_DEBUG
+ 	if (percent != oldpercent) {
+ 		printf("ZFS ARC: Reservation change to [%d], [%d] pages, [%d] free\n", percent, vmtotal, vmfree);
+ 		oldpercent = percent;
+ 	}
+ #endif
+
+ 	if (!vmtotal) {
+ 		vmtotal = 1;	/* Protect against divide by zero */
+ 				/* (should be impossible, but...) */
+ 	}
+
+ 	if (((vmfree * 100) / vmtotal) < percent) {
+ #ifdef NEWRECLAIM_DEBUG
+ 		if (xval != 1) {
+ 			printf("ZFS ARC: RECLAIM total %u, free %u, free pct (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), percent);
+ 			xval = 1;
+ 		}
+ #endif /* NEWRECLAIM_DEBUG */
+ 		return(1);
+ #ifdef NEWRECLAIM_DEBUG
+ 	} else {
+ 		if (xval != 0) {
+ 			printf("ZFS ARC: NORMAL total %u, free %u, free pct (%u), target pct (%u)\n", vmtotal, vmfree, ((vmfree * 100) / vmtotal), percent);
+ 			xval = 0;
+ 		}
+ #endif /* NEWRECLAIM_DEBUG */
+ 	}
+
+
+ #endif /* __FreeBSD__ */
+ #endif /* NEWRECLAIM */
+
  	if (kmem_used() > (kmem_size() * 3) / 4)
  		return (1);
  #endif	/* sun */

  	if (spa_get_random(100) == 0)
  		return (1);
  #endif

>Release-Note:
>Audit-Trail:
>Unformatted: