Date: Tue, 18 Mar 2014 18:55:21 +0200
From: Volodymyr Kostyrko <c.kworr@gmail.com>
To: Andriy Gapon <avg@FreeBSD.org>, freebsd-fs@FreeBSD.org
Subject: Re: kern/187594: [zfs] [patch] ZFS ARC behavior problem and fix
Message-ID: <53287A79.9060807@b1t.name>
In-Reply-To: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
References: <201403181520.s2IFK1M3069036@freefall.freebsd.org>
18.03.2014 17:20, Andriy Gapon wrote:
> Karl Denninger <karl@fs.denninger.net> wrote:
> > ZFS can be convinced to engage in pathological behavior due to a bad
> > low-memory test in arc.c
> >
> > The offending file is at
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c; it allegedly
> > checks for 25% free memory, and if it is less asks for the cache to shrink.
> >
> > (snippet from arc.c around line 2494 of arc.c in 10-STABLE; path
> > /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs)
> >
> > #else   /* !sun */
> >         if (kmem_used() > (kmem_size() * 3) / 4)
> >                 return (1);
> > #endif  /* sun */
> >
> > Unfortunately these two functions do not return what the authors thought
> > they did.  It's clear what they're trying to do from the Solaris-specific
> > code up above this test.
>
> No, these functions do return what the authors think they do.
> The check is for KVA usage (kernel virtual address space), not for
> physical memory.
>
> > The result is that the cache only shrinks when vm_paging_needed() tests
> > true, but by that time the system is in serious memory trouble and by
>
> No, it is not.
> The description and numbers here are a little bit outdated, but they
> should give an idea of how paging works in general:
> https://wiki.freebsd.org/AvgPageoutAlgorithm
>
> > triggering only there it actually drives the system further into paging,
>
> How does ARC eviction drive the system further into paging?
>
> > because the pager will not recall pages from the swap until they are next
> > executed.  This leads the ARC to try to fill in all the available RAM even
> > though pages have been pushed off onto swap.  Not good.
>
> Unused physical memory is a waste.  It is true that ARC tries to use as
> much memory as it is allowed.  The same applies to the page cache
> (Active, Inactive).  Memory management is a dynamic system and there
> are a few competing agents.

I'd rather see that capped at a maximum of 500M or 5% of memory.  On a
loaded server this wouldn't hurt performance, but it would give the VM
system a good window to stay reasonable.
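Roughly what I have in mind, as a userland sketch only - the names
arc_free_target, freemem_bytes and arc_reclaim_needed_sketch are made up
here and this is not the real arc.c logic - is to drive ARC shrinking
from free physical memory dropping below a modest target, rather than
from KVA usage:

	#include <stdint.h>

	#define MB	(1024ULL * 1024ULL)

	/* made-up tunable: e.g. 500M, or 5% of RAM, whatever fits the box */
	static uint64_t arc_free_target = 500 * MB;

	/* sketch of a reclaim test based on free physical memory */
	static int
	arc_reclaim_needed_sketch(uint64_t freemem_bytes)
	{
		return (freemem_bytes < arc_free_target);
	}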
> It is hard to correctly tune that system using a large hammer such as
> your patch.  I believe that with your patch ARC will get shrunk to its
> minimum size in due time.  Active + Inactive will grow to use the
> memory that you are denying to ARC, driving Free below a threshold,
> which will reduce ARC.  Repeated enough times this will drive ARC to
> its minimum.

But which is worse - having program memory paged out to disk, or some
random data from the disk being cached?  Yes, I know there are
situations where a large amount of inactive memory hurts performance,
but putting the file cache above inactive memory is bad too.  I see no
benefit in having a 4G ARC while 2G of inactive memory has been swapped
out, leaving Inactive at 50M.  Any Java service can hold on to a lot of
memory that it only needs occasionally; most of that memory gets swapped
out, so the process becomes slow - but at least we can browse the disk
faster...  The only real fix is to make ARC pages and inactive pages
equally likely to be evicted.

> Also, there are a few technical problems with the patch:
> - you don't need to use sysctl interface in kernel, the values you
>   need are available directly, just take a look at e.g. implementation
>   of vm_paging_needed()
> - similarly, querying vfs.zfs.arc_freepage_percent_target value via
>   kernel_sysctlbyname is just bogus; you can use percent_target directly
> - you don't need to sum various page counters to get a total count,
>   there is v_page_count
>
> Lastly, can you try to test reverting your patch and instead setting
> vm.lowmem_period=0 ?

Actually I have already tried that patch and compared it against
lowmem_period.  The patch works much better despite actually being a
crutch...

The whole thing comes down to two issues:

1. The kernel cannot reorder memory when some process (like VirtualBox)
needs to allocate a big chunk at once.  Right now the kernel's only
working strategy is to push inactive pages to swap, even when there is
enough free memory to hold the whole allocation; there is no in-memory
reordering.  And since ARC shrinks only when free memory is low, it
completely ignores this situation and doesn't return a single page to
the VM.

2. What ARC takes cannot be freed, because there is no simple opposite
interface to get X blocks back from ARC.  It would be much better if ARC
were arranged so that the system could shrink it with a simple call,
like the page cache.  Without this we keep taking the following route:

* the system needs space;
* ARC starts shrinking;
* while ARC shrinks, some memory is written out to swap and becomes
  available;
* the memory freed by swapping is grabbed and the process starts working;
* ARC finishes shrinking and starts to grow again because of disk
  activity.

As far as I understand, our VM system tries to keep a predefined
percentage of memory clean, or at least backed by swap, so that it can
be reclaimed quickly.  So swapping wins, ARC loses, and the swapped-out
pages are never read back unless explicitly required.  This happens
because it is too late to evict anything from ARC by the time we need
memory.  If there were a way for ARC to mark some pages as freely
purgeable (probably with a callback to tell ARC which pages were
purged), I think this problem would be gone.
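To sketch what I mean by "freely purgeable" - every name below is made
up, nothing like this exists in the tree, it only illustrates the shape
of such an interface:

	#include <stddef.h>

	/* Callback type: the VM tells ARC which range was actually purged. */
	typedef void (*arc_purged_cb_t)(void *arc_tag, void *addr, size_t len);

	/* One purgeable range registered by ARC (toy bookkeeping). */
	struct purgeable {
		void		*tag;
		void		*addr;
		size_t		 len;
		arc_purged_cb_t	 cb;
	};

	/*
	 * Toy "page daemon" side: when memory is needed, take a registered
	 * purgeable range first and notify ARC, instead of swapping out
	 * Inactive pages and only later asking ARC to shrink.
	 */
	static size_t
	vm_purge_one(struct purgeable *p)
	{
		size_t reclaimed = p->len;

		/* ... hand the pages back to the free list here ... */
		p->cb(p->tag, p->addr, p->len);	/* tell ARC what was purged */
		p->len = 0;
		return (reclaimed);
	}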
--
Sphinx of black quartz, judge my vow.

Want to link to this message? Use this URL:
<https://mail-archive.FreeBSD.org/cgi/mid.cgi?53287A79.9060807>