From owner-freebsd-hackers@FreeBSD.ORG Tue Jan 7 18:15:42 2014
Date: Tue, 07 Jan 2014 20:15:36 +0200
From: Alexander Motin
To: Konstantin Belousov
Cc: hackers@freebsd.org
Subject: Re: UMA caches draining
Message-ID: <52CC4448.90405@FreeBSD.org>
In-Reply-To: <20140107172025.GP59496@kib.kiev.ua>
References: <52CB4F7D.2080909@FreeBSD.org> <20140107054825.GI59496@kib.kiev.ua>
 <52CBCC4F.8020900@FreeBSD.org> <20140107172025.GP59496@kib.kiev.ua>

On 07.01.2014 19:20, Konstantin Belousov wrote:
> On Tue, Jan 07, 2014 at 11:43:43AM +0200, Alexander Motin wrote:
>> On 07.01.2014 07:48, Konstantin Belousov wrote:
>>> On Tue, Jan 07, 2014 at 02:51:09AM +0200, Alexander Motin wrote:
>>>> I have some questions about memory allocation. At this moment our UMA
>>>> never returns freed memory back to the system until it is explicitly
>>>> asked by the pageout daemon via a uma_reclaim() call, which happens
>>>> only when the system is quite low on memory. How does that coexist
>>>> with the buffer cache and other consumers? Will, for example, the
>>>> buffer cache be able to allocate buffers and stay functional when
>>>> most of the system's memory is uselessly consumed by UMA caches? Is
>>>> there some design for how that is supposed to work?
>>> Allocation of the pages which constitute a new buffer creates the
>>> pressure and causes a pagedaemon wakeup if the amount of free pages is
>>> too low. Look at the vm_page_grab() call in allocbuf().
>>> Also note that the buffer cache is not shrunk in response to low
>>> memory events, and buffer pages are excluded from the page daemon
>>> scans since those pages are wired.
>>
>> Thanks. I indeed can't see a respective vm_lowmem handler. But how does
>> it adapt then? It should have some sort of back pressure. And since it
>> can't know about UMA internals, it should probably just see that the
>> system is getting low on physical memory. Won't it shrink itself first
>> in such a case, before the pagedaemon starts its reclamation?
> The buffer cache only caches buffers; it is not supposed to provide the
> file content cache, at least for VMIO. The buffer cache size is capped
> during system configuration; the algorithm to calculate the cache size
> is not easy to re-type, but look at vfs_bio.c:kern_vfs_bio_buffer_alloc().
> On modern machines with 512MB of RAM or more, it is essentially 10% of
> the RAM that is dedicated to the buffer cache.

So it is hard capped and never returns that memory in any case? 10% is
not much, but it still doesn't sound perfect.

>> When the vm_lowmem condition finally fires, it will purge different
>> data from different subsystems, data that is potentially still usable.
>> UMA caches, though, hold no valid data, only an allocation
>> optimization. Shouldn't they be freed first somehow, at least the
>> unused part, as in my patch? Also, I guess having more really free
>> memory should make M_NOWAIT allocations fail less often.
> IMO this is not a right direction. My opinion is that M_NOWAIT
> allocation should be mostly banned from the top level of the kernel,
> and then interrupt threads and the i/o path should try hard to avoid
> allocations at all.

OK, M_NOWAIT was possibly a bad example, though we have a lot of
M_NOWAIT allocations in many important areas of the kernel. But still,
making an M_WAITOK allocation wait in a case where the system could
already have been prepared at hopefully low cost is not perfect either.

> Purging UMA caches on the first sign of a low memory condition would
> make UMA slower, possibly much slower for many workloads which are
> routinely handled now. Our code is accustomed to fast allocators; look
> at how many allocations a typical syscall makes for temp buffers. Such
> a change requires profiling of varying workloads to prove that it does
> not cause regressions.

Full purging on low memory is what the present implementation actually
does. I was proposing a much softer alternative: purging only the
caches that were unused for the last 20 seconds, which in some
situations could allow full purges to be avoided completely (see the
sketch below).

> I suspect that what you do is tailored for a single (ab)user of UMA.
> You might try to split the UMA low memory handler into two, one for
> the abuser, and one for the rest of the caches.

IMO the only "abuse" by ZFS is that it takes UMA and tries to use it
for serious things whose size is significant relative to the total
amount of RAM. And obviously it wants to do that fast too. But the
general problem of UMA is not new: with an increasing number of zones,
a fluctuating load pattern will make different zones grow at different
times, which at some point will inevitably create memory pressure, even
if each consumer, or even all of them together, is size-capped. ZFS
just pushes that to the limit, actively using up to 90 different zones.
If you prefer to see UMA consumers divided into some classes -- fine
(though IMO it is very non-obvious how to decide that in every case),
but what logic would you see there? Should there be another memory
limit, like low and high watermarks?
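
To show what I mean by a "soft" purge, as opposed to the "hard" full
purge on vm_lowmem, here is a rough userland toy model of the idea.
This is not the actual drain_unused.patch, and all names below are made
up for illustration: each zone remembers when its cache was last
touched, and a periodic pass frees cached items only from zones that
have been idle longer than the threshold, leaving busy zones alone.

/*
 * Toy userland model of the "drain unused caches" idea.  This is NOT
 * the actual drain_unused.patch; all names are made up for illustration.
 * Each zone remembers when its cache was last touched; a periodic pass
 * frees cached items only from zones idle longer than the threshold.
 */
#include <stdio.h>
#include <time.h>

#define	IDLE_DRAIN_SECS	20	/* assumed threshold, as in the proposal */

struct toy_zone {
	const char	*tz_name;
	int		 tz_cached_items;	/* items cached, not in use */
	time_t		 tz_last_used;		/* last alloc/free touching cache */
};

/* Called from the allocation/free fast path. */
static void
toy_zone_touch(struct toy_zone *z)
{

	z->tz_last_used = time(NULL);
}

/* Periodic pass (e.g. every 20 seconds): drain only the idle zones. */
static void
toy_drain_unused(struct toy_zone *zones, int nzones)
{
	time_t now = time(NULL);
	int i;

	for (i = 0; i < nzones; i++) {
		struct toy_zone *z = &zones[i];

		if (now - z->tz_last_used >= IDLE_DRAIN_SECS &&
		    z->tz_cached_items > 0) {
			printf("draining %d cached items from %s\n",
			    z->tz_cached_items, z->tz_name);
			z->tz_cached_items = 0;	/* give pages back to the VM */
		}
	}
}

int
main(void)
{
	struct toy_zone zones[] = {
		{ "zio_buf_4096", 128, time(NULL) - 30 },	/* idle 30s */
		{ "mbuf",	   64, time(NULL) - 30 },
	};

	toy_zone_touch(&zones[1]);	/* "mbuf" just used: its cache stays */
	toy_drain_unused(zones, 2);	/* drains only "zio_buf_4096" */
	return (0);
}

The actual patch of course has to apply this decision to the real UMA
per-zone caches rather than to a toy structure, but the policy is the
same one described above.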
Aren't there any benefits in freeing RAM preventively, while there is
still "enough" of it free? Shouldn't we mix "soft" and "hard" purges at
some rate between the low and high watermarks, to keep all consumers
feeling a fair amount of pressure, depending on their class?

>>>> I've made an experimental patch for UMA
>>>> (http://people.freebsd.org/~mav/drain_unused.patch) to make it every
>>>> 20 seconds return back to the system cached memory that was unused
>>>> for the last 20 seconds. The algorithm is quite simple and the patch
>>>> seems to be working, but I am not sure whether I am approaching the
>>>> problem from the right side. Any thoughts?

-- 
Alexander Motin