From: Alan Cox <alc@rice.edu>
Date: Tue, 13 Nov 2012 14:04:02 -0600
To: "freebsd-hackers@freebsd.org"
Cc: Konstantin Belousov, alc@freebsd.org, mips@freebsd.org, "Sears, Steven", pho@freebsd.org
Subject: Re: Memory reserves or lack thereof
Message-ID: <50A2A7B2.2020302@rice.edu>
In-Reply-To: <50A1336E.5040401@rice.edu>
References: <20121110132019.GP73505@kib.kiev.ua> <20121112133638.GZ73505@kib.kiev.ua> <50A1336E.5040401@rice.edu>

On 11/12/2012 11:35, Alan Cox wrote:
> On 11/12/2012 07:36, Konstantin Belousov wrote:
>> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov wrote:
>>>
>>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>>> I have a memory subsystem design question that I'm hoping someone
>>>>> can answer.
>>>>>
>>>>> I've been looking at a machine that is completely out of memory, as in
>>>>>
>>>>> v_free_count = 0,
>>>>> v_cache_count = 0,
>>>>>
>>>>> I wondered how a machine could completely run out of memory like
>>>>> this, especially after finding a lack of interrupt storms or other
>>>>> pathologies that would tend to overcommit memory. So I started
>>>>> investigating.
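As an aside, the two counters quoted above are exported to userland through
sysctl, so a depleted machine can be sampled without kernel debugging. Below
is a minimal sketch using sysctlbyname(3); the OID names
vm.stats.vm.v_free_count and vm.stats.vm.v_cache_count are assumed to exist
as on a stock kernel of this vintage.

/*
 * Illustrative sketch only: print the free and cache page counts that
 * the discussion above refers to.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	u_int free_count, cache_count;
	size_t len;

	len = sizeof(free_count);
	if (sysctlbyname("vm.stats.vm.v_free_count", &free_count, &len,
	    NULL, 0) == -1)
		return (1);
	len = sizeof(cache_count);
	if (sysctlbyname("vm.stats.vm.v_cache_count", &cache_count, &len,
	    NULL, 0) == -1)
		return (1);
	printf("v_free_count=%u v_cache_count=%u\n", free_count, cache_count);
	return (0);
}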
>>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>>
>>>>> 	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>> 		page_req = VM_ALLOC_SYSTEM;
>>>>> 	};
>>>>>
>>>>> 	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>> 	    (page_req == VM_ALLOC_SYSTEM &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
>>>>> 	    (page_req == VM_ALLOC_INTERRUPT &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>>
>>>>> The key observation is that if VM_ALLOC_INTERRUPT is set, it will
>>>>> allocate every last page.
>>>>>
>>>>> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
>>>>> perhaps only used from interrupt threads. Not so; see kmem_malloc() or
>>>>> uma_small_alloc(), which both contain this mapping:
>>>>>
>>>>> 	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>> 		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>> 	else
>>>>> 		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>>
>>>>> Note that M_USE_RESERVE has been deprecated and is used in just a
>>>>> handful of places. Also note that lots of code paths come through
>>>>> these routines.
>>>>>
>>>>> What this means is that essentially _any_ allocation using M_NOWAIT
>>>>> will bypass whatever reserves have been held back and will take every
>>>>> last page available.
>>>>>
>>>>> There is no documentation stating that M_NOWAIT has this side effect
>>>>> of essentially being privileged, so any innocuous piece of code that
>>>>> can't block will use it. And of course M_NOWAIT is literally used all
>>>>> over.
>>>>>
>>>>> It looks to me like the design goal of the BSD allocators is recovery;
>>>>> they will give all pages away, knowing they can recover.
>>>>>
>>>>> Am I missing anything? I would have expected some small number of
>>>>> pages to be held in reserve just in case. And I didn't expect M_NOWAIT
>>>>> to be a sort of back door for grabbing memory.
>>>> Your analysis is right, there is nothing to add or correct.
>>>> This is the reason to strongly prefer M_WAITOK.
>>>>
>>> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was
>>> well understood that it should only be used by interrupt handlers.
>>>
>>> The trouble is that M_NOWAIT conflates two orthogonal things. The
>>> obvious one being that the allocation shouldn't sleep. The other being
>>> how far we're willing to deplete the cache/free page queues.
>>>
>>> When fine-grained locking got sprinkled throughout the kernel, we all
>>> too often found ourselves wanting to do allocations without the
>>> possibility of blocking. So, M_NOWAIT became commonplace, where it
>>> wasn't before.
>>>
>>> This had the unintended consequence of introducing a lot of memory
>>> allocations in the top half of the kernel, i.e., non-interrupt handling
>>> code, that were digging deep into the cache/free page queues.
>>>
>>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
>>> allocation is less likely to succeed than an "M_NOWAIT" allocation.
>>> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page;
>>> it could only allocate a free page. M_USE_RESERVE said that it was ok to
>>> allocate a cached page even though M_NOWAIT was specified. Consequently,
>>> the system wouldn't dig as far into the free page queue if M_USE_RESERVE
>>> was specified, because it was allowed to reclaim a cached page.
>>>
>>> In conclusion, I think it's time that we change M_NOWAIT so that it
>>> doesn't dig any deeper into the cache/free page queues than M_WAITOK
>>> does and reintroduce an M_USE_RESERVE-like flag that says dig deep into
>>> the cache/free page queues. The trouble is that we then need to identify
>>> all of those places that are implicitly depending on the current
>>> behavior of M_NOWAIT also digging deep into the cache/free page queues
>>> so that we can add an explicit M_USE_RESERVE.
>>>
>>> Alan
>>>
>>> P.S. I suspect that we should also increase the size of the "page
>>> reserve" that is kept for VM_ALLOC_INTERRUPT allocations in
>>> vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like
>>> flag in today's kernel could actually be satisfied by two pages?
>> I am almost sure that most of the people who use the M_NOWAIT flag do not
>> know about the 'allow the deeper drain of free queue' effect. As such, I
>> believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
>> expectation of problematic places would be in the swapout path.
>>
>> I found a single explicit use of M_USE_RESERVE in the kernel,
>> so the flip is relatively simple.
> Agreed. Most recently I eliminated several uses from the arm pmap
> implementations. There is, however, one other use:
>
> ofed/include/linux/gfp.h:#define GFP_ATOMIC (M_NOWAIT | M_USE_RESERVE)
>
>> Below is the patch, which I only compile-tested on amd64, and which
>> booted fine.
>>
>> Peter, could you please give it a run, to see obvious deadlocks, if any?
>>
>> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
>> index dc9c307..ab1e869 100644
>> --- a/sys/amd64/amd64/uma_machdep.c
>> +++ b/sys/amd64/amd64/uma_machdep.c
>> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>>  
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	int pflags;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags);
>>  		if (m == NULL) {
>> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
>> index f60cdb1..75366e3 100644
>> --- a/sys/arm/arm/vm_machdep.c
>> +++ b/sys/arm/arm/vm_machdep.c
>> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>>  		return (ret);
>>  	}
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>  		if (m == NULL) {
>> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
>> index 71caa29..2ce1ca6 100644
>> --- a/sys/fs/devfs/devfs_devs.c
>> +++ b/sys/fs/devfs/devfs_devs.c
>> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>>  	struct cdev *cdev;
>>  	struct timespec ts;
>>  
>> -	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
>> +	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>>  	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>>  	if (cdp == NULL)
>>  		return (NULL);
>> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
>> index 37353ff..9f77762 100644
>> --- a/sys/ia64/ia64/uma_machdep.c
>> +++ b/sys/ia64/ia64/uma_machdep.c
>> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	int pflags;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>  
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
>> index 798e632..24baef0 100644
>> --- a/sys/mips/mips/uma_machdep.c
>> +++ b/sys/mips/mips/uma_machdep.c
>> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	void *va;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM;
>> +	pflags = m2vm_flags(wait, 0);
>>  
>>  	for (;;) {
>>  		m = pmap_alloc_direct_page(0, pflags);
>
> This smells fishy, but not because of anything that you did. It appears
> that the mips uma_small_alloc() is unconditionally asking for a
> pre-zeroed page. I'll take a look at this later today.
>

I verified this. The current implementation of uma_small_alloc() on MIPS
unconditionally zeroes the page. Moreover, if M_ZERO is specified to
uma_small_alloc(), the same page is zeroed twice, once in
pmap_alloc_direct_page() and again in uma_small_alloc().

I expect to commit the following patch tomorrow. Kostik, it will trivially
conflict with your current patch.
Index: mips/include/pmap.h
===================================================================
--- mips/include/pmap.h	(revision 242939)
+++ mips/include/pmap.h	(working copy)
@@ -179,7 +179,6 @@ void pmap_kenter_temporary_free(vm_paddr_t pa);
 void pmap_flush_pvcache(vm_page_t m);
 int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
 void pmap_grow_direct_page_cache(void);
-vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
 
 #endif				/* _KERNEL */
Index: mips/mips/pmap.c
===================================================================
--- mips/mips/pmap.c	(revision 242939)
+++ mips/mips/pmap.c	(working copy)
@@ -163,6 +163,7 @@ static vm_page_t pmap_pv_reclaim(pmap_t locked_pma
 static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
     vm_offset_t va);
+static vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va,
@@ -1041,7 +1042,7 @@ pmap_grow_direct_page_cache()
 #endif
 }
 
-vm_page_t
+static vm_page_t
 pmap_alloc_direct_page(unsigned int index, int req)
 {
 	vm_page_t m;
Index: mips/mips/uma_machdep.c
===================================================================
--- mips/mips/uma_machdep.c	(revision 242939)
+++ mips/mips/uma_machdep.c	(working copy)
@@ -50,12 +50,14 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8
 
 	*flags = UMA_SLAB_PRIV;
 	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT;
+		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
 	else
-		pflags = VM_ALLOC_SYSTEM;
+		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
+	if (wait & M_ZERO)
+		pflags |= VM_ALLOC_ZERO;
 
 	for (;;) {
-		m = pmap_alloc_direct_page(0, pflags);
+		m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, pflags);
 		if (m == NULL) {
 			if (wait & M_NOWAIT)
 				return (NULL);
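Kostik's patch quoted earlier relies on an m2vm_flags() helper whose
definition is not included in the quoted hunks. The sketch below is a
hypothetical reconstruction, not the actual patch: it merely shows one shape
such a helper could take under the proposed semantics, where only an explicit
M_USE_RESERVE selects the deep VM_ALLOC_INTERRUPT reserve and plain M_NOWAIT
no longer does. The second parameter matches the way the callers in the patch
pass extra VM_ALLOC_* bits such as VM_ALLOC_WIRED.

/*
 * Hypothetical sketch of m2vm_flags(); the real definition is not shown
 * in the thread.  In practice it would live in a VM header such as
 * vm/vm_page.h (placement assumed here).
 */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

static inline int
m2vm_flags(int malloc_flags, int extra_flags)
{
	int pflags;

	/* Dig into the interrupt reserve only on explicit request. */
	pflags = (malloc_flags & M_USE_RESERVE) != 0 ?
	    VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM;
	if ((malloc_flags & M_ZERO) != 0)
		pflags |= VM_ALLOC_ZERO;
	return (pflags | extra_flags);
}

With a helper like this, a caller such as uma_small_alloc() collapses its
flag translation to a single line, e.g. pflags = m2vm_flags(wait,
VM_ALLOC_WIRED), as the quoted hunks do.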