Date:      Mon, 12 Nov 2012 11:35:42 -0600
From:      Alan Cox <alc@rice.edu>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, pho@freebsd.org, "Sears, Steven" <Steven.Sears@netapp.com>, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: Memory reserves or lack thereof
Message-ID:  <50A1336E.5040401@rice.edu>
In-Reply-To: <20121112133638.GZ73505@kib.kiev.ua>
References:  <A6DE036C6A90C949A25CE89E844237FB2086970A@SACEXCMBX01-PRD.hq.netapp.com> <20121110132019.GP73505@kib.kiev.ua> <CAJUyCcOKHH3TO6qaK9V7UY2HW+p6T74DUUdmbSi4eeGyofrTdQ@mail.gmail.com> <20121112133638.GZ73505@kib.kiev.ua>

On 11/12/2012 07:36, Konstantin Belousov wrote:
> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>
>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>> I have a memory subsystem design question that I'm hoping someone can
>>>> answer.
>>>> I've been looking at a machine that is completely out of memory, as in
>>>>
>>>>  v_free_count = 0,
>>>>  v_cache_count = 0,
>>>>
>>>> I wondered how a machine could completely run out of memory like this,
>>>> especially after finding a lack of interrupt storms or other pathologies
>>>> that would tend to overcommit memory. So I started investigating.
>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>
>>>>       if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>               page_req = VM_ALLOC_SYSTEM;
>>>>       };
>>>>
>>>>       if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>           (page_req == VM_ALLOC_SYSTEM &&
>>>>           cnt.v_free_count + cnt.v_cache_count >
>>>>           cnt.v_interrupt_free_min) ||
>>>>           (page_req == VM_ALLOC_INTERRUPT &&
>>>>           cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>
>>>> The key observation is that if VM_ALLOC_INTERRUPT is set, it will
>>>> allocate every last page.
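
To make those thresholds concrete, here is a minimal restatement of the
guard quoted above (a sketch with a hypothetical helper name, not the
kernel's literal code; cnt is the global struct vmmeter, and avail
stands for cnt.v_free_count + cnt.v_cache_count):

	/*
	 * How far each request class may drain the pool of free and
	 * cached pages before vm_page_alloc() starts returning NULL.
	 */
	static int
	page_alloc_allowed(int page_req, u_int avail)
	{
		if (page_req == VM_ALLOC_INTERRUPT)
			return (avail > 0);	/* may take the last page */
		if (page_req == VM_ALLOC_SYSTEM)
			return (avail > cnt.v_interrupt_free_min);
		return (avail > cnt.v_free_reserved);	/* VM_ALLOC_NORMAL */
	}
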
>>>> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
>>>> perhaps only used from interrupt threads. Not so; see kmem_malloc() or
>>>> uma_small_alloc(), which both contain this mapping:
>>>>       if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>               pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>       else
>>>>               pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>
>>>> Note that M_USE_RESERVE has been deprecated and is used in just a
>>>> handful of places. Also note that lots of code paths come through these
>>>> routines.
>>>> What this means is essentially _any_ allocation using M_NOWAIT will
>>>> bypass whatever reserves have been held back and will take every last
>>>> page available.
>>>> There is no documentation stating M_NOWAIT has this side effect of
>>>> essentially being privileged, so any innocuous piece of code that can't
>>>> block will use it. And of course M_NOWAIT is literally used all over.
>>>> It looks to me like the design of the BSD allocators is focused on
>>>> recovery; they will give all pages away, knowing they can recover.
>>>> Am I missing anything? I would have expected some small number of pages
>>>> to be held in reserve just in case. And I didn't expect M_NOWAIT to be
>>>> a sort of back door for grabbing memory.
>>> Your analysis is right; there is nothing to add or correct.
>>> This is the reason to strongly prefer M_WAITOK.
>>>
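For illustration, the practical difference at a typical call site (the
M_DEVBUF type is chosen arbitrarily):

	/* May sleep, but is guaranteed to succeed; no NULL check needed. */
	p = malloc(sizeof(*p), M_DEVBUF, M_WAITOK | M_ZERO);

	/*
	 * Cannot sleep (e.g., called with a mutex held); can fail, and
	 * today it also digs all the way into the reserves.
	 */
	p = malloc(sizeof(*p), M_DEVBUF, M_NOWAIT | M_ZERO);
	if (p == NULL)
		return (ENOMEM);
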
>> Agreed.  Once upon a time, before SMPng, M_NOWAIT was rarely used.  It
>> was well understood that it should only be used by interrupt handlers.
>>
>> The trouble is that M_NOWAIT conflates two orthogonal things.  The obvious
>> being that the allocation shouldn't sleep.  The other being how far we're
>> willing to deplete the cache/free page queues.
>>
>> When fine-grained locking got sprinkled throughout the kernel, we all too
>> often found ourselves wanting to do allocations without the possibility of
>> blocking.  So, M_NOWAIT became commonplace where it wasn't before.
>>
>> This had the unintended consequence of introducing a lot of memory
>> allocations in the top-half of the kernel, i.e., non-interrupt handling
>> code, that were digging deep into the cache/free page queues.
>>
>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
>> allocation is less likely to succeed than an "M_NOWAIT" allocation.
>> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
>> could only allocate a free page.  M_USE_RESERVE said that it was ok to
>> allocate a cached page even though M_NOWAIT was specified.  Consequently,
>> the system wouldn't dig as far into the free page queue if M_USE_RESERVE
>> was specified, because it was allowed to reclaim a cached page.
>>
>> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
>> dig any deeper into the cache/free page queues than M_WAITOK does and
>> reintroduce an M_USE_RESERVE-like flag that says to dig deep into the
>> cache/free page queues.  The trouble is that we then need to identify all
>> of those places that are implicitly depending on the current behavior of
>> M_NOWAIT also digging deep into the cache/free page queues so that we can
>> add an explicit M_USE_RESERVE.
>>
>> Alan
>>
>> P.S. I suspect that we should also increase the size of the "page reserve"
>> that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
>> many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
>> could actually be satisfied by two pages?
> I am almost sure that most people who use the M_NOWAIT flag do not know
> about its 'allow a deeper drain of the free queue' effect. As such, I
> believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
> expectation of problematic places would be in the swapout path.
>
> I found a single explicit use of M_USE_RESERVE in the kernel,
> so the flip is relatively simple.

Agreed.  Most recently I eliminated several uses from the arm pmap
implementations.  There is, however, one other use:

ofed/include/linux/gfp.h:#define        GFP_ATOMIC      (M_NOWAIT | M_USE_RESERVE)
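
For what it's worth, here is how that define maps before and after the
proposed flip (a sketch, assuming the m2vm_flags() variants discussed
below):

	/*
	 * today:      (M_NOWAIT | M_USE_RESERVE) -> VM_ALLOC_SYSTEM
	 * after flip: (M_NOWAIT | M_USE_RESERVE) -> VM_ALLOC_INTERRUPT,
	 *             i.e., "don't sleep, may dig into the reserves",
	 *             which is what Linux's GFP_ATOMIC promises.
	 */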

> Below is the patch, which I have so far tested only on amd64, where it
> compiled and booted fine.
>
> Peter, could you please give it a run to see whether it hits any obvious
> deadlocks?
>
> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
> index dc9c307..ab1e869 100644
> --- a/sys/amd64/amd64/uma_machdep.c
> +++ b/sys/amd64/amd64/uma_machdep.c
> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>  
>  #include <sys/param.h>
>  #include <sys/lock.h>
> +#include <sys/malloc.h>
>  #include <sys/mutex.h>
>  #include <sys/systm.h>
>  #include <vm/vm.h>
> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags);
>  		if (m == NULL) {
> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
> index f60cdb1..75366e3 100644
> --- a/sys/arm/arm/vm_machdep.c
> +++ b/sys/arm/arm/vm_machdep.c
> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  			ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>  			return (ret);
>  		}
> -		if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -			pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -		else
> -			pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -		if (wait & M_ZERO)
> -			pflags |= VM_ALLOC_ZERO;
> +		pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  		for (;;) {
>  			m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>  			if (m == NULL) {
> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
> index 71caa29..2ce1ca6 100644
> --- a/sys/fs/devfs/devfs_devs.c
> +++ b/sys/fs/devfs/devfs_devs.c
> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>  	struct cdev *cdev;
>  	struct timespec ts;
>  
> -	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
> +	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>  	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>  	if (cdp == NULL)
>  		return (NULL);
> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
> index 37353ff..9f77762 100644
> --- a/sys/ia64/ia64/uma_machdep.c
> +++ b/sys/ia64/ia64/uma_machdep.c
> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
> index 798e632..24baef0 100644
> --- a/sys/mips/mips/uma_machdep.c
> +++ b/sys/mips/mips/uma_machdep.c
> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	void *va;
>  
>  	*flags = UMA_SLAB_PRIV;
> -
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT;
> -	else
> -		pflags = VM_ALLOC_SYSTEM;
> +	pflags = m2vm_flags(wait, 0);
>  
>  	for (;;) {
>  		m = pmap_alloc_direct_page(0, pflags);


This smells fishy, but not because of anything that you did.  It appears
that the mips uma_small_alloc() is unconditionally asking for a
pre-zeroed page.  I'll take a look at this later today.
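
The pattern I suspect, sketched from memory (not verified against
sys/mips/mips/pmap.c), is pmap_alloc_direct_page() ORing VM_ALLOC_ZERO
into the request itself:

	vm_page_t
	pmap_alloc_direct_page(unsigned int index, int req)
	{
		vm_page_t m;

		/* VM_ALLOC_ZERO is added unconditionally, hence "fishy". */
		m = vm_page_alloc_freelist(VM_FREELIST_DIRECT,
		    req | VM_ALLOC_WIRED | VM_ALLOC_ZERO);
		if (m == NULL)
			return (NULL);
		if ((m->flags & PG_ZERO) == 0)
			pmap_zero_page(m);
		return (m);
	}

If so, every page handed back to uma_small_alloc() is pre-zeroed
whether or not the caller asked for M_ZERO.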


> diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
> index a491680..3e320b9 100644
> --- a/sys/powerpc/aim/mmu_oea64.c
> +++ b/sys/powerpc/aim/mmu_oea64.c
> @@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	*flags = UMA_SLAB_PRIV;
>  	needed_lock = !PMAP_LOCKED(kernel_pmap);
>  
> -        if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -                pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -        else
> -                pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -        if (wait & M_ZERO)
> -                pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>          for (;;) {
>                  m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
> index 162c7fb..3882bfa 100644
> --- a/sys/powerpc/aim/slb.c
> +++ b/sys/powerpc/aim/slb.c
> @@ -483,12 +483,7 @@ slb_uma_real_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  		realmax = platform_real_maxaddr();
>  
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc_contig(NULL, 0, pflags, 1, 0, realmax,
> diff --git a/sys/powerpc/aim/uma_machdep.c b/sys/powerpc/aim/uma_machdep.c
> index 39deb43..23a333f 100644
> --- a/sys/powerpc/aim/uma_machdep.c
> +++ b/sys/powerpc/aim/uma_machdep.c
> @@ -56,12 +56,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	int pflags;
>  	
>  	*flags = UMA_SLAB_PRIV;
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/sparc64/sparc64/vm_machdep.c b/sys/sparc64/sparc64/vm_machdep.c
> index cdb94c7..573ab3a 100644
> --- a/sys/sparc64/sparc64/vm_machdep.c
> +++ b/sys/sparc64/sparc64/vm_machdep.c
> @@ -501,14 +501,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>  	PMAP_STATS_INC(uma_nsmall_alloc);
>  
>  	*flags = UMA_SLAB_PRIV;
> -
> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> -	if (wait & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>  
>  	for (;;) {
>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
> diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c
> index 46e7f1c..e4c3704 100644
> --- a/sys/vm/vm_kern.c
> +++ b/sys/vm/vm_kern.c
> @@ -222,12 +222,7 @@ kmem_alloc_attr(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
>  	vm_object_reference(object);
>  	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
>  	    VM_PROT_ALL, 0);
> -	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -	if (flags & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> +	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
>  	VM_OBJECT_LOCK(object);
>  	end_offset = offset + size;
>  	for (; offset < end_offset; offset += PAGE_SIZE) {
> @@ -296,14 +291,7 @@ kmem_alloc_contig(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
>  	vm_object_reference(object);
>  	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
>  	    VM_PROT_ALL, 0);
> -	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
> -	if (flags & M_ZERO)
> -		pflags |= VM_ALLOC_ZERO;
> -	if (flags & M_NODUMP)
> -		pflags |= VM_ALLOC_NODUMP;
> +	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
>  	VM_OBJECT_LOCK(object);
>  	tries = 0;
>  retry:
> @@ -487,11 +475,7 @@ kmem_back(vm_map_t map, vm_offset_t addr, vm_size_t size, int flags)
>  	    entry->wired_count == 0 && (entry->eflags & MAP_ENTRY_IN_TRANSITION)
>  	    == 0, ("kmem_back: entry not found or misaligned"));
>  
> -	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> -	else
> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> -
> +	pflags = m2vm_flags(flags, VM_ALLOC_WIRED);
>  	if (flags & M_ZERO)
>  		pflags |= VM_ALLOC_ZERO;
>  	if (flags & M_NODUMP)


The conversion of these flags is already handled by m2vm_flags().  You
can eliminate these lines.
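
That is, the tail of the kmem_back() hunk above would reduce to just

	pflags = m2vm_flags(flags, VM_ALLOC_WIRED);

with the M_ZERO and M_NODUMP tests that follow it deleted, since
m2vm_flags() already translates both flags.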

 
> diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
> index 70b8416..0286a6d 100644
> --- a/sys/vm/vm_page.h
> +++ b/sys/vm/vm_page.h
> @@ -344,6 +344,24 @@ extern struct mtx_padalign vm_page_queue_mtx;
>  #define	VM_ALLOC_COUNT_SHIFT	16
>  #define	VM_ALLOC_COUNT(count)	((count) << VM_ALLOC_COUNT_SHIFT)
>  
> +#ifdef M_NOWAIT
> +static inline int
> +m2vm_flags(int malloc_flags, int alloc_flags)
> +{
> +	int pflags;
> +
> +	if ((malloc_flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
> +		pflags = VM_ALLOC_SYSTEM | alloc_flags;
> +	else
> +		pflags = VM_ALLOC_INTERRUPT | alloc_flags;


This can be simplified to:

    if ((malloc_flags & M_USE_RESERVE) != 0)
        pflags = VM_ALLOC_INTERRUPT | alloc_flags;
    else
        pflags = VM_ALLOC_SYSTEM | alloc_flags;
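
Note that this is not purely cosmetic: in the version above, a plain
M_WAITOK request matches neither flag, falls into the else branch, and
gets VM_ALLOC_INTERRUPT.  The simplified form sends it to
VM_ALLOC_SYSTEM, matching the historic behavior of the open-coded
checks being replaced.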


> +	if (malloc_flags & M_ZERO)
> +		pflags |= VM_ALLOC_ZERO;
> +	if (malloc_flags & M_NODUMP)
> +		pflags |= VM_ALLOC_NODUMP;
> +	return (pflags);
> +}
> +#endif
> +
>  void vm_page_busy(vm_page_t m);
>  void vm_page_flash(vm_page_t m);
>  void vm_page_io_start(vm_page_t m);



