Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 13 Nov 2012 12:13:33 -0800
From:      Adrian Chadd <adrian@freebsd.org>
To:        Alan Cox <alc@rice.edu>
Cc:        mips@freebsd.org, "Sears, Steven" <Steven.Sears@netapp.com>, alc@freebsd.org, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, pho@freebsd.org, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: Memory reserves or lack thereof
Message-ID:  <CAJ-VmonAAXXm90wbcVi=TLv-XEDkzAY7aNc=WWXh5KHDiCvaQQ@mail.gmail.com>
In-Reply-To: <50A2A7B2.2020302@rice.edu>
References:  <A6DE036C6A90C949A25CE89E844237FB2086970A@SACEXCMBX01-PRD.hq.netapp.com> <20121110132019.GP73505@kib.kiev.ua> <CAJUyCcOKHH3TO6qaK9V7UY2HW%2Bp6T74DUUdmbSi4eeGyofrTdQ@mail.gmail.com> <20121112133638.GZ73505@kib.kiev.ua> <50A1336E.5040401@rice.edu> <50A2A7B2.2020302@rice.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
Hey, great catch!



adrian

On 13 November 2012 12:04, Alan Cox <alc@rice.edu> wrote:
> On 11/12/2012 11:35, Alan Cox wrote:
>> On 11/12/2012 07:36, Konstantin Belousov wrote:
>>> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>>>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel@gmail.com>wrote:
>>>>
>>>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>>>> I have a memory subsystem design question that I'm hoping someone can
>>>>> answer.
>>>>>> I've been looking at a machine that is completely out of memory, as in
>>>>>>
>>>>>>  v_free_count = 0,
>>>>>>  v_cache_count = 0,
>>>>>>
>>>>>> I wondered how a machine could completely run out of memory like this,
>>>>> especially after finding a lack of interrupt storms or other pathologies
>>>>> that would tend to overcommit memory. So I started investigating.
>>>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>>>
>>>>>>       if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>>>               page_req = VM_ALLOC_SYSTEM;
>>>>>>       };
>>>>>>
>>>>>>       if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>>>           (page_req == VM_ALLOC_SYSTEM &&
>>>>>>           cnt.v_free_count + cnt.v_cache_count >
>>>>> cnt.v_interrupt_free_min) ||
>>>>>>           (page_req == VM_ALLOC_INTERRUPT &&
>>>>>>           cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>>>
>>>>>> The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate
>>>>> every last page.
>>>>>> >From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
>>>>> perhaps only used from interrupt threads. Not so, see kmem_malloc() or
>>>>> uma_small_alloc() which both contain this mapping:
>>>>>>       if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>>>               pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>>>       else
>>>>>>               pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>>>
>>>>>> Note that M_USE_RESERVE has been deprecated and is used in just a
>>>>> handful of places. Also note that lots of code paths come through these
>>>>> routines.
>>>>>> What this means is essentially _any_ allocation using M_NOWAIT will
>>>>> bypass whatever reserves have been held back and will take every last page
>>>>> available.
>>>>>> There is no documentation stating M_NOWAIT has this side effect of
>>>>> essentially being privileged, so any innocuous piece of code that can't
>>>>> block will use it. And of course M_NOWAIT is literally used all over.
>>>>>> It looks to me like the design goal of the BSD allocators is on
>>>>> recovery; it will give all pages away knowing it can recover.
>>>>>> Am I missing anything? I would have expected some small number of pages
>>>>> to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
>>>>> sort of back door for grabbing memory.
>>>>> Your analysis is right, there is nothing to add or correct.
>>>>> This is the reason to strongly prefer M_WAITOK.
>>>>>
>>>> Agreed.  Once upon time, before SMPng, M_NOWAIT was rarely used.  It was
>>>> well understand that it should only be used by interrupt handlers.
>>>>
>>>> The trouble is that M_NOWAIT conflates two orthogonal things.  The obvious
>>>> being that the allocation shouldn't sleep.  The other being how far we're
>>>> willing to deplete the cache/free page queues.
>>>>
>>>> When fine-grained locking got sprinkled throughout the kernel, we all to
>>>> often found ourselves wanting to do allocations without the possibility of
>>>> blocking.  So, M_NOWAIT became commonplace, where it wasn't before.
>>>>
>>>> This had the unintended consequence of introducing a lot of memory
>>>> allocations in the top-half of the kernel, i.e., non-interrupt handling
>>>> code, that were digging deep into the cache/free page queues.
>>>>
>>>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
>>>> allocation is less likely to succeed than an "M_NOWAIT" allocation.
>>>> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
>>>> could only allocate a free page.  M_USE_RESERVE said that it ok to allocate
>>>> a cached page even though M_NOWAIT was specified.  Consequently, the system
>>>> wouldn't dig as far into the free page queue if M_USE_RESERVE was
>>>> specified, because it was allowed to reclaim a cached page.
>>>>
>>>> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
>>>> dig any deeper into the cache/free page queues than M_WAITOK does and
>>>> reintroduce a M_USE_RESERVE-like flag that says dig deep into the
>>>> cache/free page queues.  The trouble is that we then need to identify all
>>>> of those places that are implicitly depending on the current behavior of
>>>> M_NOWAIT also digging deep into the cache/free page queues so that we can
>>>> add an explicit M_USE_RESERVE.
>>>>
>>>> Alan
>>>>
>>>> P.S. I suspect that we should also increase the size of the "page reserve"
>>>> that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
>>>> many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
>>>> could actually be satisfied by two pages?
>>> I am almost sure that most of people who put the M_NOWAIT flag, do not
>>> know the 'allow the deeper drain of free queue' effect. As such, I believe
>>> we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectations
>>> of the problematic places would be in the swapout path.
>>>
>>> I found a single explicit use of M_USE_RESERVE in the kernel,
>>> so the flip is relatively simple.
>> Agreed.  Most recently I eliminated several uses from the arm pmap
>> implementations.  There is, however, one other use:
>>
>> ofed/include/linux/gfp.h:#define        GFP_ATOMIC      (M_NOWAIT |
>> M_USE_RESERVE)
>>
>>> Below is the patch which I only compile-tested on amd64, and which booted
>>> fine.
>>>
>>> Peter, could you, please, give it a run, to see obvious deadlocks, if any ?
>>>
>>> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
>>> index dc9c307..ab1e869 100644
>>> --- a/sys/amd64/amd64/uma_machdep.c
>>> +++ b/sys/amd64/amd64/uma_machdep.c
>>> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>>>
>>>  #include <sys/param.h>
>>>  #include <sys/lock.h>
>>> +#include <sys/malloc.h>
>>>  #include <sys/mutex.h>
>>>  #include <sys/systm.h>
>>>  #include <vm/vm.h>
>>> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>      int pflags;
>>>
>>>      *flags = UMA_SLAB_PRIV;
>>> -    if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>>> -    else
>>> -            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>>> -    if (wait & M_ZERO)
>>> -            pflags |= VM_ALLOC_ZERO;
>>> +    pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>>>      for (;;) {
>>>              m = vm_page_alloc(NULL, 0, pflags);
>>>              if (m == NULL) {
>>> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
>>> index f60cdb1..75366e3 100644
>>> --- a/sys/arm/arm/vm_machdep.c
>>> +++ b/sys/arm/arm/vm_machdep.c
>>> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>                      ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>>>                      return (ret);
>>>              }
>>> -            if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -                    pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>> -            else
>>> -                    pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>> -            if (wait & M_ZERO)
>>> -                    pflags |= VM_ALLOC_ZERO;
>>> +            pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>>              for (;;) {
>>>                      m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>>                      if (m == NULL) {
>>> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
>>> index 71caa29..2ce1ca6 100644
>>> --- a/sys/fs/devfs/devfs_devs.c
>>> +++ b/sys/fs/devfs/devfs_devs.c
>>> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>>>      struct cdev *cdev;
>>>      struct timespec ts;
>>>
>>> -    cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
>>> +    cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>>>          ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>>>      if (cdp == NULL)
>>>              return (NULL);
>>> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
>>> index 37353ff..9f77762 100644
>>> --- a/sys/ia64/ia64/uma_machdep.c
>>> +++ b/sys/ia64/ia64/uma_machdep.c
>>> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>      int pflags;
>>>
>>>      *flags = UMA_SLAB_PRIV;
>>> -    if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -            pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>> -    else
>>> -            pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>> -    if (wait & M_ZERO)
>>> -            pflags |= VM_ALLOC_ZERO;
>>> +    pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>>
>>>      for (;;) {
>>>              m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
>>> index 798e632..24baef0 100644
>>> --- a/sys/mips/mips/uma_machdep.c
>>> +++ b/sys/mips/mips/uma_machdep.c
>>> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>      void *va;
>>>
>>>      *flags = UMA_SLAB_PRIV;
>>> -
>>> -    if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -            pflags = VM_ALLOC_INTERRUPT;
>>> -    else
>>> -            pflags = VM_ALLOC_SYSTEM;
>>> +    pflags = m2vm_flags(wait, 0);
>>>
>>>      for (;;) {
>>>              m = pmap_alloc_direct_page(0, pflags);
>>
>> This smells fishy, but not because of anything that you did.  It appears
>> that the mips uma_small_alloc() is unconditionally asking for a
>> pre-zeroed page.  I'll take a look at this later today.
>>
>
> I verified this.  The current implementation of uma_small_alloc() on
> MIPS unconditionally zeroes the page.  Moreover, if M_ZERO is specified
> to uma_small_alloc(), the same page is zeroed twice, once in
> pmap_alloc_direct_page() and again in uma_small_alloc().
>
> I expect to commit the following patch tomorrow.  Kostik, it will
> trivially conflict with your current patch.
>
> Index: mips/include/pmap.h
> ===================================================================
> --- mips/include/pmap.h (revision 242939)
> +++ mips/include/pmap.h (working copy)
> @@ -179,7 +179,6 @@ void pmap_kenter_temporary_free(vm_paddr_t pa);
>  void pmap_flush_pvcache(vm_page_t m);
>  int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
>  void pmap_grow_direct_page_cache(void);
> -vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
>
>  #endif                         /* _KERNEL */
>
> Index: mips/mips/pmap.c
> ===================================================================
> --- mips/mips/pmap.c    (revision 242939)
> +++ mips/mips/pmap.c    (working copy)
> @@ -163,6 +163,7 @@ static vm_page_t pmap_pv_reclaim(pmap_t locked_pma
>  static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t
> va);
>  static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
>      vm_offset_t va);
> +static vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
>  static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
>      vm_page_t m, vm_prot_t prot, vm_page_t mpte);
>  static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq,
> vm_offset_t va,
> @@ -1041,7 +1042,7 @@ pmap_grow_direct_page_cache()
>  #endif
>  }
>
> -vm_page_t
> +static vm_page_t
>  pmap_alloc_direct_page(unsigned int index, int req)
>  {
>         vm_page_t m;
> Index: mips/mips/uma_machdep.c
> ===================================================================
> --- mips/mips/uma_machdep.c     (revision 242939)
> +++ mips/mips/uma_machdep.c     (working copy)
> @@ -50,12 +50,14 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8
>         *flags = UMA_SLAB_PRIV;
>
>         if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -               pflags = VM_ALLOC_INTERRUPT;
> +               pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>         else
> -               pflags = VM_ALLOC_SYSTEM;
> +               pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> +       if (wait & M_ZERO)
> +               pflags |= VM_ALLOC_ZERO;
>
>         for (;;) {
> -               m = pmap_alloc_direct_page(0, pflags);
> +               m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, pflags);
>                 if (m == NULL) {
>                         if (wait & M_NOWAIT)
>                                 return (NULL);
>
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-VmonAAXXm90wbcVi=TLv-XEDkzAY7aNc=WWXh5KHDiCvaQQ>