Date:      Mon, 12 Nov 2012 14:58:38 +0100
From:      Peter Holm <peter@holm.cc>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        alc@freebsd.org, "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, "Sears, Steven" <Steven.Sears@netapp.com>
Subject:   Re: Memory reserves or lack thereof
Message-ID:  <20121112135838.GA80041@x2.osted.lan>
In-Reply-To: <20121112133638.GZ73505@kib.kiev.ua>
References:  <A6DE036C6A90C949A25CE89E844237FB2086970A@SACEXCMBX01-PRD.hq.netapp.com> <20121110132019.GP73505@kib.kiev.ua> <CAJUyCcOKHH3TO6qaK9V7UY2HW%2Bp6T74DUUdmbSi4eeGyofrTdQ@mail.gmail.com> <20121112133638.GZ73505@kib.kiev.ua>

On Mon, Nov 12, 2012 at 03:36:38PM +0200, Konstantin Belousov wrote:
> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> > On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > 
> > > On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
> > > > I have a memory subsystem design question that I'm hoping someone can
> > > > answer.
> > > >
> > > > I've been looking at a machine that is completely out of memory, as in
> > > >
> > > >  v_free_count = 0,
> > > >  v_cache_count = 0,
> > > >
> > > > I wondered how a machine could completely run out of memory like this,
> > > > especially after finding a lack of interrupt storms or other pathologies
> > > > that would tend to overcommit memory. So I started investigating.
> > > >
> > > > Most allocators come down to vm_page_alloc(), which has this guard:
> > > >
> > > >       if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > >               page_req = VM_ALLOC_SYSTEM;
> > > >       };
> > > >
> > > >       if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > >           (page_req == VM_ALLOC_SYSTEM &&
> > > >           cnt.v_free_count + cnt.v_cache_count >
> > > cnt.v_interrupt_free_min) ||
> > > >           (page_req == VM_ALLOC_INTERRUPT &&
> > > >           cnt.v_free_count + cnt.v_cache_count > 0)) {
> > > >
> > > > The key observation is that if VM_ALLOC_INTERRUPT is set, it will allocate
> > > > every last page.
> > > >
> > > > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
> > > > perhaps only used from interrupt threads. Not so, see kmem_malloc() or
> > > > uma_small_alloc() which both contain this mapping:
> > > >
> > > >       if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > >               pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > >       else
> > > >               pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > > >
> > > > Note that M_USE_RESERVE has been deprecated and is used in just a
> > > > handful of places. Also note that lots of code paths come through these
> > > > routines.
> > > >
> > > > What this means is essentially _any_ allocation using M_NOWAIT will
> > > > bypass whatever reserves have been held back and will take every last page
> > > > available.
> > > >
> > > > There is no documentation stating M_NOWAIT has this side effect of
> > > > essentially being privileged, so any innocuous piece of code that can't
> > > > block will use it. And of course M_NOWAIT is literally used all over.
> > > >
> > > > It looks to me like the design focus of the BSD allocators is on
> > > > recovery; they will give all pages away knowing they can recover.
> > > >
> > > > Am I missing anything? I would have expected some small number of pages
> > > > to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
> > > > sort of back door for grabbing memory.
> > > >
> > >
> > > Your analysis is right, there is nothing to add or correct.
> > > This is the reason to strongly prefer M_WAITOK.
> > >
> > 
> > Agreed.  Once upon a time, before SMPng, M_NOWAIT was rarely used.  It was
> > well understood that it should only be used by interrupt handlers.
> > 
> > The trouble is that M_NOWAIT conflates two orthogonal things.  The obvious
> > being that the allocation shouldn't sleep.  The other being how far we're
> > willing to deplete the cache/free page queues.
> > 
> > When fine-grained locking got sprinkled throughout the kernel, we all too
> > often found ourselves wanting to do allocations without the possibility of
> > blocking.  So, M_NOWAIT became commonplace, where it wasn't before.
> > 
> > This had the unintended consequence of introducing a lot of memory
> > allocations in the top-half of the kernel, i.e., non-interrupt handling
> > code, that were digging deep into the cache/free page queues.
> > 
> > Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
> > allocation is less likely to succeed than an "M_NOWAIT" allocation.
> > However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
> > could only allocate a free page.  M_USE_RESERVE said that it was ok to allocate
> > a cached page even though M_NOWAIT was specified.  Consequently, the system
> > wouldn't dig as far into the free page queue if M_USE_RESERVE was
> > specified, because it was allowed to reclaim a cached page.
> > 
> > In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
> > dig any deeper into the cache/free page queues than M_WAITOK does and
> > reintroduce a M_USE_RESERVE-like flag that says dig deep into the
> > cache/free page queues.  The trouble is that we then need to identify all
> > of those places that are implicitly depending on the current behavior of
> > M_NOWAIT also digging deep into the cache/free page queues so that we can
> > add an explicit M_USE_RESERVE.
> > 
> > Alan
> > 
> > P.S. I suspect that we should also increase the size of the "page reserve"
> > that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
> > many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
> > could actually be satisfied by two pages?
> 
> I am almost sure that most people who use the M_NOWAIT flag do not
> know about the 'allow the deeper drain of free queue' effect. As such, I
> believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. The only
> problematic places I expect would be in the swapout path.
> 
> I found a single explicit use of M_USE_RESERVE in the kernel,
> so the flip is relatively simple.
> 
> Below is the patch which I only compile-tested on amd64, and which booted
> fine.
> 
> Peter, could you please give it a run, to look for obvious deadlocks, if any?
> 

Glad to.

- Peter
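
For readers following the thread, the reserve behaviour discussed above can be modelled in a few lines of userland C. This is only a sketch under stated assumptions: the names (struct vm_counts, page_alloc_allowed, req_current, req_flipped, MF_NOWAIT, MF_USE_RESERVE) are hypothetical stand-ins and are not kernel identifiers. page_alloc_allowed() copies the guard quoted from vm_page_alloc(), req_current() copies the mapping quoted from kmem_malloc()/uma_small_alloc(), and req_flipped() is one possible reading of the proposed flip; the actual patch is not reproduced in this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for VM_ALLOC_NORMAL/SYSTEM/INTERRUPT. */
enum page_req { REQ_NORMAL, REQ_SYSTEM, REQ_INTERRUPT };

/* Hypothetical stand-ins for M_NOWAIT and M_USE_RESERVE; values are arbitrary. */
#define MF_NOWAIT      0x1
#define MF_USE_RESERVE 0x2

/* Models the cnt.v_* counters named in the thread. */
struct vm_counts {
	unsigned free_count;         /* cnt.v_free_count */
	unsigned cache_count;        /* cnt.v_cache_count */
	unsigned free_reserved;      /* cnt.v_free_reserved */
	unsigned interrupt_free_min; /* cnt.v_interrupt_free_min */
};

/* Same three-way test as the guard quoted from vm_page_alloc(). */
static bool
page_alloc_allowed(const struct vm_counts *c, enum page_req req)
{
	unsigned avail = c->free_count + c->cache_count;

	assert(req == REQ_NORMAL || req == REQ_SYSTEM || req == REQ_INTERRUPT);
	return (avail > c->free_reserved ||
	    (req == REQ_SYSTEM && avail > c->interrupt_free_min) ||
	    (req == REQ_INTERRUPT && avail > 0));
}

/* Mapping as quoted in the thread: plain M_NOWAIT digs to the last page. */
static enum page_req
req_current(int flags)
{
	if ((flags & (MF_NOWAIT | MF_USE_RESERVE)) == MF_NOWAIT)
		return (REQ_INTERRUPT);
	return (REQ_SYSTEM);
}

/*
 * Assumed reading of the proposed flip: only an explicit reserve flag
 * digs deep; M_NOWAIT alone behaves like M_WAITOK with respect to reserves.
 */
static enum page_req
req_flipped(int flags)
{
	if ((flags & MF_USE_RESERVE) != 0)
		return (REQ_INTERRUPT);
	return (REQ_SYSTEM);
}
```

With one page left and an interrupt reserve of two pages, the model grants a plain no-wait request under the current mapping and refuses it under the flipped one, which is the behaviour change being tested for deadlocks.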


