Date: Tue, 13 Nov 2012 12:13:33 -0800
From: Adrian Chadd <adrian@freebsd.org>
To: Alan Cox <alc@rice.edu>
Cc: mips@freebsd.org, "Sears, Steven" <Steven.Sears@netapp.com>, alc@freebsd.org,
    "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, pho@freebsd.org,
    Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: Memory reserves or lack thereof
Message-ID: <CAJ-VmonAAXXm90wbcVi=TLv-XEDkzAY7aNc=WWXh5KHDiCvaQQ@mail.gmail.com>
In-Reply-To: <50A2A7B2.2020302@rice.edu>
References: <A6DE036C6A90C949A25CE89E844237FB2086970A@SACEXCMBX01-PRD.hq.netapp.com>
    <20121110132019.GP73505@kib.kiev.ua>
    <CAJUyCcOKHH3TO6qaK9V7UY2HW%2Bp6T74DUUdmbSi4eeGyofrTdQ@mail.gmail.com>
    <20121112133638.GZ73505@kib.kiev.ua>
    <50A1336E.5040401@rice.edu>
    <50A2A7B2.2020302@rice.edu>
Hey, great catch!

adrian

On 13 November 2012 12:04, Alan Cox <alc@rice.edu> wrote:
> On 11/12/2012 11:35, Alan Cox wrote:
>> On 11/12/2012 07:36, Konstantin Belousov wrote:
>>> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>>>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov <kostikbel@gmail.com> wrote:
>>>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>>>> I have a memory subsystem design question that I'm hoping someone can answer.
>>>>>>
>>>>>> I've been looking at a machine that is completely out of memory, as in
>>>>>>
>>>>>>     v_free_count = 0,
>>>>>>     v_cache_count = 0,
>>>>>>
>>>>>> I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating.
>>>>>>
>>>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>>>
>>>>>>     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>>>             page_req = VM_ALLOC_SYSTEM;
>>>>>>     };
>>>>>>
>>>>>>     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>>>         (page_req == VM_ALLOC_SYSTEM &&
>>>>>>         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
>>>>>>         (page_req == VM_ALLOC_INTERRUPT &&
>>>>>>         cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>>>
>>>>>> The key observation is that if VM_ALLOC_INTERRUPT is set, it will allocate every last page.
>>>>>>
>>>>>> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so; see kmem_malloc() or uma_small_alloc(), which both contain this mapping:
>>>>>>
>>>>>>     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>>>             pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>>>     else
>>>>>>             pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>>>
>>>>>> Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines.
>>>>>>
>>>>>> What this means is that essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available.
>>>>>>
>>>>>> There is no documentation stating that M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is used literally all over.
>>>>>>
>>>>>> It looks to me like the design goal of the BSD allocators is recovery; they will give all pages away, knowing they can recover.
>>>>>>
>>>>>> Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory.
>>>>>
>>>>> Your analysis is right, there is nothing to add or correct.
>>>>> This is the reason to strongly prefer M_WAITOK.
>>>>
>>>> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers.
>>>>
>>>> The trouble is that M_NOWAIT conflates two orthogonal things. The obvious one is that the allocation shouldn't sleep. The other is how far we're willing to deplete the cache/free page queues.
>>>>
>>>> When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before.
>>>>
>>>> This had the unintended consequence of introducing a lot of memory allocations in the top half of the kernel, i.e., non-interrupt-handling code, that were digging deep into the cache/free page queues.
>>>>
>>>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE" allocation is less likely to succeed than an "M_NOWAIT" allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page.
>>>>
>>>> In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does, and reintroduce an M_USE_RESERVE-like flag that says to dig deep into the cache/free page queues. The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues, so that we can add an explicit M_USE_RESERVE.
>>>>
>>>> Alan
>>>>
>>>> P.S. I suspect that we should also increase the size of the "page reserve" that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?
>>>
>>> I am almost sure that most of the people who use the M_NOWAIT flag do not know about its 'allow a deeper drain of the free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only expectation of problematic places would be in the swapout path.
>>>
>>> I found a single explicit use of M_USE_RESERVE in the kernel, so the flip is relatively simple.
>>
>> Agreed. Most recently I eliminated several uses from the arm pmap implementations. There is, however, one other use:
>>
>> ofed/include/linux/gfp.h:#define GFP_ATOMIC	(M_NOWAIT | M_USE_RESERVE)
>>
>>> Below is the patch, which I only compile-tested on amd64, and which booted fine.
>>>
>>> Peter, could you please give it a run, to see obvious deadlocks, if any?
>>>
>>> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
>>> index dc9c307..ab1e869 100644
>>> --- a/sys/amd64/amd64/uma_machdep.c
>>> +++ b/sys/amd64/amd64/uma_machdep.c
>>> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>>>
>>>  #include <sys/param.h>
>>>  #include <sys/lock.h>
>>> +#include <sys/malloc.h>
>>>  #include <sys/mutex.h>
>>>  #include <sys/systm.h>
>>>  #include <vm/vm.h>
>>> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>  	int pflags;
>>>
>>>  	*flags = UMA_SLAB_PRIV;
>>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>>> -	else
>>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>>> -	if (wait & M_ZERO)
>>> -		pflags |= VM_ALLOC_ZERO;
>>> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>>>  	for (;;) {
>>>  		m = vm_page_alloc(NULL, 0, pflags);
>>>  		if (m == NULL) {
>>> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
>>> index f60cdb1..75366e3 100644
>>> --- a/sys/arm/arm/vm_machdep.c
>>> +++ b/sys/arm/arm/vm_machdep.c
>>> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>  		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>>>  		return (ret);
>>>  	}
>>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>> -	else
>>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>> -	if (wait & M_ZERO)
>>> -		pflags |= VM_ALLOC_ZERO;
>>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>>  	for (;;) {
>>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>>  		if (m == NULL) {
>>> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
>>> index 71caa29..2ce1ca6 100644
>>> --- a/sys/fs/devfs/devfs_devs.c
>>> +++ b/sys/fs/devfs/devfs_devs.c
>>> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>>>  	struct cdev *cdev;
>>>  	struct timespec ts;
>>>
>>> -	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
>>> +	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>>>  	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>>>  	if (cdp == NULL)
>>>  		return (NULL);
>>> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
>>> index 37353ff..9f77762 100644
>>> --- a/sys/ia64/ia64/uma_machdep.c
>>> +++ b/sys/ia64/ia64/uma_machdep.c
>>> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>  	int pflags;
>>>
>>>  	*flags = UMA_SLAB_PRIV;
>>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>> -	else
>>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>> -	if (wait & M_ZERO)
>>> -		pflags |= VM_ALLOC_ZERO;
>>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>>
>>>  	for (;;) {
>>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
>>> index 798e632..24baef0 100644
>>> --- a/sys/mips/mips/uma_machdep.c
>>> +++ b/sys/mips/mips/uma_machdep.c
>>> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>>  	void *va;
>>>
>>>  	*flags = UMA_SLAB_PRIV;
>>> -
>>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>> -		pflags = VM_ALLOC_INTERRUPT;
>>> -	else
>>> -		pflags = VM_ALLOC_SYSTEM;
>>> +	pflags = m2vm_flags(wait, 0);
>>>
>>>  	for (;;) {
>>>  		m = pmap_alloc_direct_page(0, pflags);
>>
>> This smells fishy, but not because of anything that you did.
>> It appears that the mips uma_small_alloc() is unconditionally asking for a pre-zeroed page. I'll take a look at this later today.
>
> I verified this. The current implementation of uma_small_alloc() on MIPS unconditionally zeroes the page. Moreover, if M_ZERO is specified to uma_small_alloc(), the same page is zeroed twice, once in pmap_alloc_direct_page() and again in uma_small_alloc().
>
> I expect to commit the following patch tomorrow. Kostik, it will trivially conflict with your current patch.
>
> Index: mips/include/pmap.h
> ===================================================================
> --- mips/include/pmap.h	(revision 242939)
> +++ mips/include/pmap.h	(working copy)
> @@ -179,7 +179,6 @@ void pmap_kenter_temporary_free(vm_paddr_t pa);
>  void pmap_flush_pvcache(vm_page_t m);
>  int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
>  void pmap_grow_direct_page_cache(void);
> -vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
>
>  #endif /* _KERNEL */
>
> Index: mips/mips/pmap.c
> ===================================================================
> --- mips/mips/pmap.c	(revision 242939)
> +++ mips/mips/pmap.c	(working copy)
> @@ -163,6 +163,7 @@ static vm_page_t pmap_pv_reclaim(pmap_t locked_pma
>  static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
>  static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
> +static vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
>  static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot, vm_page_t mpte);
>  static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va,
> @@ -1041,7 +1042,7 @@ pmap_grow_direct_page_cache()
>  #endif
>  }
>
> -vm_page_t
> +static vm_page_t
>  pmap_alloc_direct_page(unsigned int index, int req)
>  {
>  	vm_page_t m;
> Index: mips/mips/uma_machdep.c
> ===================================================================
> --- mips/mips/uma_machdep.c	(revision 242939)
> +++ mips/mips/uma_machdep.c	(working copy)
> @@ -50,12 +50,14 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8
>  	*flags = UMA_SLAB_PRIV;
>
>  	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> -		pflags = VM_ALLOC_INTERRUPT;
> +		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>  	else
> -		pflags = VM_ALLOC_SYSTEM;
> +		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> +	if (wait & M_ZERO)
> +		pflags |= VM_ALLOC_ZERO;
>
>  	for (;;) {
> -		m = pmap_alloc_direct_page(0, pflags);
> +		m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, pflags);
>  		if (m == NULL) {
>  			if (wait & M_NOWAIT)
>  				return (NULL);
>
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
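A note on Kostik's patch earlier in the thread: the m2vm_flags() helper that its new code calls is not defined in any of the quoted hunks. A minimal sketch of what such a helper might look like, assuming the flipped M_NOWAIT/M_USE_RESERVE semantics he describes (the name and argument list are taken from the call sites; the body is an assumption, not the committed code):

    /*
     * Sketch only: translate malloc(9) wait flags into vm_page_alloc()
     * request flags.  Under the proposed semantics, plain M_NOWAIT no
     * longer digs deeper into the cache/free queues than M_WAITOK;
     * only M_USE_RESERVE selects the deep VM_ALLOC_INTERRUPT reserve.
     */
    static inline int
    m2vm_flags(int malloc_flags, int extra_flags)
    {
            int pflags;

            if ((malloc_flags & M_USE_RESERVE) != 0)
                    pflags = VM_ALLOC_INTERRUPT;    /* may take the last pages */
            else
                    pflags = VM_ALLOC_SYSTEM;       /* stops at the reserve */
            if ((malloc_flags & M_ZERO) != 0)
                    pflags |= VM_ALLOC_ZERO;
            return (pflags | extra_flags);
    }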
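For callers, the practical effect of the flip Alan and Kostik discuss is that digging into the reserves becomes opt-in rather than an accidental side effect of M_NOWAIT. A hypothetical driver fragment illustrating the two spellings (struct foo_softc and M_FOODEV are invented for illustration):

    struct foo_softc *sc;

    /*
     * Ordinary top-half allocation: cannot sleep, but under the
     * proposed semantics it no longer dips below the free-page
     * reserve; it fails instead of taking the last pages.
     */
    sc = malloc(sizeof(*sc), M_FOODEV, M_NOWAIT | M_ZERO);

    /*
     * A genuinely critical, non-sleepable path that is entitled to
     * the deep reserve must now say so explicitly.
     */
    sc = malloc(sizeof(*sc), M_FOODEV, M_NOWAIT | M_USE_RESERVE | M_ZERO);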