From: Alan Cox <alc@rice.edu>
Date: Tue, 13 Nov 2012 14:04:02 -0600
To: "freebsd-hackers@freebsd.org"
Cc: Konstantin Belousov, alc@freebsd.org, mips@freebsd.org, "Sears, Steven", pho@freebsd.org
Subject: Re: Memory reserves or lack thereof
Message-ID: <50A2A7B2.2020302@rice.edu>
In-Reply-To: <50A1336E.5040401@rice.edu>
References: <20121110132019.GP73505@kib.kiev.ua> <20121112133638.GZ73505@kib.kiev.ua> <50A1336E.5040401@rice.edu>

On 11/12/2012 11:35, Alan Cox wrote:
> On 11/12/2012 07:36, Konstantin Belousov wrote:
>> On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
>>> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov wrote:
>>>
>>>> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
>>>>> I have a memory subsystem design question that I'm hoping someone
>>>>> can answer.
>>>>>
>>>>> I've been looking at a machine that is completely out of memory, as in
>>>>>
>>>>> v_free_count = 0,
>>>>> v_cache_count = 0,
>>>>>
>>>>> I wondered how a machine could completely run out of memory like
>>>>> this, especially after finding a lack of interrupt storms or other
>>>>> pathologies that would tend to overcommit memory. So I started
>>>>> investigating.
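As an aside, the two counters quoted above are exported to userland through
sysctl, so a depleted machine can be sampled without kernel debugging. Below
is a minimal sketch using sysctlbyname(3); the OID names
vm.stats.vm.v_free_count and vm.stats.vm.v_cache_count are assumed to exist
as on a stock kernel of this vintage.

/*
 * Illustrative sketch only: print the free and cache page counts that
 * the discussion above refers to.
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	u_int free_count, cache_count;
	size_t len;

	len = sizeof(free_count);
	if (sysctlbyname("vm.stats.vm.v_free_count", &free_count, &len,
	    NULL, 0) == -1)
		return (1);
	len = sizeof(cache_count);
	if (sysctlbyname("vm.stats.vm.v_cache_count", &cache_count, &len,
	    NULL, 0) == -1)
		return (1);
	printf("v_free_count=%u v_cache_count=%u\n", free_count, cache_count);
	return (0);
}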
>>>>> Most allocators come down to vm_page_alloc(), which has this guard:
>>>>>
>>>>> 	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
>>>>> 		page_req = VM_ALLOC_SYSTEM;
>>>>> 	};
>>>>>
>>>>> 	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
>>>>> 	    (page_req == VM_ALLOC_SYSTEM &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
>>>>> 	    (page_req == VM_ALLOC_INTERRUPT &&
>>>>> 	    cnt.v_free_count + cnt.v_cache_count > 0)) {
>>>>>
>>>>> The key observation is that if VM_ALLOC_INTERRUPT is set, it will
>>>>> allocate every last page.
>>>>>
>>>>> From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
>>>>> perhaps only used from interrupt threads. Not so; see kmem_malloc() or
>>>>> uma_small_alloc(), which both contain this mapping:
>>>>>
>>>>> 	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>>>>> 		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>>>>> 	else
>>>>> 		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>>>>>
>>>>> Note that M_USE_RESERVE has been deprecated and is used in just a
>>>>> handful of places. Also note that lots of code paths come through
>>>>> these routines.
>>>>>
>>>>> What this means is that essentially _any_ allocation using M_NOWAIT
>>>>> will bypass whatever reserves have been held back and will take every
>>>>> last page available.
>>>>>
>>>>> There is no documentation stating that M_NOWAIT has this side effect
>>>>> of essentially being privileged, so any innocuous piece of code that
>>>>> can't block will use it. And of course M_NOWAIT is literally used all
>>>>> over.
>>>>>
>>>>> It looks to me like the design goal of the BSD allocators is recovery;
>>>>> they will give all pages away, knowing they can recover.
>>>>>
>>>>> Am I missing anything? I would have expected some small number of
>>>>> pages to be held in reserve just in case. And I didn't expect M_NOWAIT
>>>>> to be a sort of back door for grabbing memory.
>>>> Your analysis is right, there is nothing to add or correct.
>>>> This is the reason to strongly prefer M_WAITOK.
>>>>
>>> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was
>>> well understood that it should only be used by interrupt handlers.
>>>
>>> The trouble is that M_NOWAIT conflates two orthogonal things. The
>>> obvious one being that the allocation shouldn't sleep. The other being
>>> how far we're willing to deplete the cache/free page queues.
>>>
>>> When fine-grained locking got sprinkled throughout the kernel, we all
>>> too often found ourselves wanting to do allocations without the
>>> possibility of blocking. So, M_NOWAIT became commonplace, where it
>>> wasn't before.
>>>
>>> This had the unintended consequence of introducing a lot of memory
>>> allocations in the top half of the kernel, i.e., non-interrupt handling
>>> code, that were digging deep into the cache/free page queues.
>>>
>>> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
>>> allocation is less likely to succeed than an "M_NOWAIT" allocation.
>>> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page;
>>> it could only allocate a free page. M_USE_RESERVE said that it was ok to
>>> allocate a cached page even though M_NOWAIT was specified. Consequently,
>>> the system wouldn't dig as far into the free page queue if M_USE_RESERVE
>>> was specified, because it was allowed to reclaim a cached page.
>>>
>>> In conclusion, I think it's time that we change M_NOWAIT so that it
>>> doesn't dig any deeper into the cache/free page queues than M_WAITOK
>>> does and reintroduce an M_USE_RESERVE-like flag that says dig deep into
>>> the cache/free page queues. The trouble is that we then need to identify
>>> all of those places that are implicitly depending on the current
>>> behavior of M_NOWAIT also digging deep into the cache/free page queues
>>> so that we can add an explicit M_USE_RESERVE.
>>>
>>> Alan
>>>
>>> P.S. I suspect that we should also increase the size of the "page
>>> reserve" that is kept for VM_ALLOC_INTERRUPT allocations in
>>> vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like
>>> flag in today's kernel could actually be satisfied by two pages?
>> I am almost sure that most of the people who use the M_NOWAIT flag do not
>> know about the 'allow the deeper drain of free queue' effect. As such, I
>> believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
>> expectation of problematic places would be in the swapout path.
>>
>> I found a single explicit use of M_USE_RESERVE in the kernel,
>> so the flip is relatively simple.
> Agreed. Most recently I eliminated several uses from the arm pmap
> implementations. There is, however, one other use:
>
> ofed/include/linux/gfp.h:#define GFP_ATOMIC (M_NOWAIT | M_USE_RESERVE)
>
>> Below is the patch, which I only compile-tested on amd64, and which
>> booted fine.
>>
>> Peter, could you please give it a run, to see obvious deadlocks, if any?
>>
>> diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
>> index dc9c307..ab1e869 100644
>> --- a/sys/amd64/amd64/uma_machdep.c
>> +++ b/sys/amd64/amd64/uma_machdep.c
>> @@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
>>  
>>  #include
>>  #include
>> +#include
>>  #include
>>  #include
>>  #include
>> @@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	int pflags;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags);
>>  		if (m == NULL) {
>> diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
>> index f60cdb1..75366e3 100644
>> --- a/sys/arm/arm/vm_machdep.c
>> +++ b/sys/arm/arm/vm_machdep.c
>> @@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
>>  		return (ret);
>>  	}
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>>  		if (m == NULL) {
>> diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
>> index 71caa29..2ce1ca6 100644
>> --- a/sys/fs/devfs/devfs_devs.c
>> +++ b/sys/fs/devfs/devfs_devs.c
>> @@ -121,7 +121,7 @@ devfs_alloc(int flags)
>>  	struct cdev *cdev;
>>  	struct timespec ts;
>>  
>> -	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
>> +	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
>>  	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
>>  	if (cdp == NULL)
>>  		return (NULL);
>> diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
>> index 37353ff..9f77762 100644
>> --- a/sys/ia64/ia64/uma_machdep.c
>> +++ b/sys/ia64/ia64/uma_machdep.c
>> @@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	int pflags;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
>> -	if (wait & M_ZERO)
>> -		pflags |= VM_ALLOC_ZERO;
>> +	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
>>  
>>  	for (;;) {
>>  		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
>> diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
>> index 798e632..24baef0 100644
>> --- a/sys/mips/mips/uma_machdep.c
>> +++ b/sys/mips/mips/uma_machdep.c
>> @@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
>>  	void *va;
>>  
>>  	*flags = UMA_SLAB_PRIV;
>> -
>> -	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
>> -		pflags = VM_ALLOC_INTERRUPT;
>> -	else
>> -		pflags = VM_ALLOC_SYSTEM;
>> +	pflags = m2vm_flags(wait, 0);
>>  
>>  	for (;;) {
>>  		m = pmap_alloc_direct_page(0, pflags);
>
> This smells fishy, but not because of anything that you did. It appears
> that the mips uma_small_alloc() is unconditionally asking for a
> pre-zeroed page. I'll take a look at this later today.
>

I verified this. The current implementation of uma_small_alloc() on MIPS
unconditionally zeroes the page. Moreover, if M_ZERO is specified to
uma_small_alloc(), the same page is zeroed twice, once in
pmap_alloc_direct_page() and again in uma_small_alloc().

I expect to commit the following patch tomorrow. Kostik, it will trivially
conflict with your current patch.
Index: mips/include/pmap.h
===================================================================
--- mips/include/pmap.h	(revision 242939)
+++ mips/include/pmap.h	(working copy)
@@ -179,7 +179,6 @@ void pmap_kenter_temporary_free(vm_paddr_t pa);
 void pmap_flush_pvcache(vm_page_t m);
 int pmap_emulate_modified(pmap_t pmap, vm_offset_t va);
 void pmap_grow_direct_page_cache(void);
-vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
 
 #endif				/* _KERNEL */
Index: mips/mips/pmap.c
===================================================================
--- mips/mips/pmap.c	(revision 242939)
+++ mips/mips/pmap.c	(working copy)
@@ -163,6 +163,7 @@ static vm_page_t pmap_pv_reclaim(pmap_t locked_pma
 static void pmap_pvh_free(struct md_page *pvh, pmap_t pmap, vm_offset_t va);
 static pv_entry_t pmap_pvh_remove(struct md_page *pvh, pmap_t pmap,
     vm_offset_t va);
+static vm_page_t pmap_alloc_direct_page(unsigned int index, int req);
 static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static int pmap_remove_pte(struct pmap *pmap, pt_entry_t *ptq, vm_offset_t va,
@@ -1041,7 +1042,7 @@ pmap_grow_direct_page_cache()
 #endif
 }
 
-vm_page_t
+static vm_page_t
 pmap_alloc_direct_page(unsigned int index, int req)
 {
 	vm_page_t m;
Index: mips/mips/uma_machdep.c
===================================================================
--- mips/mips/uma_machdep.c	(revision 242939)
+++ mips/mips/uma_machdep.c	(working copy)
@@ -50,12 +50,14 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8
 
 	*flags = UMA_SLAB_PRIV;
 	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT;
+		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
 	else
-		pflags = VM_ALLOC_SYSTEM;
+		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
+	if (wait & M_ZERO)
+		pflags |= VM_ALLOC_ZERO;
 
 	for (;;) {
-		m = pmap_alloc_direct_page(0, pflags);
+		m = vm_page_alloc_freelist(VM_FREELIST_DIRECT, pflags);
 		if (m == NULL) {
 			if (wait & M_NOWAIT)
 				return (NULL);
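Kostik's patch quoted earlier relies on an m2vm_flags() helper whose
definition is not included in the quoted hunks. The sketch below is a
hypothetical reconstruction, not the actual patch: it merely shows one shape
such a helper could take under the proposed semantics, where only an explicit
M_USE_RESERVE selects the deep VM_ALLOC_INTERRUPT reserve and plain M_NOWAIT
no longer does. The second parameter matches the way the callers in the patch
pass extra VM_ALLOC_* bits such as VM_ALLOC_WIRED.

/*
 * Hypothetical sketch of m2vm_flags(); the real definition is not shown
 * in the thread.  In practice it would live in a VM header such as
 * vm/vm_page.h (placement assumed here).
 */
#include <sys/param.h>
#include <sys/malloc.h>
#include <vm/vm.h>
#include <vm/vm_page.h>

static inline int
m2vm_flags(int malloc_flags, int extra_flags)
{
	int pflags;

	/* Dig into the interrupt reserve only on explicit request. */
	pflags = (malloc_flags & M_USE_RESERVE) != 0 ?
	    VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM;
	if ((malloc_flags & M_ZERO) != 0)
		pflags |= VM_ALLOC_ZERO;
	return (pflags | extra_flags);
}

With a helper like this, a caller such as uma_small_alloc() collapses its
flag translation to a single line, e.g. pflags = m2vm_flags(wait,
VM_ALLOC_WIRED), as the quoted hunks do.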