From: Alan Cox
To: Konstantin Belousov
Cc: "freebsd-hackers@freebsd.org", "Sears, Steven"
Reply-To: alc@freebsd.org
Date: Sun, 11 Nov 2012 15:40:24 -0600
Subject: Re: Memory reserves or lack thereof
In-Reply-To: <20121110132019.GP73505@kib.kiev.ua>
References: <20121110132019.GP73505@kib.kiev.ua>

On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov wrote:

> On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
> > I have a memory subsystem design question that I'm hoping someone can
> > answer.
> >
> > I've been looking at a machine that is completely out of memory, as in
> >
> >     v_free_count = 0,
> >     v_cache_count = 0,
> >
> > I wondered how a machine could completely run out of memory like this,
> > especially after finding a lack of interrupt storms or other pathologies
> > that would tend to overcommit memory. So I started investigating.
> >
> > Most allocators come down to vm_page_alloc(), which has this guard:
> >
> >     if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> >         page_req = VM_ALLOC_SYSTEM;
> >     };
> >
> >     if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> >         (page_req == VM_ALLOC_SYSTEM &&
> >         cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> >         (page_req == VM_ALLOC_INTERRUPT &&
> >         cnt.v_free_count + cnt.v_cache_count > 0)) {
> >
> > The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate
> > every last page.
> >
> > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
> > perhaps only used from interrupt threads. Not so, see kmem_malloc() or
> > uma_small_alloc() which both contain this mapping:
> >
> >     if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> >         pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> >     else
> >         pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> >
> > Note that M_USE_RESERVE has been deprecated and is used in just a
> > handful of places. Also note that lots of code paths come through these
> > routines.
> >
> > What this means is essentially _any_ allocation using M_NOWAIT will
> > bypass whatever reserves have been held back and will take every last
> > page available.
> >
> > There is no documentation stating M_NOWAIT has this side effect of
> > essentially being privileged, so any innocuous piece of code that can't
> > block will use it. And of course M_NOWAIT is literally used all over.
> >
> > It looks to me like the design goal of the BSD allocators is on
> > recovery; it will give all pages away knowing it can recover.
> >
> > Am I missing anything? I would have expected some small number of pages
> > to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
> > sort of back door for grabbing memory.
> >
>
> Your analysis is right, there is nothing to add or correct.
> This is the reason to strongly prefer M_WAITOK.
>

Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was
well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things. The obvious
one is that the allocation shouldn't sleep. The other is how far we're
willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too
often found ourselves wanting to do allocations without the possibility of
blocking. So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory
allocations in the top-half of the kernel, i.e., non-interrupt handling
code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
allocation is less likely to succeed than an "M_NOWAIT" allocation.
However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
could only allocate a free page. M_USE_RESERVE said that it was ok to
allocate a cached page even though M_NOWAIT was specified. Consequently,
the system wouldn't dig as far into the free page queue if M_USE_RESERVE
was specified, because it was allowed to reclaim a cached page.

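For readers following along, the guard and the flag mapping quoted above can be modeled in userspace. This is only a sketch under stated assumptions, not kernel code: the enum, the struct, the MODEL_* flags, and the functions map_flags(), may_allocate() and drain() are all hypothetical names, and the request class called REQ_NORMAL here simply stands for "neither VM_ALLOC_SYSTEM nor VM_ALLOC_INTERRUPT". The curproc == pageproc special case is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; not the kernel's definitions. */
enum page_req { REQ_NORMAL, REQ_SYSTEM, REQ_INTERRUPT };

struct counters {
    unsigned free_count;         /* models cnt.v_free_count */
    unsigned cache_count;        /* models cnt.v_cache_count */
    unsigned free_reserved;      /* models cnt.v_free_reserved */
    unsigned interrupt_free_min; /* models cnt.v_interrupt_free_min */
};

#define MODEL_NOWAIT      0x1    /* stands in for M_NOWAIT */
#define MODEL_USE_RESERVE 0x2    /* stands in for M_USE_RESERVE */

/* Same shape as the quoted kmem_malloc()/uma_small_alloc() mapping. */
static enum page_req map_flags(int flags)
{
    if ((flags & (MODEL_NOWAIT | MODEL_USE_RESERVE)) == MODEL_NOWAIT)
        return REQ_INTERRUPT;
    return REQ_SYSTEM;
}

/* Same shape as the quoted vm_page_alloc() guard. */
static bool may_allocate(const struct counters *c, enum page_req req)
{
    unsigned avail = c->free_count + c->cache_count;

    return avail > c->free_reserved ||
        (req == REQ_SYSTEM && avail > c->interrupt_free_min) ||
        (req == REQ_INTERRUPT && avail > 0);
}

/* Allocate until the guard refuses; return how many pages are left. */
static unsigned drain(struct counters c, enum page_req req)
{
    while (may_allocate(&c, req)) {
        /* Every branch of the guard implies at least one page exists. */
        assert(c.free_count + c.cache_count > 0);
        if (c.free_count > 0)
            c.free_count--;
        else
            c.cache_count--;
    }
    return c.free_count + c.cache_count;
}
```

With made-up counters of 100 free pages, 20 cached pages, a free reserve of 50 and an interrupt minimum of 2, drain() leaves 50 pages for REQ_NORMAL, 2 for REQ_SYSTEM and 0 for REQ_INTERRUPT. Since map_flags(MODEL_NOWAIT) yields REQ_INTERRUPT while map_flags(MODEL_NOWAIT | MODEL_USE_RESERVE) yields REQ_SYSTEM, the model also shows why the combined flags stop two pages earlier than plain M_NOWAIT.
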
In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
dig any deeper into the cache/free page queues than M_WAITOK does, and
reintroduce an M_USE_RESERVE-like flag that says dig deep into the
cache/free page queues. The trouble is that we then need to identify all of
those places that are implicitly depending on the current behavior of
M_NOWAIT also digging deep into the cache/free page queues so that we can
add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the "page reserve"
that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How
many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
could actually be satisfied by two pages?
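
One possible reading of the proposal above, again as a userspace sketch rather than a patch: the decision "may this allocation sleep?" is separated from "how deep may it dig?". Everything prefixed SK_ or sk_ is a hypothetical name invented for this illustration, including SK_DIG_DEEP, which stands for the not-yet-named M_USE_RESERVE-like flag. The flag values are arbitrary.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for malloc(9) flags; values are arbitrary. */
#define SK_WAITOK      0x0
#define SK_NOWAIT      0x1
#define SK_USE_RESERVE 0x2  /* models today's deprecated M_USE_RESERVE */
#define SK_DIG_DEEP    0x4  /* invented name for the proposed new flag */

/* How far into the cache/free page queues a request may reach. */
enum sk_depth { SK_DEPTH_SYSTEM, SK_DEPTH_INTERRUPT };

struct sk_request {
    bool may_sleep;
    enum sk_depth depth;
};

/* Current behavior: M_NOWAIT alone selects the deepest class. */
static struct sk_request sk_current(int flags)
{
    struct sk_request r;

    r.may_sleep = (flags & SK_NOWAIT) == 0;
    r.depth = ((flags & (SK_NOWAIT | SK_USE_RESERVE)) == SK_NOWAIT) ?
        SK_DEPTH_INTERRUPT : SK_DEPTH_SYSTEM;
    return r;
}

/* Proposed behavior: depth depends only on the explicit flag. */
static struct sk_request sk_proposed(int flags)
{
    struct sk_request r;

    r.may_sleep = (flags & SK_NOWAIT) == 0;
    r.depth = (flags & SK_DIG_DEEP) ?
        SK_DEPTH_INTERRUPT : SK_DEPTH_SYSTEM;
    return r;
}
```

Under this sketch, sk_proposed(SK_NOWAIT) and sk_proposed(SK_WAITOK) reach the same depth and differ only in may_sleep, while a caller that really needs the last pages has to say so with SK_NOWAIT | SK_DIG_DEEP. The audit mentioned above corresponds to finding every caller for which sk_current() and sk_proposed() disagree and deciding whether it needs the explicit flag.
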