From owner-freebsd-hackers@FreeBSD.ORG Mon Nov 12 13:36:54 2012
Date: Mon, 12 Nov 2012 15:36:38 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: alc@freebsd.org
Cc: pho@freebsd.org, "Sears, Steven", "freebsd-hackers@freebsd.org"
Subject: Re: Memory reserves or lack thereof
Message-ID: <20121112133638.GZ73505@kib.kiev.ua>
References: <20121110132019.GP73505@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
> On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov wrote:
> > On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
> > > I have a memory subsystem design question that I'm hoping someone
> > > can answer.
> > >
> > > I've been looking at a machine that is completely out of memory,
> > > as in
> > >
> > > 	v_free_count = 0,
> > > 	v_cache_count = 0,
> > >
> > > I wondered how a machine could completely run out of memory like
> > > this, especially after finding a lack of interrupt storms or other
> > > pathologies that would tend to overcommit memory. So I started
> > > investigating.
> > >
> > > Most allocators come down to vm_page_alloc(), which has this guard:
> > >
> > > 	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
> > > 		page_req = VM_ALLOC_SYSTEM;
> > > 	};
> > >
> > > 	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
> > > 	    (page_req == VM_ALLOC_SYSTEM &&
> > > 	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
> > > 	    (page_req == VM_ALLOC_INTERRUPT &&
> > > 	    cnt.v_free_count + cnt.v_cache_count > 0)) {
> > >
> > > The key observation is that if VM_ALLOC_INTERRUPT is set, it will
> > > allocate every last page.
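(An editorial restatement of that guard, added for clarity; it is not
part of the original mail. The predicate name vm_page_count_ok() is
invented, while the thresholds are the stock struct vmmeter fields.)

	/*
	 * Sketch: the vm_page_alloc() admission check above as a
	 * standalone predicate.  Three request classes, each allowed
	 * to dig deeper into the free pages than the previous one.
	 */
	static int
	vm_page_count_ok(int page_req)
	{
		int avail = cnt.v_free_count + cnt.v_cache_count;

		if (avail > cnt.v_free_reserved)
			return (1);	/* normal requests stop here */
		if (page_req == VM_ALLOC_SYSTEM &&
		    avail > cnt.v_interrupt_free_min)
			return (1);	/* system requests dig deeper */
		if (page_req == VM_ALLOC_INTERRUPT && avail > 0)
			return (1);	/* interrupt requests take every last page */
		return (0);
	}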
> > > From the name one might expect VM_ALLOC_INTERRUPT to be somewhat
> > > rare, perhaps only used from interrupt threads. Not so; see
> > > kmem_malloc() or uma_small_alloc(), which both contain this mapping:
> > >
> > > 	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
> > > 		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
> > > 	else
> > > 		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
> > >
> > > Note that M_USE_RESERVE has been deprecated and is used in just a
> > > handful of places. Also note that lots of code paths come through
> > > these routines.
> > >
> > > What this means is that essentially _any_ allocation using M_NOWAIT
> > > will bypass whatever reserves have been held back and will take
> > > every last page available.
> > >
> > > There is no documentation stating that M_NOWAIT has this side
> > > effect of essentially being privileged, so any innocuous piece of
> > > code that can't block will use it. And of course M_NOWAIT is used
> > > literally all over.
> > >
> > > It looks to me like the design goal of the BSD allocators is
> > > recovery: they will give all pages away, knowing they can recover.
> > >
> > > Am I missing anything? I would have expected some small number of
> > > pages to be held in reserve just in case. And I didn't expect
> > > M_NOWAIT to be a sort of back door for grabbing memory.
> >
> > Your analysis is right, there is nothing to add or correct.
> > This is the reason to strongly prefer M_WAITOK.
>
> Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It
> was well understood that it should only be used by interrupt handlers.
>
> The trouble is that M_NOWAIT conflates two orthogonal things. The
> obvious one is that the allocation shouldn't sleep. The other is how
> far we're willing to deplete the cache/free page queues.
>
> When fine-grained locking got sprinkled throughout the kernel, we all
> too often found ourselves wanting to do allocations without the
> possibility of blocking. So M_NOWAIT became commonplace where it
> wasn't before.
>
> This had the unintended consequence of introducing a lot of memory
> allocations in the top half of the kernel, i.e., non-interrupt
> handling code, that were digging deep into the cache/free page queues.
>
> Also, ironically, in today's kernel an "M_NOWAIT | M_USE_RESERVE"
> allocation is less likely to succeed than an "M_NOWAIT" allocation.
> However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached
> page; it could only allocate a free page. M_USE_RESERVE said that it
> was OK to allocate a cached page even though M_NOWAIT was specified.
> Consequently, the system wouldn't dig as far into the free page queue
> if M_USE_RESERVE was specified, because it was allowed to reclaim a
> cached page.
>
> In conclusion, I think it's time that we change M_NOWAIT so that it
> doesn't dig any deeper into the cache/free page queues than M_WAITOK
> does, and reintroduce an M_USE_RESERVE-like flag that says to dig
> deep into the cache/free page queues. The trouble is that we then
> need to identify all of those places that are implicitly depending on
> the current behavior of M_NOWAIT also digging deep into the
> cache/free page queues, so that we can add an explicit M_USE_RESERVE.
>
> Alan
>
> P.S. I suspect that we should also increase the size of the "page
> reserve" that is kept for VM_ALLOC_INTERRUPT allocations in
> vm_page_alloc*(). How many legitimate users of a new
> M_USE_RESERVE-like flag in today's kernel could actually be satisfied
> by two pages?
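(A sketch to make the conflation concrete; it is not part of the
original mail, and the foo_* names are invented. Both callers below
are forced into M_NOWAIT because they cannot sleep, yet under the
current mapping both are classed VM_ALLOC_INTERRUPT and may take the
very last page.)

	/* Interrupt handler: sleeping is forbidden, and a claim on
	 * the deep reserve is arguably justified. */
	static void
	foo_intr(void *arg)
	{
		struct foo_pkt *p;

		p = malloc(sizeof(*p), M_DEVBUF, M_NOWAIT);
		if (p == NULL)
			return;		/* drop the packet */
		/* ... */
	}

	/* Top-half path holding a mutex: M_WAITOK could sleep with
	 * the lock held, so M_NOWAIT is used, and today this digs
	 * exactly as deep as the interrupt handler above. */
	static int
	foo_config(struct foo_softc *sc)
	{
		struct foo_req *r;

		mtx_lock(&sc->sc_mtx);
		r = malloc(sizeof(*r), M_DEVBUF, M_NOWAIT);
		if (r == NULL) {
			mtx_unlock(&sc->sc_mtx);
			return (ENOMEM);
		}
		sc->sc_pending = r;
		mtx_unlock(&sc->sc_mtx);
		return (0);
	}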
I am almost sure that most people who pass the M_NOWAIT flag do not
know about its 'allow a deeper drain of the free queue' effect. As
such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE.
My only expectation of problematic places would be in the swapout
path. I found a single explicit use of M_USE_RESERVE in the kernel,
so the flip is relatively simple.

Below is the patch, which I compile-tested only on amd64, where it
also booted fine. Peter, could you please give it a run to catch any
obvious deadlocks?

diff --git a/sys/amd64/amd64/uma_machdep.c b/sys/amd64/amd64/uma_machdep.c
index dc9c307..ab1e869 100644
--- a/sys/amd64/amd64/uma_machdep.c
+++ b/sys/amd64/amd64/uma_machdep.c
@@ -29,6 +29,7 @@ __FBSDID("$FreeBSD$");
 
 #include
 #include
+#include
 #include
 #include
 #include
@@ -48,12 +49,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;
 
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags);
 		if (m == NULL) {
diff --git a/sys/arm/arm/vm_machdep.c b/sys/arm/arm/vm_machdep.c
index f60cdb1..75366e3 100644
--- a/sys/arm/arm/vm_machdep.c
+++ b/sys/arm/arm/vm_machdep.c
@@ -651,12 +651,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 		ret = ((void *)kmem_malloc(kmem_map, bytes, M_NOWAIT));
 		return (ret);
 	}
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
 		if (m == NULL) {
diff --git a/sys/fs/devfs/devfs_devs.c b/sys/fs/devfs/devfs_devs.c
index 71caa29..2ce1ca6 100644
--- a/sys/fs/devfs/devfs_devs.c
+++ b/sys/fs/devfs/devfs_devs.c
@@ -121,7 +121,7 @@ devfs_alloc(int flags)
 	struct cdev *cdev;
 	struct timespec ts;
 
-	cdp = malloc(sizeof *cdp, M_CDEVP, M_USE_RESERVE | M_ZERO |
+	cdp = malloc(sizeof *cdp, M_CDEVP, M_ZERO |
 	    ((flags & MAKEDEV_NOWAIT) ? M_NOWAIT : M_WAITOK));
 	if (cdp == NULL)
 		return (NULL);
diff --git a/sys/ia64/ia64/uma_machdep.c b/sys/ia64/ia64/uma_machdep.c
index 37353ff..9f77762 100644
--- a/sys/ia64/ia64/uma_machdep.c
+++ b/sys/ia64/ia64/uma_machdep.c
@@ -46,12 +46,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;
 
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/mips/mips/uma_machdep.c b/sys/mips/mips/uma_machdep.c
index 798e632..24baef0 100644
--- a/sys/mips/mips/uma_machdep.c
+++ b/sys/mips/mips/uma_machdep.c
@@ -48,11 +48,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	void *va;
 
 	*flags = UMA_SLAB_PRIV;
-
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT;
-	else
-		pflags = VM_ALLOC_SYSTEM;
+	pflags = m2vm_flags(wait, 0);
 
 	for (;;) {
 		m = pmap_alloc_direct_page(0, pflags);
diff --git a/sys/powerpc/aim/mmu_oea64.c b/sys/powerpc/aim/mmu_oea64.c
index a491680..3e320b9 100644
--- a/sys/powerpc/aim/mmu_oea64.c
+++ b/sys/powerpc/aim/mmu_oea64.c
@@ -1369,12 +1369,7 @@ moea64_uma_page_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	*flags = UMA_SLAB_PRIV;
 	needed_lock = !PMAP_LOCKED(kernel_pmap);
 
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/powerpc/aim/slb.c b/sys/powerpc/aim/slb.c
index 162c7fb..3882bfa 100644
--- a/sys/powerpc/aim/slb.c
+++ b/sys/powerpc/aim/slb.c
@@ -483,12 +483,7 @@ slb_uma_real_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	realmax = platform_real_maxaddr();
 
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOOBJ | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_NOOBJ | VM_ALLOC_WIRED);
 
 	for (;;) {
 		m = vm_page_alloc_contig(NULL, 0, pflags, 1, 0, realmax,
diff --git a/sys/powerpc/aim/uma_machdep.c b/sys/powerpc/aim/uma_machdep.c
index 39deb43..23a333f 100644
--- a/sys/powerpc/aim/uma_machdep.c
+++ b/sys/powerpc/aim/uma_machdep.c
@@ -56,12 +56,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	int pflags;
 
 	*flags = UMA_SLAB_PRIV;
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/sparc64/sparc64/vm_machdep.c b/sys/sparc64/sparc64/vm_machdep.c
index cdb94c7..573ab3a 100644
--- a/sys/sparc64/sparc64/vm_machdep.c
+++ b/sys/sparc64/sparc64/vm_machdep.c
@@ -501,14 +501,7 @@ uma_small_alloc(uma_zone_t zone, int bytes, u_int8_t *flags, int wait)
 	PMAP_STATS_INC(uma_nsmall_alloc);
 
 	*flags = UMA_SLAB_PRIV;
-
-	if ((wait & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-
-	if (wait & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(wait, VM_ALLOC_WIRED);
 
 	for (;;) {
 		m = vm_page_alloc(NULL, 0, pflags | VM_ALLOC_NOOBJ);
diff --git a/sys/vm/vm_kern.c b/sys/vm/vm_kern.c
index 46e7f1c..e4c3704 100644
--- a/sys/vm/vm_kern.c
+++ b/sys/vm/vm_kern.c
@@ -222,12 +222,7 @@ kmem_alloc_attr(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
 	vm_object_reference(object);
 	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
 	    VM_PROT_ALL, 0);
-	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
-	if (flags & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
+	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
 	VM_OBJECT_LOCK(object);
 	end_offset = offset + size;
 	for (; offset < end_offset; offset += PAGE_SIZE) {
@@ -296,14 +291,7 @@ kmem_alloc_contig(vm_map_t map, vm_size_t size, int flags, vm_paddr_t low,
 	vm_object_reference(object);
 	vm_map_insert(map, object, offset, addr, addr + size, VM_PROT_ALL,
 	    VM_PROT_ALL, 0);
-	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_NOBUSY;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_NOBUSY;
-	if (flags & M_ZERO)
-		pflags |= VM_ALLOC_ZERO;
-	if (flags & M_NODUMP)
-		pflags |= VM_ALLOC_NODUMP;
+	pflags = m2vm_flags(flags, VM_ALLOC_NOBUSY);
 	VM_OBJECT_LOCK(object);
 	tries = 0;
 retry:
@@ -487,11 +475,7 @@ kmem_back(vm_map_t map, vm_offset_t addr, vm_size_t size, int flags)
 	    entry->wired_count == 0 && (entry->eflags & MAP_ENTRY_IN_TRANSITION) == 0,
 	    ("kmem_back: entry not found or misaligned"));
 
-	if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
-		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
-	else
-		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
-
+	pflags = m2vm_flags(flags, VM_ALLOC_WIRED);
 	if (flags & M_ZERO)
 		pflags |= VM_ALLOC_ZERO;
 	if (flags & M_NODUMP)
diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
index 70b8416..0286a6d 100644
--- a/sys/vm/vm_page.h
+++ b/sys/vm/vm_page.h
@@ -344,6 +344,24 @@ extern struct mtx_padalign vm_page_queue_mtx;
 #define VM_ALLOC_COUNT_SHIFT	16
 #define VM_ALLOC_COUNT(count)	((count) << VM_ALLOC_COUNT_SHIFT)
 
+#ifdef M_NOWAIT
+static inline int
+m2vm_flags(int malloc_flags, int alloc_flags)
+{
+	int pflags;
+
+	if ((malloc_flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
+		pflags = VM_ALLOC_SYSTEM | alloc_flags;
+	else
+		pflags = VM_ALLOC_INTERRUPT | alloc_flags;
+	if (malloc_flags & M_ZERO)
+		pflags |= VM_ALLOC_ZERO;
+	if (malloc_flags & M_NODUMP)
+		pflags |= VM_ALLOC_NODUMP;
+	return (pflags);
+}
+#endif
+
 void vm_page_busy(vm_page_t m);
 void vm_page_flash(vm_page_t m);
 void vm_page_io_start(vm_page_t m);
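(For illustration, not part of the original mail: under the flipped
semantics above, call-site intent would read roughly as below; the
M_TEMP malloc type is chosen arbitrarily.)

	/* Ordinary non-sleeping allocation: with the patch this maps
	 * to VM_ALLOC_SYSTEM and no longer drains the deep reserve. */
	p = malloc(size, M_TEMP, M_NOWAIT);

	/* A path that must make progress for the system to recover
	 * (e.g. swapout): explicitly opts into the deep reserve,
	 * which the patch maps to VM_ALLOC_INTERRUPT. */
	p = malloc(size, M_TEMP, M_NOWAIT | M_USE_RESERVE);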