Date: Fri, 15 Feb 2019 18:13:24 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gleb Smirnoff
cc: Justin Hibbits, src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-head@freebsd.org
Subject: Re: svn commit: r343030 - in head/sys: cam conf dev/md dev/nvme fs/fuse fs/nfsclient fs/smbfs kern sys ufs/ffs vm

On Thu, 14 Feb 2019, Gleb Smirnoff wrote:

> On Wed, Feb 13, 2019 at 07:24:50PM -0600, Justin Hibbits wrote:
> J> This seems to break 32-bit platforms, or at least 32-bit book-e
> J> powerpc, which has a limited KVA space (~500MB). I've seen it
> J> preallocate over 2500 pbufs, at 128kB each, eating up over 300MB of
> J> KVA, leaving very little left for the rest of runtime.
> J>
> J> I spent a couple hours earlier today debugging with Mark Johnston,
> J> and his consensus is that the vnode_pbuf_zone is too big on 32-bit
> J> platforms. Unfortunately I know very little about this area, so
> J> can't provide much extra insight, but can readily reproduce the
> J> issues I see triggered by this change, so am willing to help where
> J> I can.
>
> Ok, let's roll back to the old default on 32-bit platforms and somewhat
> reduce the default on 64-bits.

This reduces the largest allocation by a factor of 16 on 32-bit arches
(back to where it was), but it leaves the other allocations unchanged,
so the total allocation is still almost 5 times larger than before
(down from 20 times larger). E.g., with the usual limit of 256 on
nswbuf, the total allocation was 32MB with overcommit by a factor of
about 5/2 on all systems, but it is now almost 80MB with no overcommit
on 32-bit systems.
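
To make the arithmetic explicit, here is a back-of-the-envelope sketch
(not from the patch; the nswbuf limit of 256, the 128kB pbuf size and
the 5/2 overcommit factor are just the figures quoted above):

#include <stdio.h>

int
main(void)
{
	size_t nswbuf = 256;			/* usual nswbuf limit */
	size_t pbuf_kva = 128 * 1024;		/* 128kB of kva per pbuf */
	size_t old_total = nswbuf * pbuf_kva;	/* one shared pool: 32MB */

	/*
	 * The old pool was overcommitted by a factor of about 5/2: the
	 * per-subsystem wants summed to ~80MB but shared the same 32MB
	 * of kva.  Separate zones with no overcommit reserve it all.
	 */
	size_t new_total = old_total * 5 / 2;

	printf("old: %zu MB, new: about %zu MB\n",
	    old_total >> 20, new_total >> 20);
	return (0);
}
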
Approximately 0MB of the extras are available on systems with 1GB kva,
and less on systems with 512MB kva.

> Can you please confirm that the patch attached works for you?

I don't have any systems affected by the bug, except when I boot with
small hw.physmem or large kmem to test things. hw.physmem=72m leaves
about 2MB available to map into buffers, and doesn't properly reduce
nswbuf, so almost 80MB of kva is still used for pbufs. Allocating
these must fail due to the RAM shortage. The old value of 32MB gives
much the same failures (in practice, a larger operation like fork or
exec tends to fail first).

Limiting available kva is more interesting, and I haven't tested
reducing it intentionally, except once when I expanded kmem a lot to
put a maximal md malloc()-backed disk in it. Expanding kmem steals
from residual kva, and residual kva is not properly scaled except in
my version. Large allocations then tend to cause panics at boot time,
except for ones that crash because they don't check for errors. Here
is debugging output for large allocations (1MB or more) at boot time
on i386:

XX pae_mode=0 with ~2.7 GB mapped RAM:
XX kva_alloc: large allocation: 7490 pages: 0x5800000[0x1d42000] vm radix
XX kva_alloc: large allocation: 6164 pages: 0x8400000[0x1814000] pmap init
XX kva_alloc: large allocation: 28876 pages: 0xa000000[0x70cc000] buf
XX kmem_suballoc: large allocation: 1364 pages: 0x11400000[0x554000] exec
XX kmem_suballoc: large allocation: 10986 pages: 0x11954000[0x2aea000] pipe
XX kva_alloc: large allocation: 6656 pages: 0x14800000[0x1a00000] sfbuf

It went far above the old size of 1GB to nearly 1.5GB, but there is
plenty to spare out of 4GB. Versions that fitted in 1GB started these
allocations about 256MB lower and were otherwise similar.

XX pae_mode=1 with 16 GB mapped RAM:
XX kva_alloc: large allocation: 43832 pages: 0x14e00000[0xab38000] vm radix
XX kva_alloc: large allocation: 15668 pages: 0x20000000[0x3d34000] pmap init
XX kva_alloc: large allocation: 28876 pages: 0x23e00000[0x70cc000] buf
XX kmem_suballoc: large allocation: 1364 pages: 0x2b000000[0x554000] exec
XX kmem_suballoc: large allocation: 16320 pages: 0x2b554000[0x3fc0000] pipe
XX kva_alloc: large allocation: 6656 pages: 0x2f600000[0x1a00000] sfbuf

Only the vm radix and pmap init allocations are different, and they
start much higher. The allocations now go over 3GB without any useful
expansion except for the page tables. PAE didn't work with 16 GB RAM
and 1 GB kva, except in my version. PAE needed to be configured with
2 GB of kva to work with 16 GB RAM, but that was not the default or
clearly documented.

XX old PAE, fixed to fit 16GB RAM in 1GB KVA:
XX kva_alloc: large allocation: 15691 pages: 0xd2c00000[0x3d4b000] pmap init
XX kva_alloc: large allocation: 43917 pages: 0xd6a00000[0xab8d000] vm radix
XX kva_alloc: large allocation: 27300 pages: 0xe1600000[0x6aa4000] buf
XX kmem_suballoc: large allocation: 1364 pages: 0xe8200000[0x554000] exec
XX kmem_suballoc: large allocation: 2291 pages: 0xe8754000[0x8f3000] pipe
XX kva_alloc: large allocation: 6336 pages: 0xe9200000[0x18c0000] sfbuf

PAE uses much more kva (almost 256MB extra) before the pmap and radix
initializations here too. This is page table metadata allocated before
kva allocations are available. The fixes start by keeping track of
this amount. It is about 1/16 of the address space for PAE in 1GB, so
all later scaling was off by a factor of 16/15 (too high), and since
there was less than 1/16 of 1GB to spare, PAE didn't fit. Only 'pipe'
is reduced significantly to fit.
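
(The "large allocation" lines above presumably come from local
debugging instrumentation rather than stock -current. A minimal sketch
of the kind of hook that could produce them, under that assumption:
the helper name, the 1MB threshold and the format string are guesses
from the output, and calls like
report_large_kva("kva_alloc", "buf", addr, size) would have to be
added by hand near the ends of kva_alloc() and kmem_suballoc().)

#include <sys/param.h>
#include <sys/systm.h>

#include <vm/vm.h>
#include <vm/vm_param.h>

/*
 * Hypothetical sketch only: report kva allocations of 1MB or more in
 * the style of the "XX" lines above.
 */
static void
report_large_kva(const char *func, const char *what, vm_offset_t addr,
    vm_size_t size)
{
	if (size >= 1024 * 1024)
		printf("%s: large allocation: %lu pages: %#jx[%#jx] %s\n",
		    func, (u_long)atop(size), (uintmax_t)addr,
		    (uintmax_t)size, what);
}
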
swzone is reduced to 1 page in all cases, so it doesn't show here. Its
old size was about the same as sfbuf's, IIRC. The fixes were developed
before reducing swzone and needed to squeeze harder to fit. Otherwise,
panics tended to occur in the swzone allocation. sfbuf is the most
mis-scaled and must be reduced significantly when RAM is small, and
could be reduced under kva pressure too. It was the hardest to debug
since it doesn't check for allocation failures. The above leaves more
than 256MB at the end. This is mostly reserved for kmem. kmem ends up
at about 200MB (down from 341MB).

XX old non-PAE with the fixes needed for old PAE, ~2.7 GB RAM in 1GB KVA:
XX kva_alloc: large allocation: 7517 pages: 0xc4c00000[0x1d5d000] pmap init
XX kva_alloc: large allocation: 6164 pages: 0xc7000000[0x1814000] vm radix
XX kva_alloc: large allocation: 42848 pages: 0xc8c00000[0xa760000] buf
XX kmem_suballoc: large allocation: 1364 pages: 0xd3400000[0x554000] exec
XX kmem_suballoc: large allocation: 4120 pages: 0xd3954000[0x1018000] pipe
XX kva_alloc: large allocation: 6656 pages: 0xd5000000[0x1a00000] sfbuf

Since pmap starts almost 256MB lower and the pmap and radix allocations
are naturally much smaller, and I still shrink 'pipe', there is plenty
of space for useful expansion. I only expand 'buf' back to a value that
gives the historical maxbufspace, and kmem a lot, and vnode space in
kmem a lot. The space at the end is about 700MB. kmem is 527MB (up from
341MB).

Back to -current. The 128KB allocations go somewhere in gaps between
the reported allocations (left by smaller aligned uma allocations?),
then at the end. dmesg is not spammed by printing such small
allocations, but combined they are 279MB without this patch.

pbuf_prealloc() is called towards the end of the boot, long after all
the allocations reported above. It uses space that is supposed to be
reserved for kmem when kva is small. It allocates many buffers
(perhaps 100) in gaps before starting a contiguous range of
allocations at the end. Using the gaps is good for minimizing
fragmentation, provided these buffers are never freed.

Bruce