Date: Sun, 25 Nov 2018 12:16:26 +0200
From: Konstantin Belousov
To: current@FreeBSD.org, freebsd-arm@FreeBSD.org
Subject: Re: maxswzone NOT used correctly and defaults incorrect?
Message-ID: <20181125101626.GX2378@kib.kiev.ua>
References: <20181124090429.GI10067@funkthat.com> <20181124104032.GV2378@kib.kiev.ua> <20181124200934.GJ10067@funkthat.com>
In-Reply-To: <20181124200934.GJ10067@funkthat.com>

On Sat, Nov 24, 2018 at 12:09:34PM -0800, John-Mark
Gurney wrote:
> Konstantin Belousov wrote this message on Sat, Nov 24, 2018 at 12:40 +0200:
> > On Sat, Nov 24, 2018 at 01:04:29AM -0800, John-Mark Gurney wrote:
> > > I have a BeagleBoard Black. I'm running a recent snapshot:
> > > FreeBSD generic 13.0-CURRENT FreeBSD 13.0-CURRENT r340239 GENERIC arm
> > >
> > > aka:
> > > FreeBSD-13.0-CURRENT-arm-armv7-BEAGLEBONE-20181107-r340239.img.xz
> > >
> > > It has 512MB of memory on board. I created a 4GB swap file. According
> > > to loader(8), this should be the default capable:
> > >     in bytes of KVA space. If no value is provided, the system
> > >     allocates enough memory to handle an amount of swap that
> > >     corresponds to eight times the amount of physical memory
> > >     present in the system.
> > >
> > > avail memory = 505909248 (482 MB)
> > >
> > > but I get this:
> > > warning: total configured swap (1048576 pages) exceeds maximum recommended amount (248160 pages).
> > > warning: increase kern.maxswzone or reduce amount of swap.
> > >
> > > So, it appears that it's only 2x the amount of memory, NOT 8x like the
> > > documentation says.
> > >
> > > When running make in sbin/ggate/ggated, make consumes a large amount
> > > of memory.
> > > Before the OOM killer just kicked in, top showed:
> > > Mem: 224M Active, 4096 Inact, 141M Laundry, 121M Wired, 57M Buf, 2688K Free
> > > Swap: 1939M Total, 249M Used, 1689M Free, 12% Inuse, 1196K Out
> > >
> > >   PID  UID  THR PRI NICE  SIZE   RES STATE   TIME   WCPU COMMAND
> > >  1029 1001    1  44    0  594M 3848K RUN     2:03 38.12% make
> > >
> > > swapinfo -k showed:
> > > /dev/md99 4194304 254392 3939912 6%
> > >
> > > sysctl:
> > > vm.swzone: 4466880
> > > vm.swap_maxpages: 496320
> > > kern.maxswzone: 0
> > >
> > > dmesg when OOM strikes:
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1029 (make), uid 1001, was killed: out of swap space
> > > pid 984 (bash), uid 1001, was killed: out of swap space
> > > pid 956 (bash), uid 1001, was killed: out of swap space
> > > pid 952 (sshd), uid 0, was killed: out of swap space
> > > pid 1043 (bash), uid 1001, was killed: out of swap space
> > > pid 626 (dhclient), uid 65, was killed: out of swap space
> > > pid 955 (sshd), uid 1001, was killed: out of swap space
> > > pid 1025 (bash), uid 1001, was killed: out of swap space
> > > swblk zone ok
> > > lock order reversal:
> > > 1st 0xd374d028 filedesc structure (filedesc structure) @ /usr/src/sys/kern/sys_generic.c:1451
> > > 2nd 0xd41a5bc4 devfs (devfs) @ /usr/src/sys/kern/vfs_vnops.c:1513
> > > stack backtrace:
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 981 (tmux), uid 1001, was killed: out of swap space
> > > pid 983 (tmux), uid 1001, was killed: out of swap space
> > > pid 1031 (bash), uid 1001, was killed: out of swap space
> > > pid 580 (dhclient), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 577 (dhclient), uid 0, was killed: out of swap space
> > > pid 627 (devd), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 942 (getty), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1205 (init), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > pid 1206 (init), uid 0, was killed: out of swap space
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > swblk zone ok
> > > swap blk zone exhausted, increase kern.maxswzone
> > > swblk zone ok
> > >
> > > So, as you can see, despite having plenty of swap, and swap usage being
> > > well below any of the maximums, the OOM killer kicked in, and killed off
> > > a bunch of processes.
> > OOM is guided by the pagedaemon progress, not by the swap amount left.
> > If the system cannot meet the pagedaemon target by doing
> > $(sysctl vm.pageout_oom_seq) back-to-back page daemon passes,
> > it declares an OOM condition. E.g. if you have a very active process which
> > keeps a lot of active memory by referencing the pages, and simultaneously
> > a slow or stuck swap device, then you get into this state.
> >
> > Just by looking at the top stats, you have a single page in the inactive
> > queue, which means that pagedaemon desperately frees clean pages and
> > moves dirty pages into the laundry. Also, you have a relatively large
> > laundry queue, which supports the theory about slow swap.
>
> Yes, swap is "slow" by modern standards, but not really that slow... I'm
> swapping out at over 10MB/sec... For such a system, this is quite
> fast...
>
> Though maybe I wasn't explicit, it's very clear that I'm running out
> of the swap blk zone, per the very first message, and the vmstat -z
> stats below (and the resulting failures):
> swap blk zone exhausted
>
> > You may try to increase vm.pageout_oom_seq to move the OOM trigger further
> > after the system is overloaded with swapping.
> >
> > >
> > > It also looks like the algorithm for calculating kern.maxswzone is not
> > > correct.
> > >
> > > I just tried to run the system w/:
> > > kern.maxswzone: 21474836
> > >
> > > and it again died w/ plenty of swap free:
> > > /dev/md99 4194304 238148 3956156 6%
> > >
> > > This time I had vmstat -z | grep sw running, and saw:
> > > swpctrie: 48, 62084, 145, 270, 203, 0, 0
> > > swblk: 72, 62040, 56357, 18, 56587, 0, 0
> > >
> > > after the system died, I logged back in and saw:
> > > swpctrie: 48, 62084, 28, 387, 240, 0, 0
> > > swblk: 72, 62040, 175, 61865, 62957, 16, 0
> > >
> > > so, it clearly ran out of swblk space VERY early, when only consuming
> > > around 232MB of swap...
> > >
> > > Hmm... it looks like swblk and swpctrie are not affected by the setting
> > > of kern.maxswzone... I just set it to:
> > > kern.maxswzone: 85899344
> > >
> > > and the limits for the zones did not increase at ALL:
> > > swpctrie: 48, 62084, 0, 0, 0, 0, 0
> > > swblk: 72, 62040, 0, 0, 0, 0, 0
> > The swap metadata zones must have all the KVA reserved in advance,
> > because we cannot wait for AS or memory while we try to free some
> > memory. At boot, the swap init code allocates KVA starting with the
> > requested amount. If the allocation fails, it reduces the amount to
> > 2/3 and retries, until the allocation succeeds. What you see in limits
> > is the actual amount of KVA that your platform is able to provide for
> > reserve, so increasing the maxswzone only results in more iterations to
> > allocate.
>
> Except that I don't see the warning "Swap blk zone entries reduced
> from" in the dmesg, which I'd expect to see if that code is triggered...
>
> I find it hard to believe that it can't allocate more than 5MB of KVA
> at boot... per above, 72*62040 ~= 4.26MB...
>
> It does look like the calculation is correct for swblk assuming maxswzone
> is not set (0), as:
> vm.stats.vm.v_page_count: 124041
>
> and:
> n = vm_cnt.v_page_count / 2;
>
> I'll be adding a print for maxswzone to make sure it's getting set,
> though it'll take me a while to get a kernel built...
>
> and kenv does show it set:
> [freebsd@generic ~]$ sysctl kern.maxswzone
> kern.maxswzone: 85899344
> [freebsd@generic ~]$ kenv | grep kern.maxswzone
> kern.maxswzone="85899344"
>
> so how that code isn't being triggered is quite strange...

Try this

diff --git a/sys/vm/swap_pager.c b/sys/vm/swap_pager.c
index 54370523086..b5e92bc97ee 100644
--- a/sys/vm/swap_pager.c
+++ b/sys/vm/swap_pager.c
@@ -547,12 +547,12 @@ swap_pager_swap_init(void)
 	mtx_unlock(&pbuf_mtx);
 
 	/*
-	 * Initialize our zone, guessing on the number we need based
-	 * on the number of pages in the system.
+	 * Initialize our zone, taking the user sizing or guessing on
+	 * the number we need based on the number of pages in the
+	 * system.
 	 */
-	n = vm_cnt.v_page_count / 2;
-	if (maxswzone && n > maxswzone / sizeof(struct swblk))
-		n = maxswzone / sizeof(struct swblk);
+	n = maxswzone != 0 ? maxswzone / sizeof(struct swblk) :
+	    vm_cnt.v_page_count / 2;
 	swpctrie_zone = uma_zcreate("swpctrie", pctrie_node_size(), NULL, NULL,
 	    pctrie_zone_init, NULL, UMA_ALIGN_PTR, UMA_ZONE_VM);
 	if (swpctrie_zone == NULL)
@@ -580,7 +580,7 @@ swap_pager_swap_init(void)
 	n = uma_zone_get_max(swblk_zone);
 
 	if (n < n2)
-		printf("Swap blk zone entries reduced from %lu to %lu.\n",
+		printf("Swap blk zone entries changed from %lu to %lu.\n",
 		    n2, n);
 	swap_maxpages = n * SWAP_META_PAGES;
 	swzone = n * sizeof(struct swblk);
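
To make the effect of the first hunk concrete, here is a small userland sketch of the two sizing rules, outside the kernel. The function names and the SWBLK_SIZE constant are made up for illustration; 72 is the swblk item size reported by vmstat -z in this thread, standing in for sizeof(struct swblk). It only models the initial value of n, not the later KVA reservation or any rounding done by the zone allocator.

```c
/*
 * Hypothetical stand-in for sizeof(struct swblk); 72 is the item size
 * that vmstat -z reported for the swblk zone in this thread.
 */
#define SWBLK_SIZE 72UL

/*
 * Sizing before the patch: start from page_count / 2 and let a non-zero
 * maxswzone only lower that value, never raise it.
 */
unsigned long
swblk_entries_old(unsigned long page_count, unsigned long maxswzone)
{
	unsigned long n = page_count / 2;

	if (maxswzone != 0 && n > maxswzone / SWBLK_SIZE)
		n = maxswzone / SWBLK_SIZE;
	return (n);
}

/*
 * Sizing after the patch: a non-zero maxswzone replaces the default
 * outright; page_count / 2 is used only when maxswzone is unset.
 */
unsigned long
swblk_entries_new(unsigned long page_count, unsigned long maxswzone)
{
	return (maxswzone != 0 ? maxswzone / SWBLK_SIZE : page_count / 2);
}
```

With the numbers reported above (v_page_count 124041, kern.maxswzone 85899344), the old rule gives 62020 entries whether or not the tunable is set, while the patched rule gives 1193046 as the starting request. The reported zone limit of 62040 is slightly above 62020, presumably because of allocator rounding.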