Date: Sat, 24 Nov 2018 12:09:34 -0800
From: John-Mark Gurney <jmg@funkthat.com>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: current@FreeBSD.org, freebsd-arm@FreeBSD.org
Subject: Re: maxswzone NOT used correctly and defaults incorrect?
Message-ID: <20181124200934.GJ10067@funkthat.com>
Mail-Followup-To: Konstantin Belousov <kostikbel@gmail.com>,
 current@FreeBSD.org, freebsd-arm@FreeBSD.org
References: <20181124090429.GI10067@funkthat.com>
 <20181124104032.GV2378@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20181124104032.GV2378@kib.kiev.ua>

Konstantin Belousov wrote this message on Sat, Nov 24, 2018 at 12:40 +0200:
> On Sat, Nov 24, 2018 at 01:04:29AM -0800, John-Mark Gurney wrote:
> > I have a BeagleBone Black.  I'm running a recent snapshot:
> > FreeBSD generic 13.0-CURRENT FreeBSD 13.0-CURRENT r340239 GENERIC  arm
> > 
> > aka:
> > FreeBSD-13.0-CURRENT-arm-armv7-BEAGLEBONE-20181107-r340239.img.xz
> > 
> > It has 512MB of memory on board.  I created a 4GB swap file.  According
> > to loader(8), this should be within the default capacity:
> >                    in bytes of KVA space.  If no value is provided, the system
> >                    allocates enough memory to handle an amount of swap that
> >                    corresponds to eight times the amount of physical memory
> >                    present in the system.
> > 
> > avail memory = 505909248 (482 MB)
> > 
> > but I get this:
> > warning: total configured swap (1048576 pages) exceeds maximum recommended amount (248160 pages).
> > warning: increase kern.maxswzone or reduce amount of swap.
> > 
> > So it appears that the limit is only 2x the amount of memory, NOT 8x
> > like the documentation says.
> > 
> > When running make in sbin/ggate/ggated, make consumes a large amount
> > of memory.  Just before the OOM killer kicked in, top showed:
> > Mem: 224M Active, 4096 Inact, 141M Laundry, 121M Wired, 57M Buf, 2688K Free
> > Swap: 1939M Total, 249M Used, 1689M Free, 12% Inuse, 1196K Out
> > 
> >   PID    UID      THR PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >  1029   1001        1  44    0   594M  3848K RUN      2:03  38.12% make
> > 
> > swapinfo -k showed:
> > /dev/md99         4194304   254392  3939912     6%
> > 
> > sysctl:
> > vm.swzone: 4466880
> > vm.swap_maxpages: 496320
> > kern.maxswzone: 0
> > 
> > dmesg when OOM strikes:
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 1029 (make), uid 1001, was killed: out of swap space
> > pid 984 (bash), uid 1001, was killed: out of swap space
> > pid 956 (bash), uid 1001, was killed: out of swap space
> > pid 952 (sshd), uid 0, was killed: out of swap space
> > pid 1043 (bash), uid 1001, was killed: out of swap space
> > pid 626 (dhclient), uid 65, was killed: out of swap space
> > pid 955 (sshd), uid 1001, was killed: out of swap space
> > pid 1025 (bash), uid 1001, was killed: out of swap space
> > swblk zone ok
> > lock order reversal:
> >  1st 0xd374d028 filedesc structure (filedesc structure) @ /usr/src/sys/kern/sys_generic.c:1451
> >  2nd 0xd41a5bc4 devfs (devfs) @ /usr/src/sys/kern/vfs_vnops.c:1513
> > stack backtrace:
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 981 (tmux), uid 1001, was killed: out of swap space
> > pid 983 (tmux), uid 1001, was killed: out of swap space
> > pid 1031 (bash), uid 1001, was killed: out of swap space
> > pid 580 (dhclient), uid 0, was killed: out of swap space
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 577 (dhclient), uid 0, was killed: out of swap space
> > pid 627 (devd), uid 0, was killed: out of swap space
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 942 (getty), uid 0, was killed: out of swap space
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 1205 (init), uid 0, was killed: out of swap space
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > pid 1206 (init), uid 0, was killed: out of swap space
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > swblk zone ok
> > swap blk zone exhausted, increase kern.maxswzone
> > swblk zone ok
> > 
> > So, as you can see, despite having plenty of swap, and swap usage being
> > well below any of the maximums, the OOM killer kicked in, and killed off
> > a bunch of processes.
> OOM is guided by the pagedaemon progress, not by the swap amount left.
> If the system cannot meet the pagedaemon target by doing
> $(sysctl vm.pageout_oom_seq) back-to-back page daemon passes,
> it declares an OOM condition. E.g. if you have a very active process which
> keeps a lot of memory active by referencing the pages, and simultaneously
> a slow or stuck swap device, then you get into this state.
> 
> Just by looking at the top stats, you have a single page in the inactive
> queue, which means that pagedaemon desperately frees clean pages and
> moves dirty pages into the laundry.  Also, you have a relatively large
> laundry queue, which supports the theory about slow swap.

Yes, swap is "slow" by modern standards, but not really that slow.  I'm
swapping out at over 10MB/sec, which is quite fast for a system like
this.

Maybe I wasn't explicit, but it's clear that I'm running out of the
swap blk zone: see the very first dmesg message, the vmstat -z stats
below, and the resulting failures:
swap blk zone exhausted
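
To make the mismatch concrete, here is a rough back-of-the-envelope check using only figures quoted in this thread (swapinfo -k, vmstat -z, the boot warning, and sysctl vm.swap_maxpages). It is a sketch, not a diagnosis: the variable names are made up for illustration, the page size is inferred from the two descriptions of the same 4GB swap file, and the per-entry average assumes every swblk entry was in use when the zone ran out.

```python
# Figures copied from the thread; nothing here is measured independently.
SWAP_TOTAL_KB = 4194304      # swapinfo -k, size of /dev/md99 in 1K-blocks
SWAP_TOTAL_PAGES = 1048576   # "total configured swap" in the boot warning
SWAP_USED_KB = 238148        # swapinfo -k, used, at the second failure
SWBLK_LIMIT = 62040          # vmstat -z, swblk LIMIT column
SWAP_MAXPAGES = 496320       # sysctl vm.swap_maxpages

# Page size implied by the two descriptions of the same swap file.
page_kb = SWAP_TOTAL_KB // SWAP_TOTAL_PAGES

# Number of pages that were actually swapped out when the zone ran dry.
used_pages = SWAP_USED_KB // page_kb

# Average swapped pages per swblk entry, assuming all entries were in use.
pages_per_entry = used_pages / SWBLK_LIMIT

# Share of the zone's nominal page capacity that was actually used.
fill = used_pages / SWAP_MAXPAGES

print(f"page size: {page_kb} KB")
print(f"swapped pages: {used_pages}")
print(f"pages per swblk entry: {pages_per_entry:.2f}")
print(f"share of vm.swap_maxpages used: {fill:.1%}")
```

Under those assumptions the zone is exhausted with roughly one swapped page per swblk entry, far short of what vm.swap_maxpages nominally allows.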

> You may try to increase vm.pageout_oom_seq to move the OOM trigger further
> after the system is overloaded with swapping.
> 
> > 
> > It also looks like the algorithm for calculating kern.maxswzone is not
> > correct.
> > 
> > I just tried to run the system w/:
> > kern.maxswzone: 21474836
> > 
> > and it again died w/ plenty of swap free:
> > /dev/md99         4194304   238148  3956156     6%
> > 
> > This time I had vmstat -z | grep sw running, and saw:
> > swpctrie:                48,  62084,     145,     270,     203,   0,   0
> > swblk:                   72,  62040,   56357,      18,   56587,   0,   0
> > 
> > after the system died, I logged back in and saw:
> > swpctrie:                48,  62084,      28,     387,     240,   0,   0
> > swblk:                   72,  62040,     175,   61865,   62957,  16,   0
> > 
> > so, it clearly ran out of swblk space VERY early, when only consuming
> > around 232MB of swap...
> > 
> > Hmm... it looks like swblk and swpctrie are not affected by the setting
> > of kern.maxswzone...  I just set it to:
> > kern.maxswzone: 85899344
> > 
> > and the limits for the zones did not increase at ALL:
> > swpctrie:                48,  62084,       0,       0,       0,   0,   0
> > swblk:                   72,  62040,       0,       0,       0,   0,   0
> The swap metadata zones must have all the KVA reserved in advance,
> because we cannot wait for AS or memory while we try to free some
> memory. At boot, the swap init code allocates KVA starting with the
> requested amount. If the allocation fails, it reduces the amount by
> 2/3 and retries, until the allocation succeeds. What you see in limits
> is the actual amount of KVA that your platform is able to provide for
> reserve, so increasing the maxswzone only results in more iterations to
> allocate.

Except that I don't see the warning "Swap blk zone entries reduced
from" in the dmesg, which I'd expect to see if that code were triggered...
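
For reference, the shrink-and-retry behaviour described in the quoted paragraph can be modelled roughly as below. This is a toy model, not the kernel code: `reserve_with_backoff`, `HYPOTHETICAL_KVA_LIMIT`, and the printed message are invented for illustration, and shrinking each attempt to about two thirds of the previous one is only one reading of "reduces the amount by 2/3". Dividing kern.maxswzone by the 72-byte swblk size from vmstat -z to get an entry count is also an assumption.

```python
def reserve_with_backoff(requested, try_reserve):
    """Return (granted, attempts) using a shrink-and-retry policy."""
    n = requested
    attempts = 0
    while n > 0:
        attempts += 1
        if try_reserve(n):
            return n, attempts
        # Assumption: each retry asks for about two thirds of the last try.
        n -= (n + 2) // 3
    return 0, attempts


# Hypothetical platform that can reserve at most this many entries.
HYPOTHETICAL_KVA_LIMIT = 62040

# kern.maxswzone from the thread, divided by the 72-byte swblk size.
requested = 85899344 // 72

granted, attempts = reserve_with_backoff(
    requested, lambda n: n <= HYPOTHETICAL_KVA_LIMIT
)

# A reduced grant is the case where a warning would be expected.
if granted < requested:
    print(f"entries reduced from {requested} to {granted} "
          f"after {attempts} attempts")
```

In a model like this, any request that has to be shrunk produces a visible "reduced" report, which is why the absence of that warning in dmesg looks odd.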

I find it hard to believe that it can't allocate more than 5MB of KVA
at boot...  per above, 72*62040 ~= 4.26MB...

It does look like the calculation is correct for swblk assuming maxswzone
is not set (0), as:
vm.stats.vm.v_page_count: 124041

and:
n = vm_cnt.v_page_count / 2;
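
As a cross-check, the numbers reported earlier in the thread can be lined up against that default. The sketch below only does arithmetic on quoted values; the variable names are made up, and the pages-per-entry factor and the halving for the warning threshold are inferred from the reported figures rather than taken from the source.

```python
# Values quoted in the thread.
V_PAGE_COUNT = 124041    # sysctl vm.stats.vm.v_page_count
SWBLK_SIZE = 72          # vmstat -z, swblk SIZE column
SWBLK_LIMIT = 62040      # vmstat -z, swblk LIMIT column
VM_SWZONE = 4466880      # sysctl vm.swzone
SWAP_MAXPAGES = 496320   # sysctl vm.swap_maxpages
WARN_PAGES = 248160      # "maximum recommended amount" in the boot warning

# Default entry count when maxswzone is 0, per the line quoted above.
n = V_PAGE_COUNT // 2

# How far the reported zone limit is from that default.
gap = SWBLK_LIMIT - n

# Zone size in bytes and MiB, from the limit and the entry size.
zone_bytes = SWBLK_LIMIT * SWBLK_SIZE
zone_mib = zone_bytes / 2**20

# Inferred, not confirmed: pages covered per entry, and the ratio of
# the warning threshold to physical pages.
pages_per_entry = SWAP_MAXPAGES / SWBLK_LIMIT
warn_vs_phys = WARN_PAGES / V_PAGE_COUNT

print("default n:", n)
print("limit minus n:", gap)
print("limit * size matches vm.swzone:", zone_bytes == VM_SWZONE)
print("zone size in MiB:", round(zone_mib, 2))
print("pages per entry:", pages_per_entry)
print("warning threshold / physical pages:", round(warn_vs_phys, 2))
```

The limit sits within a few entries of v_page_count / 2, and the warning threshold works out to about twice the physical page count, which matches the 2x observation from the original report. The sketch does not try to explain the small gap between n and the reported limit.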

I'll be adding a print for maxswzone to make sure it's getting set,
though it'll take me a while to get a kernel built...

and kenv does show it set:
[freebsd@generic ~]$ sysctl kern.maxswzone
kern.maxswzone: 85899344
[freebsd@generic ~]$ kenv | grep kern.maxswzone
kern.maxswzone="85899344"

so it's quite strange that that code isn't being triggered...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."