From owner-freebsd-arm@freebsd.org Tue Mar 24 22:46:57 2020
Date: Tue, 24 Mar 2020 15:46:58 -0700
From: bob prohaska <fbsd@www.zefox.net>
To: Mark Millard
Cc: freebsd-arm@freebsd.org, bob prohaska
Subject: Re: Belated out of swap kill on rpi3 at r359216
Message-ID: <20200324224658.GA92726@www.zefox.net>
References: <20200324155753.GA91922@www.zefox.net>
 <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com>
 <20200324185518.GA92311@www.zefox.net>
 <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>
In-Reply-To: <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>

The logfile being discussed is at

http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log

for convenient reference. More replies below.
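For anyone joining mid-thread, the three knobs under discussion can be
checked in one pass with sysctl; the values below are the ones this
machine reported (also quoted further down), not necessarily the
defaults elsewhere:

bob@www:/usr/src % sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pageout_oom_seq: 4096
vm.pfault_oom_attempts: 3
vm.pfault_oom_wait: 10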
On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
> 
> On 2020-Mar-24, at 11:55, bob prohaska wrote:
> 
> > On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
> >> 
> >> On 2020-Mar-24, at 08:57, bob prohaska wrote:
> >> 
> >>> An attempt to buildworld on an rpi3 running r359216 stopped with an
> >>> OOMA kill. Sources were at 359264, loader.conf contained
> >>> vm.pageout_oom_seq="4096" .
> >> 
> >> What of the value of vm.pfault_oom_attempts ?
> >> 
> >> If vm.pfault_oom_attempts was not -1,
> >> what of the value of vm.pfault_oom_wait ?
> >> 
> > bob@www:/usr/src % sysctl vm.pfault_oom_wait
> > vm.pfault_oom_wait: 10
> > I didn't change it, so this must be the default.
> > Would it be useful to add something like
> > vm.pfault_oom_wait="20"
> > to /boot/loader.conf?
> 
> I assume that this answer means vm.pfault_oom_attempts
> was not -1. But you did not list what it was.
> 

Sorry, the variable names are running together in my head.

bob@www:/usr/src % sysctl vm.pfault_oom_attempts
vm.pfault_oom_attempts: 3

It's now reset to -1; hopefully it'll work better than last time 8-)
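While I'm at it, a note to myself so I don't forget again: the setting
can be made persistent with one line in a config file. A sketch, not
yet verified by a reboot here; I'm assuming the knob is accepted in
either place. In /etc/sysctl.conf (applied by rc at boot):

vm.pfault_oom_attempts=-1

or in /boot/loader.conf, next to the vm.pageout_oom_seq="4096" entry
already there:

vm.pfault_oom_attempts="-1"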
> vm.pfault_oom_attempts=-1 is the means of preventing
> the vm.pfault_oom_attempts/vm.pfault_oom_wait
> pair from causing OOM at all. (vm.pfault_oom_wait
> becomes irrelevant.)
> 
> When vm.pfault_oom_attempts != -1, there are
> potential time-outs that overall involve:
> 
> vm.pfault_oom_attempts * vm.pfault_oom_wait
> 
> but with some tradeoffs in the partitioning between
> the 2 factors:
> 
> # sysctl -d vm.pfault_oom_attempts
> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
> 
> # sysctl -d vm.pfault_oom_wait
> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
> 
> As I understand it, the following cases could each have
> very different results depending on what the latencies
> are like and such:
> 
> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
> vs.
> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
> vs.
> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
> 
> As I remember, Konstantin Belousov once asked someone
> who was having a repeatable problem to try some
> alternatives that explored this but, as I remember,
> he got no reply to the request.
> 

Konstantin wrote to both me and the list in a thread on
Re: panic: non-current pmap. If it's relevant, please say so.
He also wrote to Don Lewis in a thread on Re: spurious out of
swap kills, but that was on a "16-thread Ryzen machine, with
64 GB of RAM and 40 GB of swap", which is surely unlike my
predicament 8-)

> The person might have just used vm.pfault_oom_attempts=-1
> and continued with their primary activity, for all I know.
> 
> vm.pfault_oom_attempts=-1 is only recommended when one
> is sure that they will not run out of swap/paging space,
> if I understand right.
> 
> For reference, for other settings:
> 
> # sysctl -d vm.pageout_oom_seq
> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
> 
> There is also:
> 
> # sysctl -d vm.oom_pf_secs
> vm.oom_pf_secs:
> 
> but it seems to have an empty description. Maybe
> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
> when vm.pfault_oom_attempts != -1. (I've not looked.)
> 
> >>> A snippet of the gstat log suggests the worst congestion in the storage
> >>> I/O happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
> >>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
> >>> L(q) had dropped to one, half a minute later.
> >>> 
> >>> Is the delay in OOMA action to be expected?
> >>> 
> >>> Here's the relevant part of the log, I hope the columns display readably:
> >>> 
> >>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
> >>> procs     memory        page                       disks     faults        cpu
> >>> r b w     avm     fre   flt   re  pi  po    fr    sr mm0 da0   in    sy   cs us sy id
> >>> 0 0 29 1897320  47312 12851    9   4   5 13330  1669   0   0 14172  8020 3034 65 29  6
> >>> dT: 1.056s  w: 1.000s
> >>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
> >>>    37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7 mmcsd0
> >>>    37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3 mmcsd0s2
> >>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0 mmcsd0s2a
> >>>     3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0 mmcsd0s2b
> >>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2 ufs/rootfs
> >>> Tue Mar 24 04:52:26 PDT 2020
> >>> Device          1K-blocks     Used    Avail Capacity
> >>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
> >> 
> >> How bad were things back when the "indefinite wait buffer" notices
> >> were generated (Mar 24 04:20:50 time frame)?
> >> 
> > It looks like _new_ indefinite wait messages started coming at around
> > Mon Mar 23 23:00:28 PDT 2020:
> > procs     memory        page                       disks     faults        cpu
> > r b w     avm     fre   flt   re  pi  po    fr    sr mm0 da0   in    sy   cs us sy id
> > 2 0 18 1588824 197676 14108    2   1   0 14759   238   0   0 14055  9263 2668 62 32  5
> > dT: 1.027s  w: 1.000s
> >  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
> >     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0
> >     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2
> >     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2a
> >     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 ufs/rootfs
> > Mon Mar 23 23:00:28 PDT 2020
> > Device          1K-blocks     Used    Avail Capacity
> > /dev/mmcsd0s2b    4404252   245288  4158964     6%
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
> > 
> > Admittedly, 300% busy looks pretty bad. Still, the machine didn't quit.....
> > On the next sample %busy went down to 18% for swap, less for all else.
> 
> As I understand it, the ms/w value 13451 means that there was
> for a time (a mean of?) somewhat under 14 seconds from a
> write being queued to it being completed.
> 
> If a read could be blocked for such time frames by such
> writes, it suggests that vm.pfault_oom_wait might need to
> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
> use.
> 

Ok, I'm starting to get it. On this machine vm.pfault_oom_wait is 10.
But the 13-second delay appeared at Mon Mar 23 23:00:28 PDT 2020, and
the kill occurred around Tue Mar 24 04:53:08 PDT 2020, under what look
like much more benign circumstances.

> Or vm.pfault_oom_attempts=-1 could be used to avoid large
> time frames from directly leading to OOM activity.
> 

Understood. I simply forgot to restore the setting after the initial
troubles with it.
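Putting numbers on the product you described, with this machine's old
settings (back-of-envelope only, and assuming the product really is
the rough overall budget):

vm.pfault_oom_attempts * vm.pfault_oom_wait = 3 * 10 = 30 seconds

A page-in stalled behind writes like the 13451 ms/w sample above would
eat almost half of that in a single wait, so it at least seems
plausible that a few back-to-back stalls could exhaust the budget even
though gstat looked benign again by the time the kill was logged.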
> >> Below I show code changes to be more explicit in the
> >> output messaging about what contributes to initiating
> >> OOM kills, without needing boot verbose or the like.
> >> There are also some messages from Mark J.'s old code
> >> for reporting on things related to initiating OOM
> >> kills, or some minor variations of them.
> >> 
> >> You may want to try such changes to provide more
> >> context for your OOM kills when they happen. Below,
> >> the 4 reasons reported are listed (showing the most
> >> detailed of the related messages from the different
> >> stages):
> >> 
> >> swp_pager_meta_build: swap blk uma zone exhausted
> >> 
> >> swp_pager_meta_build: swap pctrie uma zone exhausted
> >> 
> >> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
> >> 
> >> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
> >> 
> > 
> > Could inquiries instead be added to the logging script?
> > Right now it's
> > 
> > #!/bin/sh
> > while true
> > do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> > done
> > 
> > I'd much rather tamper with the logging script than the kernel 8-)
> 
> Unfortunately, the reason I made the kernel messaging
> changes is that, as far as I know, the normal kernel
> simply does not publish the information anywhere when
> the internal criteria lead to OOM activity: no
> inquiry is available without kernel changes.

If all else fails I'll try to apply your patches to the kernel and
recompile.

With my thanks,
bob prohaska
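P.S. As a stopgap on the script side, the most I can think to add
without kernel changes is to record the knob settings alongside each
sample, so the log at least shows what was in effect. A sketch,
untested as yet:

#!/bin/sh
# Log one block of statistics per pass, forever.
while true
do
    vmstat                              # memory and paging summary
    gstat -abd -I 1s                    # disk queue depth and latency
    date                                # timestamp for the sample
    swapinfo                            # swap usage
    tail -n 2 /var/log/messages         # latest kernel messages
    netstat -m | grep "mbuf clusters"   # mbuf cluster usage
    # New: record the OOM-related knobs in effect at sample time.
    sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
done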