Date:      Tue, 24 Mar 2020 15:46:58 -0700
From:      bob prohaska <fbsd@www.zefox.net>
To:        Mark Millard <marklmi@yahoo.com>
Cc:        freebsd-arm@freebsd.org, bob prohaska <fbsd@www.zefox.net>
Subject:   Re: Belated out of swap kill on rpi3 at r359216
Message-ID:  <20200324224658.GA92726@www.zefox.net>
In-Reply-To: <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>
References:  <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net> <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>


The logfile being discussed is at
http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log
for convenient reference. More replies below.


On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
> 
> 
> On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:
> 
> > On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
> >> 
> >> 
> >> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
> >> 
> >>> An attempt to buildworld on an rpi3 running r359216 stopped with an
> >>> OOMA kill. Sources were at 359264, loader.conf contained
> >>> vm.pageout_oom_seq="4096" . 
> >> 
> >> What of the value of vm.pfault_oom_attempts ?
> >> 
> >> If vm.pfault_oom_attempts was not -1,
> >> what of the value of vm.pfault_oom_wait ?
> >> 
> > bob@www:/usr/src % sysctl vm.pfault_oom_wait
> > vm.pfault_oom_wait: 10
> > I didn't change it, this must be the default. 
> > Would it be useful to add something like
> > vm.pfault_oom_wait="20" 
> > to /boot/loader.conf?
> 
> I assume that this answer means vm.pfault_oom_attempts
> was not -1. But you did not list what it was.
>
 
Sorry, the variable names are running together in my head.

bob@www:/usr/src % sysctl vm.pfault_oom_attempts
vm.pfault_oom_attempts: 3
It's now reset to -1; hopefully it'll work better than last time 8-)
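For the record, the reset itself was just the runtime command below
(my understanding is it takes effect immediately, but won't survive a
reboot without a loader.conf or sysctl.conf entry):

sysctl vm.pfault_oom_attempts=-1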

> vm.pfault_oom_attempts=-1 is the means of preventing the
> vm.pfault_oom_attempts/vm.pfault_oom_wait pair from
> causing OOM at all. (vm.pfault_oom_wait
> becomes irrelevant.)
> 

> When vm.pfault_oom_attempts != -1 , then there are
> potential time-outs that overall involve:
> 
>    vm.pfault_oom_attempts * vm.pfault_oom_wait
> 
> but there are tradeoffs in how that total is partitioned
> between the two factors:
> 
> # sysctl -d vm.pfault_oom_attempts
> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
> 
> # sysctl -d vm.pfault_oom_wait
> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
> 
> As I understand, the following cases could each have
> very different results depending on what the latencies
> are like and such:
> 
> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
> vs.
> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
> vs.
> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
> 
> As I remember, Konstantin Belousov once asked someone
> that was having a repeatable problem to try some
> alternatives that explored this but, as I remember,
> he got no reply to the request.
> 
Konstantin wrote to both me and the list in a thread titled
"Re: panic: non-current pmap". If it's relevant please indicate.

He also wrote to Don Lewis, in a thread titled "Re: spurious out of
swap kills", but that was on a
"16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap",
which is surely unlike my predicament 8-)

> The person might have just used: vm.pfault_oom_attempts=-1
> and continued with their primary activity, for all I know.
> 
> vm.pfault_oom_attempts=-1 is only recommended when one
> is sure of not running out of swap/paging space,
> if I understand right.
> 
> For reference, for other settings:
> 
> # sysctl -d vm.pageout_oom_seq
> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
> 
> There is also:
> 
> # sysctl -d vm.oom_pf_secs
> vm.oom_pf_secs: 
> 
> but it seems to have an empty description. May be
> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
> when vm.pfault_oom_attempts != -1 . (I've not looked.)
> 
> >>> A snippet of the gstat log suggests the worst congestion in the storage I/O
> >>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
> >>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
> >>> L(q) had dropped to one, half a minute later.
> >>> 
> >>> Is the delay in OOMA action to be expected? 
> >>> 
> >>> Here's the relevant part of the log, I hope the columns display readably:
> >>> 
> >>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
> >>> procs     memory       page                      disks     faults       cpu
> >>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
> >>> 0  0 29 1897320   47312 12851   9   4   5 13330 1669   0   0 14172  8020  3034 65 29  6
> >>> dT: 1.056s  w: 1.000s
> >>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
> >>>  37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7  mmcsd0
> >>>  37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3  mmcsd0s2
> >>>  34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0  mmcsd0s2a
> >>>   3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0  mmcsd0s2b
> >>>  34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2  ufs/rootfs
> >>> Tue Mar 24 04:52:26 PDT 2020
> >>> Device          1K-blocks     Used    Avail Capacity
> >>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
> >> 
> >> How bad were things back when the "indefinite wait buffer" notices
> >> were generated (Mar 24 04:20:50 time frame)?
> >> 
> > It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
> > procs     memory       page                      disks     faults       cpu
> > r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
> > 2  0 18 1588824  197676 14108   2   1   0 14759  238   0   0 14055  9263  2668 62 32  5
> > dT: 1.027s  w: 1.000s
> > L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
> >    9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0
> >    9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2
> >    6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2a
> >    6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  ufs/rootfs
> > Mon Mar 23 23:00:28 PDT 2020
> > Device          1K-blocks     Used    Avail Capacity
> > /dev/mmcsd0s2b    4404252   245288  4158964     6%
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
> > 
> > Admittedly, 300% busy looks pretty bad. Still, the machine didn't quit...
> > On the next sample %busy went down to 18% for swap, less for all else.
> 
> As I understand it, the ms/w value 13451 means that, for
> a time, there was (as a mean?) somewhat under 14 seconds
> from a write being queued to its completion.
> 
> If a read could be blocked for such time frames by such
> writes, it suggests that vm.pfault_oom_wait might need to
> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
> use.
> 
Ok, I'm starting to get it. On this machine vm.pfault_oom_wait is 10.
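If I'm reading the attempts-times-wait arithmetic right (a rough
estimate, assuming the retries run strictly back to back), the values
in effect here give roughly

    vm.pfault_oom_attempts * vm.pfault_oom_wait = 3 * 10 = 30 seconds

of sustained allocation failure before the page fault handler would
start OOM.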
 
But the 13 second delay appeared at Mon Mar 23 23:00:28 PDT 2020. 
The kill occurred around Tue Mar 24 04:53:08 PDT 2020
under what look like much more benign circumstances. 

> Or vm.pfault_oom_attempts=-1 could be used to avoid large
> time frames from directly leading to OOM activity.
>
Understood. I simply forgot to restore the setting after
the initial troubles with it. 
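The plan now is to carry both settings in /boot/loader.conf so they
survive reboots, roughly:

vm.pageout_oom_seq="4096"
vm.pfault_oom_attempts="-1"

(The pageout_oom_seq line was already there; the pfault line is the
one I'd forgotten to restore.)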
 
> >> 
> >> Below I show code changes to be more explicit in the
> >> output messaging about what contributes to initiating
> >> OOM kills, without needing boot verbose or the like.
> >> There are also some messages from Mark J.'s old code
> >> for reporting on things related to initiating OOM
> >> kills, or some minor variations of them.
> >> 
> > 
> >> You may want to try such changes to provide more
> >> context for your OOM kills when they happen. Below
> >> the 4 reasons reported are (showing the most
> >> detailed of the related messages from different
> >> stages):
> >> 
> >> swp_pager_meta_build: swap blk uma zone exhausted
> >> 
> >> swp_pager_meta_build: swap pctrie uma zone exhausted
> >> 
> >> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
> >> 
> >> m_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
> >> 
> >> 
> > 
> > Could inquiries instead be added to the logging script?
> > Right now it's
> > #!/bin/sh
> > while true
> > do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> > done
> > 
> > I'd much rather tamper with the logging script than the kernel 8-)
> 
> Unfortunately, the reason I made the kernel messaging
> changes is that, as far as I know, the normal kernel
> simply does not publish the information anywhere when
> the internal criteria lead to OOM activity: no
> inquiry is available without kernel changes.

If all else fails, I'll try to apply your patches to the kernel
and recompile.
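In the meantime, a smaller step might be to have the logging script
also record the free and inactive page counts that your proposed
messages report, so there's at least some VM context around a kill.
A sketch (untested; the vm.stats sysctl names are the stock ones):

#!/bin/sh
while true
do
    vmstat
    gstat -abd -I 1s
    date
    swapinfo
    # free and inactive page counts, for context around any OOM kill
    sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count
    tail -n 2 /var/log/messages
    netstat -m | grep "mbuf clusters"
done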

With my thanks,

bob prohaska



