Date:      Tue, 24 Mar 2020 17:21:47 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        bob prohaska <fbsd@www.zefox.net>
Cc:        freebsd-arm@freebsd.org
Subject:   Re: Belated out of swap kill on rpi3 at r359216
Message-ID:  <764D5A86-6A42-44E0-A706-F1C49BB198DA@yahoo.com>
In-Reply-To: <20200324224658.GA92726@www.zefox.net>
References:  <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net> <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com> <20200324224658.GA92726@www.zefox.net>

On 2020-Mar-24, at 15:46, bob prohaska <fbsd at www.zefox.net> wrote:

> The logfile being discussed is at
> http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log
> for convenient reference. More replies below.
>
>
> On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
>>
>>
>> On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:
>>
>>> On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
>>>>
>>>>
>>>> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
>>>>
>>>>> An attempt to buildworld on an rpi3 running r359216 stopped with an
>>>>> OOMA kill. Sources were at 359264, loader.conf contained
>>>>> vm.pageout_oom_seq="4096" .
>>>>
>>>> What of the value of vm.pfault_oom_attempts ?
>>>>
>>>> If vm.pfault_oom_attempts was not -1,
>>>> what of the value of vm.pfault_oom_wait ?
>>>>
>>> bob@www:/usr/src % sysctl vm.pfault_oom_wait
>>> vm.pfault_oom_wait: 10
>>> I didn't change it, this must be the default.
>>> Would it be useful to add something like
>>> vm.pfault_oom_wait="20"
>>> to /boot/loader.conf?
>>
>> I assume that this answer means vm.pfault_oom_attempts
>> was not -1. But you did not list what it was.
>>
>
> Sorry, the variable names are running together in my head.
>
> bob@www:/usr/src % sysctl vm.pfault_oom_attempts
> vm.pfault_oom_attempts: 3
> It's now reset to -1, hopefully it'll work better than last time 8-)
>
>> vm.pfault_oom_attempts=-1 is the means of keeping the
>> vm.pfault_oom_attempts,vm.pfault_oom_wait pair from
>> causing OOM at all. (vm.pfault_oom_wait becomes
>> irrelevant.)
>>
>
>> When vm.pfault_oom_attempts != -1 , then there are
>> potential time-outs that overall involve:
>>
>>   vm.pfault_oom_attempts * vm.pfault_oom_wait
>>
>> but there are some tradeoffs in the partitioning
>> between the 2 factors:
>>
>> # sysctl -d vm.pfault_oom_attempts
>> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
>>
>> # sysctl -d vm.pfault_oom_wait
>> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
>>
>> As I understand, the following cases could each have
>> very different results depending on what the latencies
>> are like and such:
>>
>> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
>> vs.
>> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
>> vs.
>> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
>>
>> As I remember, Konstantin Belousov once asked someone
>> who was having a repeatable problem to try some
>> alternatives that explored this, but he got no reply
>> to the request.
>>
> Konstantin wrote to both me and the list in a thread on
> Re: panic: non-current pmap. If it's relevant please indicate.
>
> He also wrote to Don Lewis, in a thread on Re: spurious out of swap kills
> but that was on a
> " 16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap"
> which is surely unlike my predicament 8-)
>
>> The person might have just used: vm.pfault_oom_attempts=-1
>> and continued with their primary activity, for all I know.
>>
>> vm.pfault_oom_attempts=-1 is only recommended when one
>> is sure that they will not run out of swap/paging space,
>> if I understand right.
>>
>> For reference, for other settings:
>>
>> # sysctl -d vm.pageout_oom_seq
>> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
>>
>> There is also:
>>
>> # sysctl -d vm.oom_pf_secs
>> vm.oom_pf_secs:
>>
>> but it seems to have an empty description. Maybe
>> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
>> when vm.pfault_oom_attempts != -1 . (I've not looked.)
>>
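
For reference, a sketch of what the related /boot/loader.conf lines
could look like, using only values that have already come up in this
thread (treat the specific numbers as placeholders to experiment with,
not recommendations; the same knobs can also be changed at run time
via sysctl):

# /boot/loader.conf sketch
vm.pageout_oom_seq="4096"
# Either take the page-fault OOM path out of the picture entirely
# (only reasonable if swap/paging space will not run out):
vm.pfault_oom_attempts="-1"
# ...or leave it enabled and widen the attempts*wait budget, e.g.:
#vm.pfault_oom_attempts="4"
#vm.pfault_oom_wait="20"

With the values you reported (attempts=3, wait=10), the overall
budget before the fault handler starts OOM handling works out to
roughly 3 * 10 = 30 seconds.
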
>>>>> A snippet of gstat log suggests the worst congestion in the storage I/O
>>>>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
>>>>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
>>>>> L(q) had dropped to one, half a minute later.
>>>>>
>>>>> Is the delay in OOMA action to be expected?
>>>>>
>>>>> Here's the relevant part of the log, I hope the columns display readably:
>>>>>
>>>>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
>>>>> procs     memory       page                      disks     faults      cpu
>>>>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
>>>>> 0  0 29 1897320   47312 12851   9   4   5 13330 1669   0   0 14172  8020  3034 65 29  6
>>>>> dT: 1.056s  w: 1.000s
>>>>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>>>>> 37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7  mmcsd0
>>>>> 37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3  mmcsd0s2
>>>>> 34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0  mmcsd0s2a
>>>>>  3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0  mmcsd0s2b
>>>>> 34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2  ufs/rootfs
>>>>> Tue Mar 24 04:52:26 PDT 2020
>>>>> Device          1K-blocks     Used    Avail Capacity
>>>>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
>>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
>>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>>>>
>>>> How bad were things back when the "indefinite wait buffer" notices
>>>> were generated (Mar 24 04:20:50 time frame)?
>>>>
>>> It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
>>> procs     memory       page                      disks     faults      cpu
>>> r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
>>> 2  0 18 1588824  197676 14108   2   1   0 14759  238   0   0 14055  9263  2668 62 32  5
>>> dT: 1.027s  w: 1.000s
>>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>>>   9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0
>>>   9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2
>>>   6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2a
>>>   6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  ufs/rootfs
>>> Mon Mar 23 23:00:28 PDT 2020
>>> Device          1K-blocks     Used    Avail Capacity
>>> /dev/mmcsd0s2b    4404252   245288  4158964     6%
>>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
>>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
>>>
>>> Admittedly, 300% busy looks pretty bad. Still the machine didn't quit.....
>>> On the next sample %busy went down to 18% for swap, less for all else.
>>
>> As I understand, the ms/w value 13451 means that there was
>> for a time (a mean of ?) somewhat under 14 seconds from a
>> write being queued to it being completed.
>>
>> If a read could be blocked for such time frames by such
>> writes, it suggests that vm.pfault_oom_wait might need to
>> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
>> use.
>>
> Ok, I'm starting to get it. On this machine it's 10.
>
> But the 13 second delay appeared at Mon Mar 23 23:00:28 PDT 2020.
> The kill occurred around Tue Mar 24 04:53:08 PDT 2020
> under what look like much more benign circumstances.

I had intended to ask about the text next-to/just-above,
not much earlier [thus, the "(Mar 24 04:20:50 time frame)"].
So I was asking about:

Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096

and did not originally notice that you replied with
information for "Mon Mar 23 23:00:28 PDT 2020".

Other than showing examples of what the delays can be,
the "Mar 24 04:20:50" messages are of more direct relevance
for the OOM kill that happened.

I'll show the text around/before 04:20:50 later in this
note. It proves somewhat interesting.

>> Or vm.pfault_oom_attempts=-1 could be used to keep large
>> time frames from directly leading to OOM activity.
>>
> Understood. I simply forgot to restore the setting after
> the initial troubles with it.
>
>>>>
>>>> Below I show code changes to be more explicit in the
>>>> output messaging about what contributes to initiating
>>>> OOM kills, without needing boot verbose or the like.
>>>> There are also some messages from Mark J.'s old code
>>>> for reporting on things related to initiating OOM
>>>> kills, or some minor variations of them.
>>>>
>>>
>>>> You may want to try such changes to provide more
>>>> context for your OOM kills when they happen. Below
>>>> the 4 reasons reported are (showing the most
>>>> detailed of the related messages from different
>>>> stages):
>>>>
>>>> swp_pager_meta_build: swap blk uma zone exhausted
>>>>
>>>> swp_pager_meta_build: swap pctrie uma zone exhausted
>>>>
>>>> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
>>>>
>>>> m_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
>>>>
>>>>
>>>
>>> Could inquiries instead be added to the logging script?
>>> Right now it's
>>> #!/bin/sh
>>> while true
>>> do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
>>> done
>>>
>>> I'd much rather tamper with the logging script than the kernel 8-)
>>
>> Unfortunately, the reason I made the kernel messaging
>> changes is that, as far as I know, the normal kernel
>> simply does not publish the information anywhere when
>> the internal criteria leads to OOM activity: No
>> inquiry available without kernel changes.
>
> If all else fails I'll try to apply your patches to the kernel
> and recompile.

Up to you, of course.
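
In case it is of any use in the meantime: the counters that the VM
system already publishes via sysctl can be logged without touching
the kernel; they just do not record _why_ an OOM kill was started.
A sketch of that sort of addition to your existing script (the
vm.stats names below are the stock ones, shown only as an
illustration):

#!/bin/sh
while true
do vmstat ; gstat -abd -I 1s ; date ; swapinfo
   # free/inactive page counts and swap pager traffic, for context:
   sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count
   sysctl vm.stats.vm.v_swappgsin vm.stats.vm.v_swappgsout
   tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
done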


The text before the "Mar 24 04:20:50" messages, with notes
and blank lines mixed in to group things:

. . .
Tue Mar 24 04:20:38 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
3/773/776/19177 mbuf clusters in use (current/cache/total/max)

procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   54180 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.031s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
   18      1      0      0    0.0      1     31  26686      0      0    0.0  287.1  mmcsd0
   18      1      0      0    0.0      1     31  26686      0      0    0.0  287.1  mmcsd0s2
   10      1      0      0    0.0      1     31  26686      0      0    0.0  287.1  mmcsd0s2a
   10      1      0      0    0.0      1     31  26686      0      0    0.0  287.1  ufs/rootfs
Tue Mar 24 04:20:40 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
6/770/776/19177 mbuf clusters in use (current/cache/total/max)

Note that the above has 26sec+ ms/w figures. No reads.
No mmcsd0s2b.


procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   54252 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.064s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
Tue Mar 24 04:20:42 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w figures,
just the title rows show.


procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   53668 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.046s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
   17      1      0      0    0.0      1     31  30042      0      0    0.0  320.7  mmcsd0
   17      1      0      0    0.0      1     31  30042      0      0    0.0  320.7  mmcsd0s2
    9      1      0      0    0.0      1     31  30042      0      0    0.0  320.7  mmcsd0s2a
    9      1      0      0    0.0      1     31  30042      0      0    0.0  320.7  ufs/rootfs
Tue Mar 24 04:20:43 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)

Note that the above has 30sec+ ms/w figures. No reads.

Also note that the L(q) figures dropped by only 1 over
the roughly 04:20:40 to 04:20:43 interval (18->17 and
10->9). (I only take these deltas as suggestive. They
might be accidental near matches.) No mmcsd0s2b.


procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   53624 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
Tue Mar 24 04:20:45 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w figures,
just the title rows show.


procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   53600 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.063s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
Tue Mar 24 04:20:47 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w figures,
just the title rows show.



procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   53592 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.006s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
Tue Mar 24 04:20:48 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w figures,
just the title rows show.



procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 0  0 30 2034696   53672 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
Tue Mar 24 04:20:50 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w figures,
just the title rows show.



procs     memory       page                      disks     faults       cpu
r b w     avm     fre  flt  re  pi  po    fr   sr mm0 da0   in   sy   cs us sy id
 2  0 30 2045028   54160 12942   7   2   4 13432 1374   0   0 14074  8102  2849 65 29  5
dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
    2    329    236   3839   13.0     93   1262   18.6      0      0    0.0   99.4  mmcsd0
    2    329    236   3839   13.2     93   1262   18.7      0      0    0.0  101.1  mmcsd0s2
    1    153    152   2816   12.5      1     32  366.5      0      0    0.0   95.2  mmcsd0s2a
    1    176     84   1023   14.5     92   1230   14.9      0      0    0.0   66.8  mmcsd0s2b
    1    153    152   2816   12.5      1     32  366.6      0      0    0.0   95.3  ufs/rootfs
Tue Mar 24 04:20:51 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   296976  4107276     7%
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
0/1032/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, the lines with ms/r and ms/w figures are back,
but the indefinite wait buffer messages for "Mar 24 04:20:50"
happened too, before the "Tue Mar 24 04:20:51 PDT 2020" above.
I'd guess they are from before the data that the gstat output
is based on.

Also there are lots of reads and some writes in the above
gstat output. mmcsd0s2b is again showing as well.

Note: It looks like ufs/rootfs might be a label
identifying mmcsd0s2a, so they track up to slight
time-of-information differences. I'm guessing that
mmcsd0s2b is just a swap/paging partition.

So it looks like activity for ufs/rootfs may have
blocked/delayed activity for paging/swapping
(mmcsd0s2b), at least for a time, even if it is
not directly the cause of the specific OOM
activity.
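
(If you ever want to confirm that mapping, the stock tools should
show it; just a sketch of what could be run:

# glabel status      <- should list ufs/rootfs with mmcsd0s2a as its component
# gpart show mmcsd0  <- shows the slice/partition layout (mmcsd0s2a, mmcsd0s2b)

The swapinfo output already in your script names /dev/mmcsd0s2b as
the swap device.)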



===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



