Date: Tue, 24 Mar 2020 17:21:47 -0700
From: Mark Millard <marklmi@yahoo.com>
To: bob prohaska <fbsd@www.zefox.net>
Cc: freebsd-arm@freebsd.org
Subject: Re: Belated out of swap kill on rpi3 at r359216
Message-ID: <764D5A86-6A42-44E0-A706-F1C49BB198DA@yahoo.com>
In-Reply-To: <20200324224658.GA92726@www.zefox.net>
References: <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net> <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com> <20200324224658.GA92726@www.zefox.net>
On 2020-Mar-24, at 15:46, bob prohaska <fbsd at www.zefox.net> wrote:

> The logfile being discussed is at
> http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log
> for convenient reference. More replies below.
>
>
> On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
>>
>>
>> On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:
>>
>>> On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
>>>>
>>>>
>>>> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
>>>>
>>>>> An attempt to buildworld on an rpi3 running r359216 stopped with an
>>>>> OOMA kill. Sources were at 359264, loader.conf contained
>>>>> vm.pageout_oom_seq="4096" .
>>>>
>>>> What of the value of vm.pfault_oom_attempts ?
>>>>
>>>> If vm.pfault_oom_attempts was not -1,
>>>> what of the value of vm.pfault_oom_wait ?
>>>>
>>> bob@www:/usr/src % sysctl vm.pfault_oom_wait
>>> vm.pfault_oom_wait: 10
>>> I didn't change it, this must be the default.
>>> Would it be useful to add something like
>>> vm.pfault_oom_wait="20"
>>> to /boot/loader.conf?
>>
>> I assume that this answer means vm.pfault_oom_attempts
>> was not -1. But you did not list what it was.
>>
>
> Sorry, the variable names are running together in my head.
>
> bob@www:/usr/src % sysctl vm.pfault_oom_attempts
> vm.pfault_oom_attempts: 3
> It's now reset to -1, hopefully it'll work better than last time 8-)
>
>> vm.pfault_oom_attempts=-1 is the means of keeping
>> the vm.pfault_oom_attempts, vm.pfault_oom_wait
>> pair from causing OOM at all. (vm.pfault_oom_wait
>> becomes irrelevant.)
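[Editor's note: for readers following the thread, the /boot/loader.conf question above can be sketched as a config fragment. This is illustrative only, combining the tunables named in this exchange; the values shown are the ones discussed here, not recommendations.]

```shell
# Hypothetical /boot/loader.conf fragment (values from this thread,
# shown for illustration, not as tested recommendations):
vm.pageout_oom_seq="4096"      # many more back-to-back OOM-detector passes before kills
vm.pfault_oom_attempts="-1"    # disable page-fault-driven OOM; pfault_oom_wait then ignored
# Only relevant when vm.pfault_oom_attempts is not -1:
# vm.pfault_oom_wait="20"
```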
>>
>
>> When vm.pfault_oom_attempts != -1 , then there are
>> potential time-outs that overall involve:
>>
>> vm.pfault_oom_attempts * vm.pfault_oom_wait
>>
>> but there are some tradeoffs in the partitioning between
>> the 2 factors:
>>
>> # sysctl -d vm.pfault_oom_attempts
>> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
>>
>> # sysctl -d vm.pfault_oom_wait
>> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
>>
>> As I understand, the following cases could each have
>> very different results depending on what the latencies
>> are like and such:
>>
>> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
>> vs.
>> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
>> vs.
>> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
>>
>> As I remember, Konstantin Belousov once asked someone
>> who was having a repeatable problem to try some
>> alternatives that explored this but, as I remember,
>> he got no reply to the request.
>>
> Konstantin wrote to both me and the list in a thread on
> Re: panic: non-current pmap. If it's relevant please indicate.
>
> He also wrote to Don Lewis, in a thread on Re: spurious out of swap kills,
> but that was on a
> "16-thread Ryzen machine, with 64 GB of RAM and 40 GB of swap"
> which is surely unlike my predicament 8-)
>
>> The person might have just used: vm.pfault_oom_attempts=-1
>> and continued with their primary activity, for all I know.
>>
>> vm.pfault_oom_attempts=-1 is only recommended when one
>> is sure that they will not run out of swap/paging space,
>> if I understand right.
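[Editor's note: the attempts * wait product above is easy to check with plain sh arithmetic. A minimal sketch, using the three example partitionings from the quoted text; each pair yields the same overall window but splits it differently between retry count and per-retry wait.]

```shell
#!/bin/sh
# Overall page-fault OOM window is attempts * wait seconds
# (only meaningful when vm.pfault_oom_attempts != -1).
for pair in "20 1" "1 20" "4 5"; do
  set -- $pair
  attempts=$1
  wait_secs=$2
  echo "attempts=$attempts wait=${wait_secs}s -> overall window $((attempts * wait_secs))s"
done
```

All three partitionings give a 20-second overall window, yet could behave very differently under bursty I/O latency, which is the tradeoff being discussed.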
>>
>> For reference, for other settings:
>>
>> # sysctl -d vm.pageout_oom_seq
>> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
>>
>> There is also:
>>
>> # sysctl -d vm.oom_pf_secs
>> vm.oom_pf_secs:
>>
>> but it seems to have an empty description. Maybe
>> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
>> when vm.pfault_oom_attempts != -1 . (I've not looked.)
>>
>>>>> A snippet of gstat log suggests the worst congestion in the storage I/O
>>>>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
>>>>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
>>>>> L(q) had dropped to one, half a minute later.
>>>>>
>>>>> Is the delay in OOMA action to be expected?
>>>>>
>>>>> Here's the relevant part of the log, I hope the columns display readably:
>>>>>
>>>>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
>>>>>  procs    memory        page                       disks      faults        cpu
>>>>>  r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
>>>>>  0 0 29 1897320 47312 12851   9   4   5 13330  1669   0   0 14172  8020 3034 65 29  6
>>>>> dT: 1.056s  w: 1.000s
>>>>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>>>>>    37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7 mmcsd0
>>>>>    37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3 mmcsd0s2
>>>>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0 mmcsd0s2a
>>>>>     3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0 mmcsd0s2b
>>>>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2 ufs/rootfs
>>>>> Tue Mar 24 04:52:26 PDT 2020
>>>>> Device          1K-blocks     Used    Avail Capacity
>>>>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
>>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
>>>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>>>>
>>>> How bad were things back when the "indefinite wait buffer" notices
>>>> were generated (Mar 24 04:20:50
time frame)?
>>>>
>>> It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
>>>  procs    memory        page                       disks      faults        cpu
>>>  r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
>>>  2 0 18 1588824 197676 14108   2   1   0 14759   238   0   0 14055  9263 2668 62 32  5
>>> dT: 1.027s  w: 1.000s
>>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>>>     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0
>>>     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2
>>>     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2a
>>>     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 ufs/rootfs
>>> Mon Mar 23 23:00:28 PDT 2020
>>> Device          1K-blocks     Used    Avail Capacity
>>> /dev/mmcsd0s2b    4404252   245288  4158964     6%
>>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
>>> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
>>>
>>> Admittedly, 300% busy looks pretty bad. Still the machine didn't quit.....
>>> On the next sample %busy went down to 18% for swap, less for all else.
>>
>> As I understand, the ms/w value 13451 means that there was
>> for a time (a mean of?) somewhat under 14 seconds from a
>> write being queued to it being completed.
>>
>> If a read could be blocked for such time frames by such
>> writes, it suggests that vm.pfault_oom_wait might need to
>> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
>> use.
>>
> Ok, I'm starting to get it. On this machine it's 10.
>
> But the 13 second delay appeared at Mon Mar 23 23:00:28 PDT 2020.
> The kill occurred around Tue Mar 24 04:53:08 PDT 2020
> under what look like much more benign circumstances.

I had intended to ask about the text next-to/just-above, not
much earlier [thus, the "(Mar 24 04:20:50 time frame)"].
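[Editor's note: the comparison in the quoted paragraph can be made concrete. A minimal sketch, assuming the worst gstat ms/w figure and the vm.pfault_oom_wait value both quoted above; it just converts milliseconds to seconds and compares.]

```shell
#!/bin/sh
# ms/w from the Mon Mar 23 23:00:28 gstat sample, and this machine's
# vm.pfault_oom_wait at the time (both taken from the thread above).
ms_w=13451
oom_wait=10
wait_ms=$((oom_wait * 1000))
echo "worst observed write latency: ${ms_w} ms (~$((ms_w / 1000)) s)"
if [ "$ms_w" -gt "$wait_ms" ]; then
  echo "one slow write exceeded a single vm.pfault_oom_wait interval"
fi
```

With the default attempts=3, three such intervals (30 s) could still be exceeded by back-to-back delays of this size, which is why raising the wait, or setting attempts to -1, is being weighed.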
So I was asking about:

Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096

and did not originally notice that you replied with
information for "Mon Mar 23 23:00:28 PDT 2020". Other than
showing examples of what the delays can be, the "Mar 24 04:20:50"
material is of more direct relevance for the OOM kill that happened.

I'll show the text around/before 04:20:50 later in this note.
It proves somewhat interesting.

>> Or vm.pfault_oom_attempts=-1 could be used to avoid large
>> time frames from directly leading to OOM activity.
>>
> Understood. I simply forgot to restore the setting after
> the initial troubles with it.
>
>>>>
>>>> Below I show code changes to be more explicit in the
>>>> output messaging about what contributes to initiating
>>>> OOM kills, without needing boot verbose or the like.
>>>> There are also some messages from Mark J.'s old code
>>>> for reporting on things related to initiating OOM
>>>> kills, or some minor variations of them.
>>>>
>>>
>>>> You may want to try such changes to provide more
>>>> context for your OOM kills when they happen. Below,
>>>> the 4 reasons reported are (showing the most
>>>> detailed of the related messages from the different
>>>> stages):
>>>>
>>>> swp_pager_meta_build: swap blk uma zone exhausted
>>>>
>>>> swp_pager_meta_build: swap pctrie uma zone exhausted
>>>>
>>>> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
>>>>
>>>> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
>>>>
>>>>
>>>
>>> Could inquiries instead be added to the logging script?
>>> Right now it's
>>> #!/bin/sh
>>> while true
>>> do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
>>> done
>>>
>>> I'd much rather tamper with the logging script than the kernel 8-)
>>
>> Unfortunately, the reason I made the kernel messaging
>> changes is that, as far as I know, the normal kernel
>> simply does not publish the information anywhere when
>> the internal criteria lead to OOM activity: no
>> inquiry is available without kernel changes.
>
> If all else fails I'll try to apply your patches to the kernel
> and recompile.

Up to you, of course.

The text before the "Mar 24 04:20:50" messages, with notes and
blank lines mixed in to group things:

. . .
Tue Mar 24 04:20:38 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
3/773/776/19177 mbuf clusters in use (current/cache/total/max)

 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  54180 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.031s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
   18      1      0      0    0.0      1     31  26686      0      0    0.0  287.1 mmcsd0
   18      1      0      0    0.0      1     31  26686      0      0    0.0  287.1 mmcsd0s2
   10      1      0      0    0.0      1     31  26686      0      0    0.0  287.1 mmcsd0s2a
   10      1      0      0    0.0      1     31  26686      0      0    0.0  287.1 ufs/rootfs
Tue Mar 24 04:20:40 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
6/770/776/19177 mbuf clusters in use (current/cache/total/max)

Note that the above has 26sec+ ms/w figures.
No reads. No mmcsd0s2b.

 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  54252 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.064s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
Tue Mar 24 04:20:42 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w
figures; just the title rows show.

 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  53668 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.046s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
   17      1      0      0    0.0      1     31  30042      0      0    0.0  320.7 mmcsd0
   17      1      0      0    0.0      1     31  30042      0      0    0.0  320.7 mmcsd0s2
    9      1      0      0    0.0      1     31  30042      0      0    0.0  320.7 mmcsd0s2a
    9      1      0      0    0.0      1     31  30042      0      0    0.0  320.7 ufs/rootfs
Tue Mar 24 04:20:43 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
9/1023/1032/19177 mbuf clusters in use (current/cache/total/max)

Note that the above has 30sec+ ms/w figures. No reads.

Also note that the L(q) figures dropped by only 1 over the
roughly 04:20:40 to 04:20:43 interval (18->17 and 10->9). (I
only take these deltas as suggestive. They might be
accidental near matches.)

No mmcsd0s2b.
 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  53624 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.003s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
Tue Mar 24 04:20:45 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w
figures; just the title rows show.

 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  53600 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.063s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
Tue Mar 24 04:20:47 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
12/1020/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w
figures; just the title rows show.
 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  53592 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.006s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
Tue Mar 24 04:20:48 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w
figures; just the title rows show.

 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 0 0 30 2034696  53672 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
Tue Mar 24 04:20:50 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   403784  4000468     9%
Mar 24 02:50:01 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 623056, size: 16384
Mar 24 04:17:17 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 1033854, size: 4096
15/1017/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, note the lack of any lines with ms/r or ms/w
figures; just the title rows show.
 procs    memory        page                       disks      faults        cpu
 r b w    avm    fre    flt  re  pi  po    fr    sr mm0 da0    in    sy   cs us sy id
 2 0 30 2045028  54160 12942   7   2   4 13432  1374   0   0 14074  8102 2849 65 29  5
dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
    2    329    236   3839   13.0     93   1262   18.6      0      0    0.0   99.4 mmcsd0
    2    329    236   3839   13.2     93   1262   18.7      0      0    0.0  101.1 mmcsd0s2
    1    153    152   2816   12.5      1     32  366.5      0      0    0.0   95.2 mmcsd0s2a
    1    176     84   1023   14.5     92   1230   14.9      0      0    0.0   66.8 mmcsd0s2b
    1    153    152   2816   12.5      1     32  366.6      0      0    0.0   95.3 ufs/rootfs
Tue Mar 24 04:20:51 PDT 2020
Device          1K-blocks     Used    Avail Capacity
/dev/mmcsd0s2b    4404252   296976  4107276     7%
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
0/1032/1032/19177 mbuf clusters in use (current/cache/total/max)

For the above, the lines with ms/r and ms/w figures are back,
but the indefinite wait buffer messages for "Mar 24 04:20:50"
happened too, before the "Tue Mar 24 04:20:51 PDT 2020" above.
I'd guess they are from before the data that the gstat output
is based on.

Also there are lots of reads and some writes in the above
gstat output. mmcsd0s2b is again showing as well.

Note: It looks like ufs/rootfs might be a label identifying
mmcsd0s2a, so they track up to slight time-of-information
differences. I'm guessing that mmcsd0s2b is just a swap/paging
partition.

So it looks like activity for ufs/rootfs may have
blocked/delayed activity for paging/swapping (mmcsd0s2b), at
least for a time, even if it is not directly the cause of the
specific OOM activity.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)