Date: Tue, 24 Mar 2020 13:21:42 -0700
From: Mark Millard <marklmi@yahoo.com>
To: bob prohaska <fbsd@www.zefox.net>
Cc: freebsd-arm@freebsd.org
Subject: Re: Belated out of swap kill on rpi3 at r359216
Message-ID: <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>
In-Reply-To: <20200324185518.GA92311@www.zefox.net>
References: <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net>
On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:

> On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
>>
>> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
>>
>>> An attempt to buildworld on an rpi3 running r359216 stopped with an
>>> OOMA kill. Sources were at 359264, loader.conf contained
>>> vm.pageout_oom_seq="4096".
>>
>> What of the value of vm.pfault_oom_attempts ?
>>
>> If vm.pfault_oom_attempts was not -1,
>> what of the value of vm.pfault_oom_wait ?
>>
> bob@www:/usr/src % sysctl vm.pfault_oom_wait
> vm.pfault_oom_wait: 10
> I didn't change it, so this must be the default.
> Would it be useful to add something like
> vm.pfault_oom_wait="20"
> to /boot/loader.conf?

I assume this answer means that vm.pfault_oom_attempts was not -1,
but you did not list what it was.

vm.pfault_oom_attempts=-1 is the means of preventing the
vm.pfault_oom_attempts/vm.pfault_oom_wait pair from causing OOM
activity at all. (vm.pfault_oom_wait then becomes irrelevant.)

When vm.pfault_oom_attempts != -1, there are potential time-outs
that overall involve:

vm.pfault_oom_attempts * vm.pfault_oom_wait

but with tradeoffs in how the total is partitioned between the
2 factors:

# sysctl -d vm.pfault_oom_attempts
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling

# sysctl -d vm.pfault_oom_wait
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler

As I understand it, the following cases could each have very
different results, depending on what the latencies are like and
such:

vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
vs.
vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
vs.
vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5

As I remember, Konstantin Belousov once asked someone who was having
a repeatable problem to try some alternatives that explored this but,
as I remember, he got no reply to the request. The person might have
just used vm.pfault_oom_attempts=-1 and continued with their primary
activity, for all I know.

vm.pfault_oom_attempts=-1 is only recommended when one is sure that
they will not run out of swap/paging space, if I understand right.

For reference, for other settings:

# sysctl -d vm.pageout_oom_seq
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM

There is also:

# sysctl -d vm.oom_pf_secs
vm.oom_pf_secs:

but it seems to have an empty description. Maybe it reports
vm.pfault_oom_attempts * vm.pfault_oom_wait when
vm.pfault_oom_attempts != -1. (I've not looked.)

>>> A snippet of gstat log suggests the worst congestion in the storage I/O
>>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
>>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
>>> L(q) had dropped to one, half a minute later.
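[As a concrete illustration of the partitioning above, the tunables
can be set from /boot/loader.conf. A sketch follows; the particular
numbers are examples only, not recommendations:]

```
# /boot/loader.conf sketch -- example values only, not recommendations.
# Worst-case page-fault wait before OOM is attempts * wait:
# 4 attempts * 5 seconds = 20 seconds with the values below.
vm.pfault_oom_attempts="4"
vm.pfault_oom_wait="5"
# Or disable this OOM path entirely (only advisable if swap/paging
# space cannot run out):
# vm.pfault_oom_attempts="-1"
# Make the pageout-based OOM detector far more patient as well:
vm.pageout_oom_seq="4096"
```

[The same names can be changed at runtime with sysctl(8), so one can
experiment without rebooting.]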
>>>
>>> Is the delay in OOMA action to be expected?
>>>
>>> Here's the relevant part of the log, I hope the columns display readably:
>>>
>>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
>>> procs    memory       page                        disks    faults       cpu
>>> r b  w    avm     fre   flt re pi po   fr   sr mm0 da0    in   sy   cs us sy id
>>> 0 0 29 1897320  47312 12851  9  4  5 13330 1669   0   0 14172 8020 3034 65 29  6
>>> dT: 1.056s  w: 1.000s
>>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>>>    37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7  mmcsd0
>>>    37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3  mmcsd0s2
>>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0  mmcsd0s2a
>>>     3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0  mmcsd0s2b
>>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2  ufs/rootfs
>>> Tue Mar 24 04:52:26 PDT 2020
>>> Device          1K-blocks     Used    Avail Capacity
>>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>>
>> How bad were things back when the "indefinite wait buffer" notices
>> were generated (Mar 24 04:20:50 time frame)?
>>
> It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
> procs    memory       page                       disks    faults       cpu
> r b  w    avm     fre   flt re pi po   fr  sr mm0 da0    in   sy   cs us sy id
> 2 0 18 1588824 197676 14108  2  1  0 14759 238   0   0 14055 9263 2668 62 32  5
> dT: 1.027s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0
>     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2
>     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  mmcsd0s2a
>     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9  ufs/rootfs
> Mon Mar 23 23:00:28 PDT 2020
> Device          1K-blocks     Used    Avail Capacity
> /dev/mmcsd0s2b    4404252   245288  4158964     6%
> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
>
> Admittedly, 300% busy looks pretty bad. Still, the machine didn't quit.....
> On the next sample %busy went down to 18% for swap, less for all else.

As I understand it, the ms/w value 13451 means that, for a time, there
was (a mean of?) somewhat under 14 seconds from a write being queued to
it being completed. If a read could be blocked for such time frames by
such writes, it suggests that vm.pfault_oom_wait might need to be larger
than 13 if vm.pfault_oom_attempts=-1 is not in use. Alternatively,
vm.pfault_oom_attempts=-1 could be used to keep such large time frames
from directly leading to OOM activity.

>>
>> Below I show code changes to be more explicit in the
>> output messaging about what contributes to initiating
>> OOM kills, without needing boot verbose or the like.
>> There are also some messages from Mark J.'s old code
>> for reporting on things related to initiating OOM
>> kills, or some minor variations of them.
>>
>> You may want to try such changes to provide more
>> context for your OOM kills when they happen.
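[An aside on the gstat samples above: rather than eyeballing the log
for such latencies, one could flag them mechanically. A rough sketch
follows; the file name gstat.log and the 10000 ms threshold are
placeholders, and the field positions assume the 13-column gstat
layout shown above, without the ">" quoting markers:]

```shell
# Print any gstat data line whose ms/w (field 8) exceeds limit_ms.
# Data lines begin with the numeric L(q) field; header, date, vmstat,
# and swapinfo lines do not match the pattern and are skipped.
awk -v limit_ms=10000 '
    $1 ~ /^[0-9]+$/ && NF == 13 && $8 + 0 > limit_ms {
        printf "ms/w %.0f exceeds %d ms on %s\n", $8, limit_ms, $NF
    }
' gstat.log
```

[Anything flagged above vm.pfault_oom_wait * 1000 ms marks a sample
where a single queued write outlasted one page-fault wait interval.]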
>> Below, the 4 reasons reported are (showing the most
>> detailed of the related messages from the different
>> stages):
>>
>> swp_pager_meta_build: swap blk uma zone exhausted
>>
>> swp_pager_meta_build: swap pctrie uma zone exhausted
>>
>> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
>>
>> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
>>
> Could inquiries instead be added to the logging script?
> Right now it's:
>
> #!/bin/sh
> while true
> do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> done
>
> I'd much rather tamper with the logging script than the kernel 8-)

Unfortunately, the reason I made the kernel messaging changes is that,
as far as I know, the normal kernel simply does not publish the
information anywhere when the internal criteria lead to OOM activity:
there is no inquiry available without kernel changes.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)