Date: Tue, 24 Mar 2020 13:21:42 -0700
From: Mark Millard <marklmi@yahoo.com>
To: bob prohaska <fbsd@www.zefox.net>
Cc: freebsd-arm@freebsd.org
Subject: Re: Belated out of swap kill on rpi3 at r359216
Message-ID: <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>
In-Reply-To: <20200324185518.GA92311@www.zefox.net>
References: <20200324155753.GA91922@www.zefox.net> <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com> <20200324185518.GA92311@www.zefox.net>
On 2020-Mar-24, at 11:55, bob prohaska <fbsd at www.zefox.net> wrote:

> On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
>>
>> On 2020-Mar-24, at 08:57, bob prohaska <fbsd at www.zefox.net> wrote:
>>
>>> An attempt to buildworld on an rpi3 running r359216 stopped with an
>>> OOMA kill. Sources were at 359264, loader.conf contained
>>> vm.pageout_oom_seq="4096" .
>>
>> What of the value of vm.pfault_oom_attempts ?
>>
>> If vm.pfault_oom_attempts was not -1,
>> what of the value of vm.pfault_oom_wait ?
>>
> bob@www:/usr/src % sysctl vm.pfault_oom_wait
> vm.pfault_oom_wait: 10
> I didn't change it, this must be the default.
> Would it be useful to add something like
> vm.pfault_oom_wait="20"
> to /boot/loader.conf?

I assume that this answer means vm.pfault_oom_attempts was not -1.
But you did not list what it was.

vm.pfault_oom_attempts=-1 is the means of keeping the
vm.pfault_oom_attempts/vm.pfault_oom_wait pair from causing OOM
activity at all. (vm.pfault_oom_wait becomes irrelevant.)

When vm.pfault_oom_attempts != -1, there are potential time-outs
that overall involve:

vm.pfault_oom_attempts * vm.pfault_oom_wait

with tradeoffs in how the total is partitioned between the two factors:

# sysctl -d vm.pfault_oom_attempts
vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling

# sysctl -d vm.pfault_oom_wait
vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler

As I understand, the following cases could each have very different
results, depending on what the latencies are like and such:

vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
vs.
vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
vs.
vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5

As I remember, Konstantin Belousov once asked someone who was having
a repeatable problem to try some alternatives that explored this but,
as I remember, he got no reply to the request. The person might have
just used vm.pfault_oom_attempts=-1 and continued with their primary
activity, for all I know.

vm.pfault_oom_attempts=-1 is only recommended when one is sure that
they will not run out of swap/paging space, if I understand right.

For reference, for other settings:

# sysctl -d vm.pageout_oom_seq
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM

There is also:

# sysctl -d vm.oom_pf_secs
vm.oom_pf_secs:

but it seems to have an empty description. Maybe it reports
vm.pfault_oom_attempts * vm.pfault_oom_wait when
vm.pfault_oom_attempts != -1 . (I've not looked.)
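For illustration, the tradeoff could be expressed in /boot/loader.conf
along the following lines. The values are examples only (taken from the
figures above), not recommendations; which partitioning behaves best
would need experiment:

# Example /boot/loader.conf entries (illustrative values only):

# Disable the page-fault-driven OOM path entirely (only appropriate
# when one is sure swap/paging space will not run out):
vm.pfault_oom_attempts="-1"

# Or keep that path enabled but partition an overall ~20 second
# budget differently (attempts * wait = total seconds):
#vm.pfault_oom_attempts="4"
#vm.pfault_oom_wait="5"

# Make the page daemon's OOM detection much more patient:
vm.pageout_oom_seq="4096"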
>>> A snippet of gstat log suggests the worst congestion in the storage I/O
>>> happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
>>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
>>> L(q) had dropped to one, half a minute later.
>>>
>>> Is the delay in OOMA action to be expected?
>>>
>>> Here's the relevant part of the log, I hope the columns display readably:
>>>
>>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
>>> procs     memory       page                      disks    faults       cpu
>>> r b  w     avm    fre   flt re pi po    fr   sr mm0 da0    in   sy   cs us sy id
>>> 0 0 29 1897320  47312 12851  9  4  5 13330 1669   0   0 14172 8020 3034 65 29  6
>>> dT: 1.056s  w: 1.000s
>>> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>>>   37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7 mmcsd0
>>>   37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3 mmcsd0s2
>>>   34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0 mmcsd0s2a
>>>    3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0 mmcsd0s2b
>>>   34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2 ufs/rootfs
>>> Tue Mar 24 04:52:26 PDT 2020
>>> Device          1K-blocks     Used    Avail Capacity
>>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
>>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
>>
>> How bad were things back when the "indefinite wait buffer" notices
>> were generated (Mar 24 04:20:50 time frame)?
>>
> It looks like _new_ indefinite wait messages started coming at around Mon Mar 23 23:00:28 PDT 2020:
>
> procs     memory       page                      disks    faults       cpu
> r b  w     avm    fre   flt re pi po    fr   sr mm0 da0    in   sy   cs us sy id
> 2 0 18 1588824 197676 14108  2  1  0 14759  238   0   0 14055 9263 2668 62 32  5
> dT: 1.027s  w: 1.000s
> L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
>    9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0
>    9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2
>    6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2a
>    6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 ufs/rootfs
> Mon Mar 23 23:00:28 PDT 2020
> Device          1K-blocks     Used    Avail Capacity
> /dev/mmcsd0s2b    4404252   245288  4158964     6%
> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
>
> Admittedly, 300% busy looks pretty bad. Still, the machine didn't quit.....
> On the next sample %busy went down to 18% for swap, less for all else.

As I understand it, the ms/w value 13451 means that, for a time, there
was (a mean of?) somewhat under 14 seconds from a write being queued to
its being completed. If a read could be blocked for such time frames by
such writes, it suggests that vm.pfault_oom_wait might need to be larger
than 13 if vm.pfault_oom_attempts=-1 is not in use. Alternatively,
vm.pfault_oom_attempts=-1 could be used to keep such large time frames
from directly leading to OOM activity.
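As a quick check of what the current settings imply, a small sh sketch
(my own illustration, using only the sysctls named earlier) could
report the overall budget:

#!/bin/sh
# Sketch: report the rough total seconds a faulting page-in may stall
# before the page fault handler starts OOM handling.
attempts=$(sysctl -n vm.pfault_oom_attempts)
waitsecs=$(sysctl -n vm.pfault_oom_wait)
if [ "$attempts" -eq -1 ]; then
        echo "page-fault OOM handling is disabled (vm.pfault_oom_attempts=-1)"
else
        echo "page-fault OOM budget: about $((attempts * waitsecs)) seconds"
fi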
>> Below I show code changes to be more explicit in the
>> output messaging about what contributes to initiating
>> OOM kills, without needing boot verbose or the like.
>> There are also some messages from Mark J.'s old code
>> for reporting on things related to initiating OOM
>> kills, or some minor variations of them.
>>
>> You may want to try such changes to provide more
>> context for your OOM kills when they happen. Below
>> the 4 reasons reported are (showing the most
>> detailed of the related messages from different
>> stages):
>>
>> swp_pager_meta_build: swap blk uma zone exhausted
>>
>> swp_pager_meta_build: swap pctrie uma zone exhausted
>>
>> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
>>
>> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
>>
> Could inquiries instead be added to the logging script?
> Right now it's
>
> #!/bin/sh
> while true
> do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> done
>
> I'd much rather tamper with the logging script than the kernel 8-)

Unfortunately, the reason I made the kernel messaging changes is that,
as far as I know, the normal kernel simply does not publish the
information anywhere when its internal criteria lead to OOM activity:
no inquiry is available without kernel changes.
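That said, the script could at least also record the page counts that
the kill-context message above reports, for after-the-fact correlation.
A sketch (the added sysctl line is my illustration; it still cannot
show when the kernel's internal OOM criteria actually fire):

#!/bin/sh
while true
do
        vmstat
        gstat -abd -I 1s
        date
        swapinfo
        tail -n 2 /var/log/messages
        netstat -m | grep "mbuf clusters"
        # Added: the counts that feed the OOM kill decision:
        sysctl vm.stats.vm.v_free_count vm.stats.vm.v_inactive_count
done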
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)
