From owner-freebsd-arm@freebsd.org Tue Mar 24 22:46:57 2020
Date: Tue, 24 Mar 2020 15:46:58 -0700
From: bob prohaska <fbsd@www.zefox.net>
To: Mark Millard
Cc: freebsd-arm@freebsd.org, bob prohaska
Subject: Re: Belated out of swap kill on rpi3 at r359216
Message-ID: <20200324224658.GA92726@www.zefox.net>
References: <20200324155753.GA91922@www.zefox.net>
 <83E41A13-6C24-4B56-A837-779044038FBC@yahoo.com>
 <20200324185518.GA92311@www.zefox.net>
 <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>
In-Reply-To: <75CE3C07-8A0A-4D32-84C6-24BEA967447E@yahoo.com>

The logfile being discussed is at

http://www.zefox.net/~fbsd/rpi3/swaptests/r359216/swapscript.osupdate.log

for convenient reference. More replies below.
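For anyone joining mid-thread, the three knobs under discussion can be
checked in one pass with sysctl; the values below are the ones this
machine reported (also quoted further down), not necessarily the
defaults elsewhere:

bob@www:/usr/src % sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
vm.pageout_oom_seq: 4096
vm.pfault_oom_attempts: 3
vm.pfault_oom_wait: 10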
On Tue, Mar 24, 2020 at 01:21:42PM -0700, Mark Millard wrote:
> 
> On 2020-Mar-24, at 11:55, bob prohaska wrote:
> 
> > On Tue, Mar 24, 2020 at 10:09:28AM -0700, Mark Millard wrote:
> >> 
> >> On 2020-Mar-24, at 08:57, bob prohaska wrote:
> >> 
> >>> An attempt to buildworld on an rpi3 running r359216 stopped with an
> >>> OOMA kill. Sources were at 359264, loader.conf contained
> >>> vm.pageout_oom_seq="4096" .
> >> 
> >> What of the value of vm.pfault_oom_attempts ?
> >> 
> >> If vm.pfault_oom_attempts was not -1,
> >> what of the value of vm.pfault_oom_wait ?
> >> 
> > bob@www:/usr/src % sysctl vm.pfault_oom_wait
> > vm.pfault_oom_wait: 10
> > I didn't change it, so this must be the default.
> > Would it be useful to add something like
> > vm.pfault_oom_wait="20"
> > to /boot/loader.conf?
> 
> I assume that this answer means vm.pfault_oom_attempts
> was not -1. But you did not list what it was.
> 

Sorry, the variable names are running together in my head.

bob@www:/usr/src % sysctl vm.pfault_oom_attempts
vm.pfault_oom_attempts: 3

It's now reset to -1; hopefully it'll work better than last time 8-)
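While I'm at it, a note to myself so I don't forget again: the setting
can be made persistent with one line in a config file. A sketch, not
yet verified by a reboot here; I'm assuming the knob is accepted in
either place. In /etc/sysctl.conf (applied by rc at boot):

vm.pfault_oom_attempts=-1

or in /boot/loader.conf, next to the vm.pageout_oom_seq="4096" entry
already there:

vm.pfault_oom_attempts="-1"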
> vm.pfault_oom_attempts=-1 is the means of preventing
> the vm.pfault_oom_attempts/vm.pfault_oom_wait
> pair from causing OOM at all. (vm.pfault_oom_wait
> becomes irrelevant.)
> 
> When vm.pfault_oom_attempts != -1, there are
> potential time-outs that overall involve:
> 
> vm.pfault_oom_attempts * vm.pfault_oom_wait
> 
> but with some tradeoffs in the partitioning between
> the 2 factors:
> 
> # sysctl -d vm.pfault_oom_attempts
> vm.pfault_oom_attempts: Number of page allocation attempts in page fault handler before it triggers OOM handling
> 
> # sysctl -d vm.pfault_oom_wait
> vm.pfault_oom_wait: Number of seconds to wait for free pages before retrying the page fault handler
> 
> As I understand it, the following cases could each have
> very different results depending on what the latencies
> are like and such:
> 
> vm.pfault_oom_attempts==20 && vm.pfault_oom_wait==1
> vs.
> vm.pfault_oom_attempts==1 && vm.pfault_oom_wait==20
> vs.
> vm.pfault_oom_attempts==4 && vm.pfault_oom_wait==5
> 
> As I remember, Konstantin Belousov once asked someone
> who was having a repeatable problem to try some
> alternatives that explored this but, as I remember,
> he got no reply to the request.
> 

Konstantin wrote to both me and the list in a thread on
Re: panic: non-current pmap. If it's relevant, please say so.
He also wrote to Don Lewis in a thread on Re: spurious out of
swap kills, but that was on a "16-thread Ryzen machine, with
64 GB of RAM and 40 GB of swap", which is surely unlike my
predicament 8-)

> The person might have just used vm.pfault_oom_attempts=-1
> and continued with their primary activity, for all I know.
> 
> vm.pfault_oom_attempts=-1 is only recommended when one
> is sure that they will not run out of swap/paging space,
> if I understand right.
> 
> For reference, for other settings:
> 
> # sysctl -d vm.pageout_oom_seq
> vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM
> 
> There is also:
> 
> # sysctl -d vm.oom_pf_secs
> vm.oom_pf_secs:
> 
> but it seems to have an empty description. Maybe
> it reports vm.pfault_oom_attempts * vm.pfault_oom_wait
> when vm.pfault_oom_attempts != -1. (I've not looked.)
> 
> >>> A snippet of the gstat log suggests the worst congestion in the storage
> >>> I/O happened at Tue Mar 24 04:52:26 PDT 2020 with an L(q) of 37, but the
> >>> OOMA kill happened at Tue Mar 24 04:53:04 PDT 2020, by which time the
> >>> L(q) had dropped to one, half a minute later.
> >>> 
> >>> Is the delay in OOMA action to be expected?
> >>> 
> >>> Here's the relevant part of the log, I hope the columns display readably:
> >>> 
> >>> 0/2/2/19177 mbuf clusters in use (current/cache/total/max)
> >>> procs     memory        page                       disks     faults        cpu
> >>> r b w     avm     fre   flt   re  pi  po    fr    sr mm0 da0   in    sy   cs us sy id
> >>> 0 0 29 1897320  47312 12851    9   4   5 13330  1669   0   0 14172  8020 3034 65 29  6
> >>> dT: 1.056s  w: 1.000s
> >>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
> >>>    37    367    323   5463    6.4     45   1511   76.8      0      0    0.0   91.7 mmcsd0
> >>>    37    367    323   5463    6.5     45   1511   76.9      0      0    0.0   93.3 mmcsd0s2
> >>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.0 mmcsd0s2a
> >>>     3    235    212   2254    5.9     23    814   21.5      0      0    0.0   70.0 mmcsd0s2b
> >>>    34    133    111   3209    7.6     22    697  134.7      0      0    0.0   74.2 ufs/rootfs
> >>> Tue Mar 24 04:52:26 PDT 2020
> >>> Device          1K-blocks     Used    Avail Capacity
> >>> /dev/mmcsd0s2b    4404252   224484  4179768     5%
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 21842, size: 12288
> >>> Mar 24 04:20:50 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 25477, size: 4096
> >> 
> >> How bad were things back when the "indefinite wait buffer" notices
> >> were generated (Mar 24 04:20:50 time frame)?
> >> 
> > It looks like _new_ indefinite wait messages started coming at around
> > Mon Mar 23 23:00:28 PDT 2020:
> > procs     memory        page                       disks     faults        cpu
> > r b w     avm     fre   flt   re  pi  po    fr    sr mm0 da0   in    sy   cs us sy id
> > 2 0 18 1588824 197676 14108    2   1   0 14759   238   0   0 14055  9263 2668 62 32  5
> > dT: 1.027s  w: 1.000s
> >  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d  %busy Name
> >     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0
> >     9      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2
> >     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 mmcsd0s2a
> >     6      1      0      0    0.0      1     31  13451      0      0    0.0  326.9 ufs/rootfs
> > Mon Mar 23 23:00:28 PDT 2020
> > Device          1K-blocks     Used    Avail Capacity
> > /dev/mmcsd0s2b    4404252   245288  4158964     6%
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 280324, size: 8192
> > Mar 23 23:00:28 www kernel: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 285903, size: 16384
> > 
> > Admittedly, 300% busy looks pretty bad. Still, the machine didn't quit.....
> > On the next sample %busy went down to 18% for swap, less for all else.
> 
> As I understand it, the ms/w value 13451 means that there was
> for a time (a mean of?) somewhat under 14 seconds from a
> write being queued to it being completed.
> 
> If a read could be blocked for such time frames by such
> writes, it suggests that vm.pfault_oom_wait might need to
> be larger than 13 if vm.pfault_oom_attempts=-1 is not in
> use.
> 

Ok, I'm starting to get it. On this machine vm.pfault_oom_wait is 10.
But the 13-second delay appeared at Mon Mar 23 23:00:28 PDT 2020, and
the kill occurred around Tue Mar 24 04:53:08 PDT 2020, under what look
like much more benign circumstances.

> Or vm.pfault_oom_attempts=-1 could be used to avoid large
> time frames from directly leading to OOM activity.
> 

Understood. I simply forgot to restore the setting after the initial
troubles with it.
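Putting numbers on the product you described, with this machine's old
settings (back-of-envelope only, and assuming the product really is
the rough overall budget):

vm.pfault_oom_attempts * vm.pfault_oom_wait = 3 * 10 = 30 seconds

A page-in stalled behind writes like the 13451 ms/w sample above would
eat almost half of that in a single wait, so it at least seems
plausible that a few back-to-back stalls could exhaust the budget even
though gstat looked benign again by the time the kill was logged.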
> >> Below I show code changes to be more explicit in the
> >> output messaging about what contributes to initiating
> >> OOM kills, without needing boot verbose or the like.
> >> There are also some messages from Mark J.'s old code
> >> for reporting on things related to initiating OOM
> >> kills, or some minor variations of them.
> >> 
> >> You may want to try such changes to provide more
> >> context for your OOM kills when they happen. Below,
> >> the 4 reasons reported are listed (showing the most
> >> detailed of the related messages from the different
> >> stages):
> >> 
> >> swp_pager_meta_build: swap blk uma zone exhausted
> >> 
> >> swp_pager_meta_build: swap pctrie uma zone exhausted
> >> 
> >> vm_fault_allocate: proc %d (%s) failed to alloc page on fault, starting OOM
> >> 
> >> vm_pageout_mightbe_oom: kill context: v_free_count: %u, v_inactive_count: %u
> >> 
> > 
> > Could inquiries instead be added to the logging script?
> > Right now it's
> > 
> > #!/bin/sh
> > while true
> > do vmstat ; gstat -abd -I 1s ; date ; swapinfo ; tail -n 2 /var/log/messages ; netstat -m | grep "mbuf clusters"
> > done
> > 
> > I'd much rather tamper with the logging script than the kernel 8-)
> 
> Unfortunately, the reason I made the kernel messaging
> changes is that, as far as I know, the normal kernel
> simply does not publish the information anywhere when
> the internal criteria lead to OOM activity: no
> inquiry is available without kernel changes.

If all else fails I'll try to apply your patches to the kernel and
recompile.

With my thanks,
bob prohaska
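P.S. As a stopgap on the script side, the most I can think to add
without kernel changes is to record the knob settings alongside each
sample, so the log at least shows what was in effect. A sketch,
untested as yet:

#!/bin/sh
# Log one block of statistics per pass, forever.
while true
do
    vmstat                              # memory and paging summary
    gstat -abd -I 1s                    # disk queue depth and latency
    date                                # timestamp for the sample
    swapinfo                            # swap usage
    tail -n 2 /var/log/messages         # latest kernel messages
    netstat -m | grep "mbuf clusters"   # mbuf cluster usage
    # New: record the OOM-related knobs in effect at sample time.
    sysctl vm.pageout_oom_seq vm.pfault_oom_attempts vm.pfault_oom_wait
done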