Date: Fri, 8 Dec 2017 17:34:29 +0200
From: Konstantin Belousov
To: Larry McVoy
Cc: freebsd-hackers@freebsd.org, pho@freebsd.org
Subject: Re: OOM problem?
Message-ID: <20171208153429.GJ2272@kib.kiev.ua>
References: <20171208011430.GA16016@mcvoy.com> <20171208101543.GC2272@kib.kiev.ua> <20171208150121.GH16028@mcvoy.com>
In-Reply-To: <20171208150121.GH16028@mcvoy.com>
List-Id: Technical Discussions relating to FreeBSD

On Fri, Dec 08, 2017 at 07:01:21AM -0800, Larry McVoy wrote:
> On Fri, Dec 08, 2017 at 12:15:43PM +0200, Konstantin Belousov wrote:
> > > The OOM code kicks in and it behaves poorly. It doesn't kill any of
> > > the big processes, those are all sleeping without PCATCH on so they are
> > > skipped.
> > What is the proof for this statement ?
>
> I let the system run overnight trying to find more memory and it never
> killed any of the big processes.
>
> I am able to log in and kill -9 would not kill them.
The wait channel of the stuck process and its kernel backtrace is the
first step to investigate.
>
> I tried a reboot and that hung.
>
> It took a power cycle to get the machine back.
>
> I've done this multiple times and always get the same result.
>
> > A process waiting for a page in the fault handler must receive the page
> > to get out of the handler, even if the system is in OOM.
>
> I may be confusing you because this is not the normal page fault on a file
> code path (at least I think it is not).
> The process is indeed faulting
> in pages but they are pages that were allocated via whatever malloc calls
> these days (in SunOS it mmapped /dev/zero, before that it was sbrk(2),
> I dunno what FreeBSD does, I couldn't find malloc in src/lib, I see that
> it's jemalloc but /usr/src/lib/libc/stdlib/jemalloc has no files?)
Backtrace would answer this question easily.
>
> I think we are landing in vm_wait() but I can put some debugging in there
> and confirm that if that helps.
There is a special version of vm_wait(), vm_waitpfault(), done initially
to easily distinguish page faults waiting for a page vs. other
unsatisfied page allocations by the name of the wait channel.
>
> > > A) Don't allocate more mem than you have. This problem exists simply
> > > because the system allowed malloc to return more space than the
> > > system had. If the system kept track of all the mem it has (ram
> > > plus swap) and when processes asked for an allocation that pushed it
> > > over that limit, fail that allocation. It's yet another globally
> > > locked thing (though Jeff's NUMA stuff may make that better), you
> > > have to keep track of allocations and frees (as in on exit(2) not
> > > free(3)), that's why I think it's detail oriented to do it this way.
> > > Probably the right way but has to be done carefully and someone has
> > > to care enough to keep watching that this doesn't get broken.
> > This behaviour can be requested by disabling overcommit. See tuning(7).
> > The code might rot from the time it was done, because this feature often
> > asked for, but rarely used for real.
>
> Seems like that should be on by default, no?
Of course not. Both program authors and users are accustomed to the
overcommit. I.e., programs freely allocate huge UVA but limit actual
(faulted in) memory usage, and do fork(2) while owning huge virtual
allocations. This is a common behaviour for language runtimes with gc,
but other programs also do this.
>
> > > B) Sleep with PCATCH, if that doesn't work, loop sleeping for a period,
> > > wake up and see if you are signaled. I'm rusty enough that I don't
> > > remember if msleep() with PCATCH will catch signals or not (I don't
> > > remember a msleep(), that might be a BSD thing and not a SunOS thing).
> > > But whatever, either it catches signals or you replace that sleep with
> > > a loop that sleeps for a second or so, wakes up and looks to see if it's
> > > been signaled and if so dies, else goes back to sleep waiting for pageout
> > > and/or OOM to free some mem.
> > Not exactly this, but something close, was done by the patch I provided to
> > you already.
>
> I need to double check but I'm pretty sure I'm running with your patch at
> least some version of it. Doesn't help. Would it help if I packaged up
> a test case? Right now I'm using something like this:
>
> cd LMbench2+/src
> for i in 1 2 3 4 5 6 7 8 9 0
> do ../bin/*/lat_mem_rd 25g 4096 &
> done
>
> but I could make something simpler. I'm willing to keep pushing on this
> if that's helpful but if you'd prefer to debug it yourself I can package
> up a test case. Should probably do that anyway.
Yes, the reproduction case and machine parameters to reproduce would
allow me to see system state and do additional experiments. Please send
the scripts to me and Peter Holm (pho, I Cc:ed him).

On Fri, Dec 08, 2017 at 07:03:33AM -0800, Larry McVoy wrote:
> On Fri, Dec 08, 2017 at 12:16:58PM +0200, Konstantin Belousov wrote:
> > On Fri, Dec 08, 2017 at 08:18:21AM +0000, Johannes Lundberg wrote:
> > > Regarding potential oom overhaul. Personally I like the idea of an oom
> > > signal. The idea comes from iOS where applications get a callback when
> > > system memory is low and they're given a chance to free unused
> > > resources or resources that can easily be recreated, before getting
> > > killed completely.
> > The OOM signal is a topic which was discussed to death many times before.
> > The summary is that it does not work, because you need to provide pages
> > for userspace to be able to handle the signal.
>
> Just for the record, what I was proposing wasn't as ambitious as what
> Johannes suggested (while I like his idea it's "weird" and it's unlikely
> that Firefox et al would use it unless we got Linux to have the same
> thing).
>
> I was just suggesting that processes sleeping in vm_wait() wake up once
> in a while to respect signals, as in, if I kill -9 that process I want it
> to go away. Currently, it doesn't.
This cannot work. Currently vm_fault() must either call pmap_enter() to
install a pte into the page table, pointing to the proper page, or return
an error. An error must be returned only for the actual cause, i.e. we
should not return a code (similar to EFAULT, but it is a Mach error, not
an errno) when we have some transient problem unrelated to the process
address map.

The reason is that vm_fault() handles not only page faults from
userspace, but also kernel accesses. The caller of vm_fault() might not
be the trap() routine which handles faults, but other kernel code like
uiomove(9) called from a subsystem. In other words, a signal might be
impossible to deliver (e.g. by terminating the process) in the context
which called vm_fault().

So even if we detect a signal in vm_waitpfault(), we still must allocate
the page. And if we must allocate it, there is no point in checking for
signals. We already speed up the allocation if we noted that the process
was killed.