From owner-freebsd-hackers@freebsd.org Fri Dec 8 16:44:21 2017
Date: Fri, 8 Dec 2017 17:44:10 +0100
From: Peter Holm
To: Konstantin Belousov
Cc: Larry McVoy, freebsd-hackers@freebsd.org
Subject: Re: OOM problem?
Message-ID: <20171208164410.GA85620@x2.osted.lan>
In-Reply-To: <20171208153429.GJ2272@kib.kiev.ua>
References: <20171208011430.GA16016@mcvoy.com>
 <20171208101543.GC2272@kib.kiev.ua>
 <20171208150121.GH16028@mcvoy.com>
 <20171208153429.GJ2272@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On Fri, Dec 08, 2017 at 05:34:29PM +0200, Konstantin Belousov wrote:
> On Fri, Dec 08, 2017 at 07:01:21AM -0800, Larry McVoy wrote:
> > On Fri, Dec 08, 2017 at 12:15:43PM +0200, Konstantin Belousov wrote:
> > > > The OOM code kicks in and it behaves poorly.  It doesn't kill any of
> > > > the big processes, those are all sleeping without PCATCH on so they are
> > > > skipped.
> > > What is the proof for this statement ?
> >
> > I let the system run overnight trying to find more memory and it never
> > killed any of the big processes.
> >
> > I am able to log in and kill -9 would not kill them.
> The wait channel of the stuck process and its kernel backtrace is the
> first step to investigate.
> >
> > I tried a reboot and that hung.
> >
> > It took a power cycle to get the machine back.
> >
> > I've done this multiple times and always get the same result.
> >
> > > A process waiting for a page in the fault handler must receive the page
> > > to get out of the handler, even if the system is in OOM.
> >
> > I may be confusing you because this is not the normal page fault on a file
> > code path (at least I think it is not).
> > The process is indeed faulting
> > in pages but they are pages that were allocated via whatever malloc calls
> > these days (in SunOS it mmapped /dev/zero, before that it was sbrk(2),
> > I dunno what FreeBSD does, I couldn't find malloc in src/lib, I see that
> > it's jemalloc but /usr/src/lib/libc/stdlib/jemalloc has no files?)
> Backtrace would answer this question easily.
> >
> > I think we are landing in vm_wait() but I can put some debugging in there
> > and confirm that if that helps.
> There is a special version of vm_wait(), vm_waitpfault(), done initially
> to easily distinguish page faults waiting for a page vs. other
> unsatisfied page allocations by the name of the wait channel.
>
> >
> > > > A) Don't allocate more mem than you have.  This problem exists simply
> > > >    because the system allowed malloc to return more space than the
> > > >    system had.  If the system kept track of all the mem it has (ram
> > > >    plus swap) and when processes asked for an allocation that pushed it
> > > >    over that limit, fail that allocation.  It's yet another globally
> > > >    locked thing (though Jeff's NUMA stuff may make that better), you
> > > >    have to keep track of allocations and frees (as in on exit(2) not
> > > >    free(3)), that's why I think it's detail oriented to do it this way.
> > > >    Probably the right way but has to be done carefully and someone has
> > > >    to care enough to keep watching that this doesn't get broken.
> > > This behaviour can be requested by disabling overcommit.  See tuning(7).
> > > The code might rot from the time it was done, because this feature is often
> > > asked for, but rarely used for real.
> >
> > Seems like that should be on by default, no?
> Of course not.  Both program authors and users are accustomed to the
> overcommit.  I.e., programs freely allocate huge UVA but limit actual
> (faulted in) memory usage, and do fork(2) while owning huge virtual
> allocations.
> This is a common behaviour for language runtimes with
> gc, but other programs also do this.
>
> >
> > > > B) Sleep with PCATCH, if that doesn't work, loop sleeping for a period,
> > > >    wake up and see if you are signaled.  I'm rusty enough that I don't
> > > >    remember if msleep() with PCATCH will catch signals or not (I don't
> > > >    remember a msleep(), that might be a BSD thing and not a SunOS thing).
> > > >    But whatever, either it catches signals or you replace that sleep with
> > > >    a loop that sleeps for a second or so, wakes up and looks to see if it's
> > > >    been signaled and if so dies, else goes back to sleep waiting for pageout
> > > >    and/or OOM to free some mem.
> > > Not exactly this, but something close, was done by the patch I provided to
> > > you already.
> >
> > I need to double check but I'm pretty sure I'm running with your patch at
> > least some version of it.  Doesn't help.  Would it help if I packaged up
> > a test case?  Right now I'm using something like this:
> >
> >     cd LMbench2+/src
> >     for i in 1 2 3 4 5 6 7 8 9 0
> >     do ../bin/*/lat_mem_rd 25g 4096 &
> >     done
> >
> > but I could make something simpler.  I'm willing to keep pushing on this
> > if that's helpful but if you'd prefer to debug it yourself I can package
> > up a test case.  Should probably do that anyway.
> Yes, the reproduction case and machine parameters to reproduce would
> allow me to see system state and do additional experiments.  Please send
> the scripts to me and Peter Holm (pho, I Cc:ed him).
>

I seem to be able to reproduce this.
Unfortunately I did not get a vmcore. I'll try again.

https://people.freebsd.org/~pho/stress/log/kostik1067.txt

- Peter
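
For readers following the thread, here is a minimal sketch of the strict accounting described under (A) in the quoted text: reservations are charged against ram plus swap, refused once the total would exceed it, and released when a process exits. It is a userland toy model in Python, not the FreeBSD implementation. The class name CommitLedger, its methods, and the 64 GiB ram / 16 GiB swap figures are all assumptions made up for illustration; the thread does not give the machine parameters.

```python
class CommitLedger:
    """Toy model of strict (no-overcommit) memory accounting.

    Hypothetical illustration only; names and sizes are assumptions.
    """

    def __init__(self, ram_bytes, swap_bytes):
        # The most the system could ever back: ram plus swap.
        self.limit = ram_bytes + swap_bytes
        self.committed = 0
        self.by_pid = {}

    def reserve(self, pid, nbytes):
        """Charge an allocation; refuse it if it would exceed the limit."""
        if self.committed + nbytes > self.limit:
            # Fail at allocation time rather than at page-fault time.
            return False
        self.committed += nbytes
        self.by_pid[pid] = self.by_pid.get(pid, 0) + nbytes
        return True

    def exit(self, pid):
        """Release everything a process reserved when it exits."""
        self.committed -= self.by_pid.pop(pid, 0)


GIB = 1 << 30
# Assumed machine size, chosen only so that some requests succeed.
ledger = CommitLedger(ram_bytes=64 * GIB, swap_bytes=16 * GIB)

# Ten processes each asking for 25 GiB, mirroring the lat_mem_rd loop.
results = [ledger.reserve(pid, 25 * GIB) for pid in range(10)]
print(results.count(True), "of", len(results), "reservations granted")
```

With these assumed sizes the later requests are refused up front instead of being left to sleep in the fault path. As noted in the thread, the cost is that programs which reserve large address ranges and touch little of them would also be refused.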