From owner-freebsd-hackers@freebsd.org  Fri Dec  8 16:44:21 2017
Date: Fri, 8 Dec 2017 17:44:10 +0100
From: Peter Holm <peter@holm.cc>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Larry McVoy <lm@mcvoy.com>, freebsd-hackers@freebsd.org
Subject: Re: OOM problem?
Message-ID: <20171208164410.GA85620@x2.osted.lan>
References: <20171208011430.GA16016@mcvoy.com>
 <20171208101543.GC2272@kib.kiev.ua>
 <20171208150121.GH16028@mcvoy.com>
 <20171208153429.GJ2272@kib.kiev.ua>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171208153429.GJ2272@kib.kiev.ua>
User-Agent: Mutt/1.5.23 (2014-03-12)

On Fri, Dec 08, 2017 at 05:34:29PM +0200, Konstantin Belousov wrote:
> On Fri, Dec 08, 2017 at 07:01:21AM -0800, Larry McVoy wrote:
> > On Fri, Dec 08, 2017 at 12:15:43PM +0200, Konstantin Belousov wrote:
> > > > The OOM code kicks in and it behaves poorly.  It doesn't kill any of
> > > > the big processes, those are all sleeping without PCATCH on so they are
> > > > skipped.
> > > What is the proof for this statement?
> > 
> > I let the system run overnight trying to find more memory and it never
> > killed any of the big processes.
> > 
> > I am able to log in and kill -9 would not kill them.
> The wait channel of the stuck process and its kernel backtrace are the
> first things to investigate.
> 
> > 
> > I tried a reboot and that hung.
> > 
> > It took a power cycle to get the machine back.
> > 
> > I've done this multiple times and always get the same result.
> > 
> > > A process waiting for a page in the fault handler must receive the page
> > > to get out of the handler, even if the system is in OOM.  
> > 
> > I may be confusing you because this is not the normal page fault on a file
> > code path (at least I think it is not).  The process is indeed faulting
> > in pages but they are pages that were allocated via whatever malloc calls
> > these days (in SunOS it mmapped /dev/zero, before that it was sbrk(2),
> > I dunno what FreeBSD does, I couldn't find malloc in src/lib, I see that
> > it's jemalloc but /usr/src/lib/libc/stdlib/jemalloc has no files?)
> Backtrace would answer this question easily.
> 
> > 
> > I think we are landing in vm_wait() but I can put some debugging in there
> > and confirm that if that helps.
> There is a special version of vm_wait(), vm_waitpfault(), added originally
> to make it easy to distinguish page faults waiting for a page from other
> unsatisfied page allocations by the name of the wait channel.
> 
> > 
> > > > A) Don't allocate more mem than you have.  This problem exists simply
> > > >    because the system allowed malloc to return more space than the
> > > >    system had.  If the system kept track of all the mem it has (ram
> > > >    plus swap) and when processes asked for an allocation that pushed it
> > > >    over that limit, fail that allocation.  It's yet another globally
> > > >    locked thing (though Jeff's NUMA stuff may make that better), you
> > > >    have to keep track of allocations and frees (as in on exit(2) not
> > > >    free(3)), that's why I think it's detail oriented to do it this way.
> > > >    Probably the right way but has to be done carefully and someone has
> > > >    to care enough to keep watching that this doesn't get broken.
> > > This behaviour can be requested by disabling overcommit.  See tuning(7).
> > > The code might have rotted since it was written, because this feature is
> > > often asked for but rarely used for real.
> > 
> > Seems like that should be on by default, no?
> Of course not. Both program authors and users are accustomed to
> overcommit. I.e., programs freely allocate huge UVA but limit actual
> (faulted in) memory usage, and do fork(2) while owning huge virtual
> allocations. This is common behaviour for language runtimes with
> gc, but other programs also do this.
> 
> > 
> > > > B) Sleep with PCATCH, if that doesn't work, loop sleeping for a period, 
> > > >    wake up and see if you are signaled.  I'm rusty enough that I don't
> > > >    remember if msleep() with PCATCH will catch signals or not (I don't
> > > >    remember a msleep(), that might be a BSD thing and not a SunOS thing).
> > > >    But whatever, either it catches signals or you replace that sleep with
> > > >    a loop that sleeps for a second or so, wakes up and looks to see if it's
> > > >    been signaled and if so dies, else goes back to sleep waiting for pageout
> > > >    and/or OOM to free some mem.
> > > Not exactly this, but something close, was done by the patch I provided to
> > > you already.
> > 
> > I need to double check but I'm pretty sure I'm running with your patch, or
> > at least some version of it.  Doesn't help.  Would it help if I packaged up
> > a test case?  Right now I'm using something like this:
> > 
> >     cd LMbench2+/src
> >     for i in 1 2 3 4 5 6 7 8 9 0
> >     do	../bin/*/lat_mem_rd 25g 4096 &
> >     done
> > 
> > but I could make something simpler.  I'm willing to keep pushing on this
> > if that's helpful but if you'd prefer to debug it yourself I can package
> > up a test case.  Should probably do that anyway.
> Yes, the reproduction case and machine parameters to reproduce would
> allow me to see system state and do additional experiments.  Please send
> the scripts to me and Peter Holm (pho, I Cc:ed him).
> 
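For readers following proposal A in the quoted text (fail any allocation
that would push total reservations past RAM plus swap), here is a minimal
userland sketch of that bookkeeping. It is an illustration only and not
FreeBSD's swap reservation code: struct commit_acct, commit_reserve() and
commit_release() are hypothetical names, and the single unlocked counter
ignores the global locking cost mentioned in the proposal.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical bookkeeping; not taken from the FreeBSD sources. */
struct commit_acct {
	uint64_t limit_pages;		/* RAM plus swap, in pages */
	uint64_t reserved_pages;	/* pages promised to processes so far */
};

/*
 * Reserve npages.  Returns 0 on success, or ENOMEM if the request would
 * exceed the limit, so the allocation fails up front instead of being
 * overcommitted.  reserved_pages never exceeds limit_pages, so the
 * subtraction below cannot underflow.
 */
int
commit_reserve(struct commit_acct *a, uint64_t npages)
{
	if (npages > a->limit_pages - a->reserved_pages)
		return (ENOMEM);
	a->reserved_pages += npages;
	return (0);
}

/* Give back a reservation, e.g. when a mapping goes away or at exit(2). */
void
commit_release(struct commit_acct *a, uint64_t npages)
{
	if (npages > a->reserved_pages)
		npages = a->reserved_pages;	/* clamp a bogus release */
	a->reserved_pages -= npages;
}
```

With a limit of 100 pages, reserving 60 succeeds and a further request for
50 returns ENOMEM until something is released.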

I seem to be able to reproduce this. Unfortunately I did not get a
vmcore. I'll try again.
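
For anyone wanting something simpler than the lat_mem_rd loop quoted
above, the memory pressure it creates can be approximated by allocating a
large anonymous buffer and writing one byte per stride, so that every page
is faulted in. This is only a sketch under that assumption: it is not
lat_mem_rd, it is not the stress test used for this report, and
touch_memory() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate `size` bytes and write one byte every `stride` bytes.  With
 * overcommit enabled malloc(3) usually succeeds even for huge sizes;
 * the memory is only consumed when the writes fault the pages in.
 * Returns the number of writes, or 0 if stride is 0 or malloc fails.
 */
size_t
touch_memory(size_t size, size_t stride)
{
	volatile unsigned char *buf;
	size_t off, touched = 0;

	if (stride == 0)
		return (0);
	buf = malloc(size);
	if (buf == NULL)
		return (0);
	for (off = 0; off < size; off += stride) {
		buf[off] = 1;		/* the write forces a page fault */
		touched++;
	}
	free((void *)buf);
	return (touched);
}
```

Running several processes that each call this with a size larger than
their share of RAM plus swap, and a stride of 4096, should produce a
similar kind of load to the quoted loop.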

https://people.freebsd.org/~pho/stress/log/kostik1067.txt

- Peter