Date: Thu, 7 Dec 2017 17:14:30 -0800
From: Larry McVoy <lm@mcvoy.com>
To: freebsd-hackers@freebsd.org
Subject: OOM problem?

Hi hackers,

I've been playing around on a box that Netflix loaned me; I'm thinking
about novel ways to deal with NUMA issues.

I ran into a problem with the kernel and wanted to check in and see if
anyone cares (I've got a couple of different ways that it could be fixed,
but if no one cares I'll drop it).  It's sort of an ugly problem in that
when it happens your only recourse is to power cycle the machine; you
can't kill off the processes causing the problem.

I was trying to create benchmarks that would show what the system could do
if you locked things down to different NUMA domains (BTW, the NUMA stuff
is a complete red herring; the problem I'm about to describe also happens
if NUMA support isn't enabled).

The machine is running 12.0-CURRENT FreeBSD 12.0-CURRENT #13 ce7b9882181
with a few diffs I did for debugging and a tweak to the pageout daemon
suggested by Jeff.  It is a machine with 256GB of RAM, configured with no
swap space (that detail is important).

I created a set of 10 processes that malloced 25GB each and read it
repeatedly.  That was enough memory pressure to use up all of free mem.
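
A minimal sketch of what each of those processes does, for anyone who
wants to reproduce this.  The helper name touch_region() and the 4096-byte
page size are assumptions for illustration, not the exact test that was run:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate `bytes`, write one byte per page so every
 * page is really backed by RAM, then read one byte per page `passes` times.
 * Stores a checksum in *sum so the reads are not optimized away.
 * Returns 0 on success, -1 if malloc() fails.
 */
int touch_region(size_t bytes, int passes, uint64_t *sum)
{
    const size_t page = 4096;              /* assumed page size */
    volatile unsigned char *p = malloc(bytes);

    if (p == NULL)
        return -1;
    *sum = 0;
    for (size_t i = 0; i < bytes; i += page)
        p[i] = 1;                          /* fault each page in */
    for (int n = 0; n < passes; n++)
        for (size_t i = 0; i < bytes; i += page)
            *sum += p[i];                  /* read it repeatedly */
    free((void *)p);
    return 0;
}
```

Ten copies of this, each with bytes set to 25GB and a large pass count,
correspond to the setup described above.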

Here is the problem.  All of these "misbehaved" (by using lots of ram)
processes go to sleep, I believe in vm_wait().  They are all waiting
for more ram so the pageout daemon is kicked, but to no avail: all the
ram is tied up in the processes that want more ram.  The pageout daemon
kicks out what it can but it quickly gets to the point that it scans
everything and finds nothing (I know this because I added debugging to
show that's what it is doing).

The OOM code kicks in and it behaves poorly.  It doesn't kill any of
the big processes; those are all sleeping without PCATCH on so they are
skipped.  The OOM code starts killing off anything it can find; it was
killing getty, ssh, bash, dhclient.  One buglet is that, in my opinion,
it finds stuff to kill that it probably shouldn't.  Anything that init
will respawn is fine; anything that would not be respawned should be
run as not killable.  Seems like an audit of those processes might be
in order.

I know that you'll ask: why no swap?  Just add swap and the problem
goes away.  Does it?  I don't think so, that's just kicking the can
down the road.  If we add 256GB of swap, now we have a 512GB bag to fill;
fill that and I think we're right back to where we started.

What are the ideas for fixing it?  I've got two.  I think the first
one is a bit hard to get right and I'm not sure if the second one will
work (sorry, it's been a long time since I was a kernel hack, like SunOS
4.x long time).

A) Don't allocate more mem than you have.  This problem exists simply
   because the system allowed malloc to return more space than the
   system had.  The system would keep track of all the mem it has (ram
   plus swap) and, when a process asked for an allocation that pushed it
   over that limit, fail that allocation.  It's yet another globally
   locked thing (though Jeff's NUMA stuff may make that better), and you
   have to keep track of allocations and frees (as in on exit(2), not
   free(3)); that's why I think it's detail oriented to do it this way.
   Probably the right way, but it has to be done carefully and someone has
   to care enough to keep watching that this doesn't get broken.
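
   The accounting in A could look roughly like the sketch below.  This is
   a userland model only; struct mem_budget, budget_reserve() and
   budget_release() are made-up names, not anything in the FreeBSD tree,
   and it uses a compare-and-swap loop instead of a global lock:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical global accounting state; none of these names are FreeBSD's. */
struct mem_budget {
    uint64_t limit;               /* ram plus swap, in bytes */
    _Atomic uint64_t committed;   /* bytes promised to processes so far */
};

/* Try to reserve `bytes`; fail the allocation if it would exceed the limit. */
bool budget_reserve(struct mem_budget *b, uint64_t bytes)
{
    uint64_t cur = atomic_load(&b->committed);

    do {
        if (bytes > b->limit - cur)   /* would push us over: refuse */
            return false;
        /* on CAS failure `cur` is reloaded and the check is repeated */
    } while (!atomic_compare_exchange_weak(&b->committed, &cur, cur + bytes));
    return true;
}

/* Give a reservation back, e.g. when the process exits. */
void budget_release(struct mem_budget *b, uint64_t bytes)
{
    atomic_fetch_sub(&b->committed, bytes);
}
```

   The hard part is not this counter but making sure every path that
   allocates or frees actually goes through it.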

B) Sleep with PCATCH; if that doesn't work, loop sleeping for a period,
   wake up and see if you are signaled.  I'm rusty enough that I don't
   remember if msleep() with PCATCH will catch signals or not (I don't
   remember a msleep(), that might be a BSD thing and not a SunOS thing).
   But whatever, either it catches signals or you replace that sleep with
   a loop that sleeps for a second or so, wakes up and looks to see if it's
   been signaled and if so dies, else goes back to sleep waiting for pageout
   and/or OOM to free some mem.
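
   The fallback loop in B, modeled in userland.  This is only a sketch of
   the control flow, not kernel code: wait_for_memory(), watch_signal()
   and never_available() are hypothetical names, and nanosleep() stands in
   for whatever timed sleep the kernel would really use:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <time.h>

static volatile sig_atomic_t got_signal;

/* Signal handler: just record that a signal arrived. */
static void note_signal(int sig)
{
    (void)sig;
    got_signal = 1;
}

/* Arrange for `signo` to set the flag instead of being lost. */
int watch_signal(int signo)
{
    struct sigaction sa = {0};

    sa.sa_handler = note_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Stand-in predicate for the hang described above: memory never frees up. */
bool never_available(void)
{
    return false;
}

/*
 * Sleep in short intervals; before each sleep, check for a signal first and
 * for free memory second.  Returns 0 once mem_available() is true, EINTR if
 * signaled, EAGAIN after max_rounds sleeps.
 */
int wait_for_memory(bool (*mem_available)(void), long interval_ns,
    int max_rounds)
{
    struct timespec ts = { interval_ns / 1000000000L,
                           interval_ns % 1000000000L };

    for (int i = 0; i < max_rounds; i++) {
        if (got_signal)
            return EINTR;       /* caller can now die instead of hanging */
        if (mem_available())
            return 0;
        nanosleep(&ts, NULL);   /* an early wakeup just re-runs the checks */
    }
    return EAGAIN;
}
```

   The point is that the sleeper gets a chance to notice the signal on every
   wakeup, so the OOM code has something it can actually kill.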

I kinda like B better because it seems harder to have that approach bit rot.
I'm wondering if anyone cares about this problem.  If no, fine.  If yes,
I can cons up a test case and hand that off to someone who wants to fix
the problem.  If no one wants to fix it, I'll give it a try, but I'd like
feedback on the above approaches; I'm not interested in going down a
rathole for no good reason.

Thanks,

--lm