From: Larry McVoy
To: freebsd-hackers@freebsd.org
Date: Thu, 7 Dec 2017 17:14:30 -0800
Subject: OOM problem?
Message-ID: <20171208011430.GA16016@mcvoy.com>

Hi hackers,

I've been playing around on a box that Netflix loaned me; I'm thinking about novel ways to deal with NUMA issues. I ran into a problem with the kernel and wanted to check in and see if anyone cares (I've got a couple of different ways that it could be fixed, but if no one cares I'll drop it). It's sort of an ugly problem in that when it happens your only recourse is to power cycle the machine; you can't kill off the processes causing the problem.

I was trying to create benchmarks that would show what the system could do if you locked things down to different NUMA domains (BTW, the NUMA stuff is a complete red herring, the problem I'm about to describe happens even if NUMA support isn't enabled).

The machine is running FreeBSD 12.0-CURRENT #13 ce7b9882181 with a few diffs I did for debugging and a tweak to the pageout daemon suggested by Jeff. It is a machine with 256GB of RAM, configured with no swap space (that detail is important). I created a set of 10 processes that malloced 25GB each and read it repeatedly. That was enough memory pressure to use up all of free mem.

Here is the problem. All of these "misbehaved" (by using lots of ram) processes go to sleep, I believe in vm_wait(). They are all waiting for more ram, so the pageout daemon is kicked, but to no avail: all the ram is tied up in the processes that want more ram. The pageout daemon kicks out what it can, but it quickly gets to the point that it scans everything and finds nothing (I know this because I added debugging to show that's what it is doing).

The OOM code kicks in and it behaves poorly. It doesn't kill any of the big processes; those are all sleeping without PCATCH on, so they are skipped. The OOM code starts killing off anything it can find: it was killing getty, ssh, bash, dhclient.

One buglet is that, in my opinion, it finds stuff to kill that it probably shouldn't. Anything that init will respawn is fine; anything that would not be respawned should be run as not killable. Seems like an audit of those processes might be in order.

I know that you'll ask why no swap? Just add swap and the problem goes away. Does it? I don't think so, that's just kicking the can down the road. If we add 256GB of swap, now we have a 512GB bag to fill; fill that and I think we're right back to where we started.

What are the ideas for fixing it? I've got two. I think the first one is a bit hard to get right and I'm not sure if the second one will work (sorry, it's been a long time since I was a kernel hack, like SunOS 4.x long time).

A) Don't allocate more mem than you have. This problem exists simply because the system allowed malloc to return more space than the system had.
The system would keep track of all the mem it has (ram plus swap), and when a process asks for an allocation that would push it over that limit, fail that allocation. It's yet another globally locked thing (though Jeff's NUMA stuff may make that better), and you have to keep track of allocations and frees (as in on exit(2), not free(3)); that's why I think doing it this way is detail-oriented work. Probably the right way, but it has to be done carefully and someone has to care enough to keep watching that this doesn't get broken.

B) Sleep with PCATCH; if that doesn't work, loop sleeping for a period, wake up and see if you are signaled. I'm rusty enough that I don't remember if msleep() with PCATCH will catch signals or not (I don't remember a msleep(), that might be a BSD thing and not a SunOS thing). But whatever: either it catches signals, or you replace that sleep with a loop that sleeps for a second or so, wakes up and looks to see if it's been signaled, and if so dies, else goes back to sleep waiting for pageout and/or OOM to free some mem.

I kinda like B better because it seems harder to have that approach bit rot.

I'm wondering if anyone cares about this problem. If no, fine. If yes, I can cons up a test case and hand that off to someone who wants to fix the problem. If no one wants to fix it, I'll give it a try, but I'd like feedback on the above approaches; I'm not interested in going down a rathole for no good reason.

Thanks,

--lm
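
For illustration only, the bookkeeping behind approach A might look roughly like the userland sketch below. It is not FreeBSD code: the names commit_ledger, ledger_init, ledger_reserve and ledger_release are made up for this example, and it uses a C11 atomic compare-and-swap instead of the global lock mentioned above. It only shows the "fail the allocation instead of overcommitting" rule.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ledger: everything the system may hand out, in bytes. */
struct commit_ledger {
    uint64_t limit;              /* ram plus swap */
    _Atomic uint64_t committed;  /* bytes promised to processes so far */
};

static void ledger_init(struct commit_ledger *l, uint64_t ram, uint64_t swap)
{
    l->limit = ram + swap;
    atomic_init(&l->committed, 0);
}

/*
 * Try to reserve `bytes`. Returns false instead of letting the total
 * go over the limit, so the caller can fail the allocation up front.
 */
static bool ledger_reserve(struct commit_ledger *l, uint64_t bytes)
{
    uint64_t cur = atomic_load(&l->committed);

    do {
        /* committed never exceeds limit, so the subtraction cannot wrap */
        if (bytes > l->limit - cur)
            return false;
        /* on CAS failure `cur` is refreshed and the check is repeated */
    } while (!atomic_compare_exchange_weak(&l->committed, &cur, cur + bytes));

    return true;
}

/* Give the reservation back, e.g. when the owning process exits. */
static void ledger_release(struct commit_ledger *l, uint64_t bytes)
{
    atomic_fetch_sub(&l->committed, bytes);
}
```

With a 256GB limit and no swap, ten 25GB reservations still succeed under this scheme (250GB is under the limit) and an eleventh is refused. A real limit would presumably have to be lower than raw RAM to account for whatever the kernel itself is holding, which is part of why this approach needs care.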