From owner-freebsd-arch@freebsd.org  Thu Nov 30 18:49:23 2017
Date: Thu, 30 Nov 2017 10:49:23 -0800
From: Larry McVoy <lm@mcvoy.com>
To: Warner Losh <imp@bsdimp.com>
Cc: Larry McVoy <lm@mcvoy.com>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Scott Long <scottl@netflix.com>, Kevin Bowling <kbowling@llnw.com>,
 Drew Gallatin <gallatin@netflix.com>
Subject: Re: small patch for pageout. Comments?
Message-ID: <20171130184923.GA30262@mcvoy.com>
References: <20171130173424.GA811@mcvoy.com>
 <CANCZdfqL9ZsKTfFi+vsCTh3yaNjtwaYYY3fvivdbNybBnujawg@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <CANCZdfqL9ZsKTfFi+vsCTh3yaNjtwaYYY3fvivdbNybBnujawg@mail.gmail.com>
User-Agent: Mutt/1.5.24 (2015-08-30)
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>

On Thu, Nov 30, 2017 at 11:37:35AM -0700, Warner Losh wrote:
> On Thu, Nov 30, 2017 at 10:34 AM, Larry McVoy <lm@mcvoy.com> wrote:
> 
> > In a recent NUMA meeting that Scott called, Jeff suggested a small
> > patch to the pageout daemon (included below).
> >
> > The difference it makes for me is rather dramatic.  If I arrange to
> > thrash the crap out of memory, without this patch the kernel is so
> > borked with all the processes in disk wait that I can't kill them
> > and I can't reboot; my only option is to power off.
> >
> > With the patch there is still some borkage: the kernel randomly
> > kills processes when it runs out of memory.  It should kill one of my
> > processes, since they are causing the problem, but it doesn't; it
> > killed random stuff like dhclient, getty (logged me out), etc.
> >
> > But the system is responsive.
> >
> > What the patch does is say "if we have more than one core, don't sleep
> > in pageout, just keep running until we've freed enough mem".
> >
> > Comments?
> >
> 
> Just to confirm why this patch works.
> 
> For UP systems, we have to pause here to allow work to complete; otherwise
> we can't switch to the threads that complete the I/Os. For MP, however, we
> can continue to schedule more work because that work can be completed on
> other CPUs. This parallelism greatly increases the pageout rate, allowing
> the system to keep up better when some ass-hat process (or processes) is
> thrashing memory.

Yep.
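
For anyone skimming the thread, the reasoning above boils down to
something like the following. This is only a toy user-space model and
not the patch itself: the function and its parameters (ncpus,
pages_freed, pages_target) are hypothetical names picked for
illustration and do not correspond to real kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model of the sleep decision discussed above; not the real patch.
 * All names here are hypothetical and chosen for illustration only.
 */
static bool
pageout_should_sleep(int ncpus, long pages_freed, long pages_target)
{
	assert(ncpus >= 1);

	/* Target met: nothing left to do, so go back to sleep. */
	if (pages_freed >= pages_target)
		return (true);

	/*
	 * UP: the daemon has to yield so that the threads which
	 * complete the queued I/O get a chance to run.
	 */
	if (ncpus == 1)
		return (true);

	/* MP: other CPUs can complete the I/O, so keep scanning. */
	return (false);
}
```

With one CPU and an unmet target the model says sleep; with four CPUs
and an unmet target it says keep going.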

> I'm pretty sure the UP case was also designed to not flood the lower layers
> with work, starving other consumers. Does this result in undue flooding, and
> would we get better results if we could schedule up to the right amount of
> work rather than flooding in the MP case?

I dunno if there is a "right amount".  I could make it a little smarter by
keeping track of how many pages we freed and sleeping if we freed none in a
scan (which seems really unlikely).
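
A rough sketch of what that "little smarter" variant might look like,
again as a toy user-space model rather than kernel code. The names
(ncpus, freed_this_scan, shortfall) are hypothetical, and treating
"freed nothing in a scan" as the only back-off condition on MP is an
assumption taken from the paragraph above, not a tested design.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: on MP, keep scanning as long as the last scan made
 * progress, and back off only when a scan freed nothing.
 * All names are hypothetical; this is not the real pageout code.
 */
static bool
pageout_should_back_off(int ncpus, long freed_this_scan, long shortfall)
{
	assert(ncpus >= 1);
	assert(freed_this_scan >= 0);

	/* No shortfall left: we are done, sleep as usual. */
	if (shortfall <= 0)
		return (true);

	/* UP: always pause so the I/O completion threads can run. */
	if (ncpus == 1)
		return (true);

	/* MP: sleep only if the whole scan freed no pages at all. */
	return (freed_this_scan == 0);
}
```

The point of the extra check is just to avoid spinning when a scan
gets nowhere; it does not try to pick a "right amount" of work.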

All I know for sure is that without this you can lock up the system to
the point it takes a power cycle to unwedge it.  With this the system
is responsive.

Rather than worrying about the smartness, I'd argue this is an improvement:
ship it, and then I can go look at how the system decides to kill processes
(because that's currently busted).