From owner-freebsd-current@FreeBSD.ORG  Thu Oct 16 05:56:44 2014
From: "Justin T. Gibbs" <gibbs@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Subject: OOM killer and kernel cache reclamation rate limit in
 vm_pageout_scan()
Date: Wed, 15 Oct 2014 23:56:33 -0600
Message-Id: <C64FB06B-AC9D-4A84-9CBB-8ED45CA6A315@FreeBSD.org>
To: freebsd-current@freebsd.org
Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\))
X-Mailer: Apple Mail (2.1878.6)
Cc: alc@FreeBSD.org, Andriy Gapon <avg@freebsd.org>

avg pointed out the rate limiting code in vm_pageout_scan() during the
discussion of PR 187594.  While it certainly can contribute to the
problems discussed in that PR, a bigger problem is that it can allow the
OOM killer to be triggered even though there is plenty of reclaimable
memory available in the system.  Any load that can consume enough pages
within the polling interval to hit the v_free_min threshold (e.g.
multiple 'dd if=/dev/zero of=/file/on/zfs') can make this happen.
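
To make the failure mode concrete, here is a toy user-space model of it.
This is only a sketch under stated assumptions, not the kernel code: the
struct, the function name, and the idea that cache reclamation is
attempted at most once per fixed period are all hypothetical stand-ins
for the real logic in vm_pageout_scan().

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; names and numbers are hypothetical, not kernel code. */
struct toy_vm {
	int64_t free_pages;	/* pages immediately available */
	int64_t cache_pages;	/* reclaimable kernel cache (e.g. ZFS) */
	int64_t free_min;	/* stand-in for the v_free_min threshold */
	int last_reclaim;	/* tick of the last cache reclaim */
	int period;		/* minimum ticks between cache reclaims */
};

/*
 * Advance one tick: the load turns `demand` free pages into cache, then
 * the scan may shrink the cache, but only if the rate limit allows it.
 * Returns true when the model would declare OOM: free memory is under
 * the threshold and the rate limit blocked the only available reclaim.
 */
static bool
toy_vm_tick(struct toy_vm *vm, int now, int64_t demand, bool rate_limited)
{
	vm->free_pages -= demand;
	vm->cache_pages += demand;
	if (vm->free_pages >= vm->free_min)
		return (false);
	if (rate_limited && now - vm->last_reclaim < vm->period)
		return (true);	/* reclaimable cache exists, yet OOM fires */
	/* No limit in the way: hand the whole cache back to the free pool. */
	vm->free_pages += vm->cache_pages;
	vm->cache_pages = 0;
	vm->last_reclaim = now;
	return (false);
}
```

With a burst that arrives inside the reclaim period, the rate-limited
model reports OOM while all of the consumed pages are still sitting in
the cache; the same burst with the limit disabled is absorbed.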

The product I'm working on does not have swap configured and treats any
OOM trigger as fatal, so it is very obvious when this happens. :-)

I've tried several things to mitigate the problem.  The first was to
ignore rate limiting for pass 2.  However, even though ZFS is then
guaranteed to receive some feedback prior to OOM being declared, my
testing showed that a trivial load (a couple of dd operations) could
still consume enough of the reclaimed space to leave the system below
its target at the end of pass 2.  After removing the rate limiting
entirely, I've so far been unable to kill the system via a ZFS-induced
load.

I understand the motivation behind the rate limiting, but the current
implementation seems too simplistic to be safe.  The documentation for
the Solaris slab allocator provides good motivation for its approach of
using a "sliding average" to rein in temporary bursts of usage without
unduly harming efficient service for the recorded steady-state memory
demand.  Regardless of the approach taken, I believe that the OOM killer
must be a last resort and shouldn't be called when there are caches that
can be culled.
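
One way to read the "sliding average" idea is as a smoothed estimate of
each cache's per-interval demand, with everything above that estimate
offered up on a shrink request.  The sketch below is an illustration of
that reading only.  It is not Solaris or FreeBSD code, and the struct,
the function names, and the power-of-two weighting are all assumptions
made for the example.

```c
#include <stdint.h>

/* Hypothetical per-cache demand tracker; not FreeBSD or Solaris code. */
struct demand_avg {
	uint64_t avg_pages;	/* smoothed pages consumed per interval */
	unsigned shift;		/* each new sample carries weight 1/2^shift */
};

/* Fold one interval's observed demand into the running average. */
static void
demand_avg_update(struct demand_avg *d, uint64_t sample_pages)
{
	if (sample_pages >= d->avg_pages)
		d->avg_pages += (sample_pages - d->avg_pages) >> d->shift;
	else
		d->avg_pages -= (d->avg_pages - sample_pages) >> d->shift;
}

/*
 * Pages the cache offers when asked to shrink: everything above its
 * smoothed steady-state demand, rather than a fixed amount per interval.
 */
static uint64_t
demand_avg_reclaimable(const struct demand_avg *d, uint64_t cached_pages)
{
	return (cached_pages > d->avg_pages ?
	    cached_pages - d->avg_pages : 0);
}
```

A single burst moves the average only part of the way toward the burst
size, so a cache inflated by a short spike gives most of it back, while
a cache sitting at its steady-state size is left alone.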

One other thing I've noticed in my testing with ZFS is that it needs
feedback and a little time to react to memory pressure.  Calling its
lowmem handler just once isn't enough for it to limit in-flight writes
so that it can avoid reusing pages it just freed up.  But it doesn't
take too long to react (> 1sec in the profiling I've done).  Is there a
way in vm_pageout_scan() that we can better record that progress is
being made (pages were freed in the pass, even if some or all of them
were consumed again) and allow more passes before the OOM killer is
invoked in this case?
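
As a strawman for that question, the bookkeeping could look something
like the following.  It is a user-space sketch under my own assumptions,
not a patch: the struct, the function, and both limits are hypothetical,
and how a real vm_pageout_scan() would count freed pages per pass is left
open.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits, chosen only for illustration. */
#define	OOM_STALLED_LIMIT	2	/* passes in a row that freed nothing */
#define	OOM_SHORT_LIMIT		8	/* passes in a row that missed target */

struct oom_state {
	unsigned short_passes;		/* consecutive passes below target */
	unsigned stalled_passes;	/* consecutive passes with no frees */
};

/*
 * Called at the end of each pass.  Returns true only when the scan has
 * either stopped freeing pages entirely or has stayed below target for
 * a long run of passes despite freeing some.
 */
static bool
oom_should_kill(struct oom_state *s, uint64_t pages_freed, bool below_target)
{
	if (!below_target) {
		s->short_passes = 0;
		s->stalled_passes = 0;
		return (false);
	}
	s->short_passes++;
	if (pages_freed == 0)
		s->stalled_passes++;
	else
		s->stalled_passes = 0;	/* progress: give caches more time */
	return (s->stalled_passes >= OOM_STALLED_LIMIT ||
	    s->short_passes >= OOM_SHORT_LIMIT);
}
```

The second limit is there so that a load which keeps consuming whatever
is freed cannot postpone the OOM decision forever; whether such a cap is
wanted, and what its value should be, is a separate question.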

--
Justin