From: Alan Somers <asomers@gmail.com>
Date: Tue, 18 May 2021 17:55:36 -0600
To: Mark Johnston
Cc: FreeBSD Hackers
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > 12.2-RELEASE.  Sometimes they get into a pathological situation where
> > > > most
> > > > of that RAM sits unused.  For example, right now one of them has:
> > > >
> > > > 2 GB   Active
> > > > 529 GB Inactive
> > > > 16 GB  Free
> > > > 99 GB  ARC total
> > > > 469 GB ARC max
> > > > 86 GB  ARC target
> > > >
> > > > When a server gets into this situation, it stays there for days, with the
> > > > ARC target barely budging.  All that inactive memory never gets reclaimed
> > > > and put to a good use.  Frequently the server never recovers until a
> > > > reboot.
> > > >
> > > > I have a theory for what's going on.  Ever since r334508^ the pagedaemon
> > > > sends the vm_lowmem event _before_ it scans the inactive page list.  If
> > > > the
> > > > ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
> > > > any.  Is that order really correct?  For reference, here's the relevant
> > > > code, from vm_pageout_worker:
> > >
> > > That was the case even before r334508.  Note that prior to that revision
> > > vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
> > > scanning the inactive queue.  During a memory shortage we have pass > 0.
> > > pass == 0 only when the page daemon is scanning the active queue.
> > >
> > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > if (shortage > 0) {
> > > >         ofree = vmd->vmd_free_count;
> > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > >                     (u_int)shortage);
> > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > >             &addl_shortage);
> > > > } else
> > > >         addl_shortage = 0;
> > > >
> > > > Raising vfs.zfs.arc_min seems to work around the problem.  But ideally
> > > > that
> > > > wouldn't be necessary.
> > >
> > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > anything about the magnitude of the shortage.  At the same time, the VM
> > > doesn't know much about how much memory they are consuming.  A better
> > > strategy, at least for the ARC, would be to reclaim memory based on the
> > > relative memory consumption of each subsystem.  In your case, when the
> > > page daemon goes to reclaim memory, it should use the inactive queue to
> > > make up ~85% of the shortfall and reclaim the rest from the ARC.  Even
> > > better would be if the ARC could use the page cache as a second-level
> > > cache, like the buffer cache does.
> > >
> > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > arbitrary fraction of evictable data.  If the ARC is able to quickly
> > > answer the question, "how much memory can I release if asked?", then
> > > the page daemon could use that to determine how much of its reclamation
> > > target should come from the ARC vs. the page cache.
> > >
> >
> > I guess I don't understand why you would ever free from the ARC rather than
> > from the inactive list.  When is inactive memory ever useful?
>
> Pages in the inactive queue are either unmapped or haven't had their
> mappings referenced recently.  But they may still be frequently accessed
> by file I/O operations like sendfile(2).  That's not to say that
> reclaiming from other subsystems first is always the right strategy, but
> note also that the page daemon may scan the inactive queue many times in
> between vm_lowmem calls.
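
To make the proportional idea above concrete, here is a rough userland
sketch (illustrative only, not existing kernel code; the idea that the
page daemon could query the ARC for its evictable size is an assumption
for the example, and the sizes are just the ones from this thread):

/*
 * Split a page-daemon shortage between the inactive queue and the ARC
 * in proportion to how much reclaimable memory each currently holds.
 * With ~529 GB inactive and ~86 GB of ARC, the ARC ends up covering
 * roughly 14% of the shortfall, close to the ~85%/15% split mentioned
 * above.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t inactive_bytes = 529ULL << 30;	/* inactive queue */
	uint64_t arc_evictable = 86ULL << 30;	/* evictable ARC data */
	uint64_t shortage = 10ULL << 20;	/* 10 MB shortfall */
	uint64_t total, from_arc, from_inactive;

	total = inactive_bytes + arc_evictable;
	from_arc = shortage * arc_evictable / total;
	from_inactive = shortage - from_arc;
	printf("reclaim %ju bytes from the ARC, %ju from the inactive queue\n",
	    (uintmax_t)from_arc, (uintmax_t)from_inactive);
	return (0);
}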

So by default ZFS tries to free (arc_target / 128) bytes of memory in
arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
requests 0-10MB, and arc_lowmem tries to free 600 MB.  It looks like it
would be easy to modify vm_lowmem to include the total amount of memory
that it wants freed.  I could make such a patch.

My next question is: what's the fastest way to generate a lot of
inactive memory?  My first attempt was "find . | xargs md5", but that
isn't terribly effective.  The production machines are doing a lot of
"zfs recv" and running some busy Go programs, among other things, but I
can't easily replicate that workload on a development system.
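
For scale, with an 86 GB ARC target, arc_target / 128 is roughly 670 MB,
the same order of magnitude as the ~600 MB figure above.  As a rough
illustration of the proposed change (a hypothetical interface, not the
actual patch; the handler and variable names are made up), a lowmem
callback that carries the page daemon's byte target would let the ARC
evict only what is actually needed:

/*
 * Hypothetical sketch: a lowmem handler that receives the number of
 * bytes the page daemon wants freed, instead of being told only "we
 * are low on memory" and falling back to evicting arc_size / 128.
 */
#include <stdint.h>
#include <stdio.h>

typedef void (*lowmem_handler_t)(void *arg, uint64_t bytes_wanted);

static uint64_t arc_size = 86ULL << 30;	/* pretend ARC size */

static void
arc_lowmem_sketch(void *arg, uint64_t bytes_wanted)
{
	uint64_t fallback, to_evict;

	(void)arg;
	fallback = arc_size / 128;
	to_evict = bytes_wanted < fallback ? bytes_wanted : fallback;
	printf("page daemon wants %ju bytes; ARC would evict %ju\n",
	    (uintmax_t)bytes_wanted, (uintmax_t)to_evict);
}

int
main(void)
{
	lowmem_handler_t handler = arc_lowmem_sketch;

	/* A 10 MB shortage no longer forces a ~700 MB (arc_size / 128) eviction. */
	handler(NULL, 10ULL << 20);
	return (0);
}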
-Alan
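
One possible way to generate a lot of inactive memory for testing, as an
untested sketch: if I understand correctly, read(2) on ZFS is served from
the ARC rather than the page cache, which may be why the md5 sweep was
ineffective, while mmap'd file data does go through the page cache.
Mapping large files, touching every page, and then dropping the mappings
should leave those pages as reclaim candidates; whether they land on the
inactive queue right away is an assumption to verify with top(1) or
vmstat(8).

/*
 * Untested sketch: fault in the pages of each file via mmap(2), hint
 * that they are no longer needed, and unmap.  File and program names
 * are placeholders.
 *
 * Usage: ./fill-inactive <file> [<file> ...]
 */
#include <sys/mman.h>
#include <sys/stat.h>

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	long pagesize = sysconf(_SC_PAGESIZE);

	for (int i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);
		if (fd == -1) {
			perror(argv[i]);
			continue;
		}
		struct stat sb;
		if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
			close(fd);
			continue;
		}
		char *p = mmap(NULL, (size_t)sb.st_size, PROT_READ,
		    MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			close(fd);
			continue;
		}
		/* Touch one byte per page so everything is faulted in. */
		volatile char sink = 0;
		for (off_t off = 0; off < sb.st_size; off += pagesize)
			sink += p[off];
		/* Hint that the pages can be deactivated, then unmap. */
		(void)madvise(p, (size_t)sb.st_size, MADV_DONTNEED);
		munmap(p, (size_t)sb.st_size);
		close(fd);
	}
	return (0);
}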