Date:      Tue, 18 May 2021 21:55:25 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Mark Johnston <markj@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: The pagedaemon evicts ARC before scanning the inactive page list
Message-ID:  <CAOtMX2jAE58CtSHk4vJXXz0_0YC5ejmRx4DnArKGzksNFg51hQ@mail.gmail.com>
In-Reply-To: <YKSFCuNAkPH9Du5E@kib.kiev.ua>
References:  <CAOtMX2gvkrYS0zYYYtjD+Aaqv62MzFYFhWPHjLDGXA1=H7LfCg@mail.gmail.com> <YKQ1biSSGbluuy5f@nuc> <CAOtMX2he1YBidG=zF=iUQw+Os7p=gWMk-sab00NVr0nNs=Cwog@mail.gmail.com> <YKQ7YhMke7ibse6F@nuc> <CAOtMX2hT2XR=fyU6HB11WHbRx4qtNoyPHkX60g3JXXH9JWrObQ@mail.gmail.com> <YKSFCuNAkPH9Du5E@kib.kiev.ua>

On Tue, May 18, 2021 at 9:25 PM Konstantin Belousov <kostikbel@gmail.com>
wrote:

> On Tue, May 18, 2021 at 05:55:36PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > > > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org>
> wrote:
> > > >
> > > > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > > > 12.2-RELEASE.  Sometimes they get into a pathological situation
> where
> > > > > most
> > > > > > of that RAM sits unused.  For example, right now one of them has:
> > > > > >
> > > > > > 2 GB   Active
> > > > > > 529 GB Inactive
> > > > > > 16 GB  Free
> > > > > > 99 GB  ARC total
> > > > > > 469 GB ARC max
> > > > > > 86 GB  ARC target
> > > > > >
> > > > > > When a server gets into this situation, it stays there for days,
> > > with the
> > > > > > ARC target barely budging.  All that inactive memory never gets
> > > reclaimed
> > > > > > and put to a good use.  Frequently the server never recovers
> until a
> > > > > reboot.
> > > > > >
> > > > > > I have a theory for what's going on.  Ever since r334508^ the
> > > pagedaemon
> > > > > > sends the vm_lowmem event _before_ it scans the inactive page
> list.
> > > If
> > > > > the
> > > > > > ARC frees enough memory, then vm_pageout_scan_inactive won't
> need to
> > > free
> > > > > > any.  Is that order really correct?  For reference, here's the
> > > relevant
> > > > > > code, from vm_pageout_worker:
> > > > >
> > > > > That was the case even before r334508.  Note that prior to that
> > > revision
> > > > > vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0,
> before
> > > > > scanning the inactive queue.  During a memory shortage we have
> pass >
> > > 0.
> > > > > pass == 0 only when the page daemon is scanning the active queue.
> > > > >
> > > > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > > > if (shortage > 0) {
> > > > > >         /*
> > > > > >          * vm_lowmem fires first; any pages it frees are
> > > > > >          * credited against the shortage before the
> > > > > >          * inactive scan runs.
> > > > > >          */
> > > > > >         ofree = vmd->vmd_free_count;
> > > > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > > > >                     (u_int)shortage);
> > > > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > > > >             &addl_shortage);
> > > > > > } else
> > > > > >         addl_shortage = 0;
> > > > > >
> > > > > > Raising vfs.zfs.arc_min seems to work around the problem.  But
> > > > > > ideally that wouldn't be necessary.
> > > > >
> > > > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > > > anything about the magnitude of the shortage.  At the same time,
> the VM
> > > > > doesn't know much about how much memory they are consuming.  A
> better
> > > > > strategy, at least for the ARC, would be to reclaim memory
> > > > > based on the
> > > > > relative memory consumption of each subsystem.  In your case, when
> the
> > > > > page daemon goes to reclaim memory, it should use the inactive
> queue to
> > > > > make up ~85% of the shortfall and reclaim the rest from the ARC.
> Even
> > > > > better would be if the ARC could use the page cache as a
> second-level
> > > > > cache, like the buffer cache does.
> > > > >
> > > > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > > > arbitrary fraction of evictable data.  If the ARC is able to
> quickly
> > > > > answer the question, "how much memory can I release if asked?",
> then
> > > > > the page daemon could use that to determine how much of its
> reclamation
> > > > > target should come from the ARC vs. the page cache.
> > > > >
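
For illustration, a minimal sketch of that proportional split inside
vm_pageout_worker(), assuming a hypothetical arc_evictable_bytes() query
(and a matching arc_reclaim()); neither interface exists in the tree today:

    /*
     * Split the reclamation target between the ARC and the inactive
     * queue in proportion to their current sizes.
     * arc_evictable_bytes() and arc_reclaim() are invented names.
     */
    u_long arc_pages, inact_pages;
    int arc_share;

    arc_pages = arc_evictable_bytes() / PAGE_SIZE;
    inact_pages = vmd->vmd_pagequeues[PQ_INACTIVE].pq_cnt;
    arc_share = (arc_pages + inact_pages > 0) ?
        (int)(shortage * arc_pages / (arc_pages + inact_pages)) : 0;
    arc_reclaim((size_t)arc_share * PAGE_SIZE);
    target_met = vm_pageout_scan_inactive(vmd, shortage - arc_share,
        &addl_shortage);

With the figures above (529 GB inactive, 99 GB ARC), 529/(529+99) sends
roughly 85% of the shortfall to the inactive scan, matching the ~85%
mentioned.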
> > > >
> > > > I guess I don't understand why you would ever free from the ARC
> rather
> > > than
> > > > from the inactive list.  When is inactive memory ever useful?
> > >
> > > Pages in the inactive queue are either unmapped or haven't had their
> > > mappings referenced recently.  But they may still be frequently
> accessed
> > > by file I/O operations like sendfile(2).  That's not to say that
> > > reclaiming from other subsystems first is always the right strategy,
> but
> > > note also that the page daemon may scan the inactive queue many times
> in
> > > between vm_lowmem calls.
> > >
> >
> > So by default ZFS tries to free (arc_target / 128) bytes of memory in
> > arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
> > requests 0-10MB, and arc_lowmem tries to free 600 MB.  It looks like it
> > would be easy to modify vm_lowmem to include the total amount of memory
> > that it wants freed.  I could make such a patch.  My next question is:
> > what's the fastest way to generate a lot of inactive memory?  My first
> > attempt was "find . | xargs md5", but that isn't terribly effective.  The
> > production machines are doing a lot of "zfs recv" and running some busy
> Go
> > programs, among other things, but I can't easily replicate that workload
> on
>
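
The patch suggested just above would be small in shape.  Today the
vm_lowmem handlers receive only a flags word, so the page daemon's target
would have to be added to the handler type.  A sketch, where the
target_bytes argument is the hypothetical part:

    /*
     * Today (sys/eventhandler.h) the handlers see only a flags word:
     *     typedef void (*vm_lowmem_handler_t)(void *arg, int flags);
     * The hypothetical extension also carries the shortfall in bytes:
     */
    typedef void (*vm_lowmem_handler_t)(void *arg, int flags,
        size_t target_bytes);

    /* ...and vm_pageout_lowmem() would pass its target along: */
    EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES,
        (size_t)shortage * PAGE_SIZE);

arc_lowmem() could then clamp its eviction to roughly target_bytes
instead of the fixed arc_target / 128.
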
> Is your machine ZFS-only?  If yes, then the typical sources of inactive
> memory are of two kinds:
>

No, there is also FUSE.  But there is typically < 1GB of Buf memory, so I
didn't mention it.


> - anonymous memory that apps allocate with facilities like malloc(3).
>   If the inactive queue is shrinkable then this is probably not the
>   cause, because dirty pages from anon objects must go through the
>   laundry->swap route to get evicted, and you did not mention swapping.
>

No, there's no appreciable amount of swapping going on.  Nor is the laundry
list typically more than a few hundred MB.
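
Both indicators are cheap to confirm from userland via the standard
vm.stats sysctls; a small sketch (counter names as in stock FreeBSD):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    /* Read one of the u_int counters under vm.stats.vm. */
    static u_int
    vmstat_u32(const char *name)
    {
        u_int val = 0;
        size_t len = sizeof(val);

        (void)sysctlbyname(name, &val, &len, NULL, 0);
        return (val);
    }

    int
    main(void)
    {
        /* Nonzero swap traffic would point at anonymous memory. */
        printf("swapped in:  %u pages\n",
            vmstat_u32("vm.stats.vm.v_swappgsin"));
        printf("swapped out: %u pages\n",
            vmstat_u32("vm.stats.vm.v_swappgsout"));
        /* A large laundry queue would say the same. */
        printf("laundry:     %u pages\n",
            vmstat_u32("vm.stats.vm.v_laundry_count"));
        return (0);
    }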


> - double-copy pages cached in the v_objects of ZFS vnodes, clean or
>   dirty.  If unmapped, these are mostly a waste.  Even if mapped, the
>   source of truth for the data is the ARC, AFAIU, so they can be
>   dropped as well, since the inactive state means that their content
>   is not hot.
>

So if a process mmap()'s a file on ZFS and reads from it but never writes
to it, will those pages show up as inactive?
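
Concretely, the case being asked about is an ordinary read-only mapping
like the sketch below ("/tank/somefile" is a placeholder path):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    /*
     * Map a file on a ZFS dataset read-only and fault in its pages.
     * The question: once this mapping goes idle, do the pages sit in
     * the vnode's v_object as Inactive, on top of the ARC's own copy?
     */
    int
    main(void)
    {
        struct stat sb;
        volatile char sum = 0;
        char *p;
        int fd;

        fd = open("/tank/somefile", O_RDONLY);
        if (fd == -1 || fstat(fd, &sb) == -1)
            return (1);
        p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return (1);
        for (off_t off = 0; off < sb.st_size; off += 4096)
            sum += p[off];          /* read-only page faults */
        munmap(p, (size_t)sb.st_size);
        close(fd);
        return (0);
    }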


>
> You can try to inspect the largest objects contributing to the
> inactive queue with 'vmstat -o' to see where most of the inactive
> pages come from.
>

Wow, that did it!  About 99% of the inactive pages come from just a few
vnodes that are used by the FUSE servers.  But I also see a few large
entries like:

1105308 333933 771375   1   0 WB  df

What does that signify?


>
> If indeed they are double-copy, then perhaps ZFS can react even to the
> current primitive vm_lowmem signal somewhat differently.  First, it
> could do a pass over its vnodes and
> - free clean unmapped pages
> - if some targets are not met after that, launder dirty pages,
>   then return to freeing clean unmapped pages
> all that before ever touching its cache (ARC).
>
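
A code-shaped outline of that two-pass idea (every zfs_vnodes_* helper
and reclaim_target() is an invented name; arc_lowmem() stands for the
existing ARC event handler):

    static void
    zfs_lowmem(void *arg, int flags)
    {
        size_t freed;

        /* Pass 1: drop clean, unmapped pages from ZFS v_objects. */
        freed = zfs_vnodes_free_clean_unmapped();
        if (freed >= reclaim_target())
            return;

        /* Pass 2: push dirty pages through the laundry, then retry
           the clean-page sweep on the newly-cleaned pages. */
        zfs_vnodes_launder_dirty();
        freed += zfs_vnodes_free_clean_unmapped();
        if (freed >= reclaim_target())
            return;

        /* Only now touch the ARC itself. */
        arc_lowmem(arg, flags);
    }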
