Date: Tue, 18 May 2021 17:55:36 -0600
From: Alan Somers <asomers@freebsd.org>
To: Mark Johnston <markj@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
Message-ID: <CAOtMX2hT2XR=fyU6HB11WHbRx4qtNoyPHkX60g3JXXH9JWrObQ@mail.gmail.com>
In-Reply-To: <YKQ7YhMke7ibse6F@nuc>
References: <CAOtMX2gvkrYS0zYYYtjD+Aaqv62MzFYFhWPHjLDGXA1=H7LfCg@mail.gmail.com>
 <YKQ1biSSGbluuy5f@nuc>
 <CAOtMX2he1YBidG=zF=iUQw+Os7p=gWMk-sab00NVr0nNs=Cwog@mail.gmail.com>
 <YKQ7YhMke7ibse6F@nuc>
On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > 12.2-RELEASE. Sometimes they get into a pathological situation where most
> > > > of that RAM sits unused. For example, right now one of them has:
> > > >
> > > > 2 GB Active
> > > > 529 GB Inactive
> > > > 16 GB Free
> > > > 99 GB ARC total
> > > > 469 GB ARC max
> > > > 86 GB ARC target
> > > >
> > > > When a server gets into this situation, it stays there for days, with the
> > > > ARC target barely budging. All that inactive memory never gets reclaimed
> > > > and put to good use. Frequently the server never recovers until a reboot.
> > > >
> > > > I have a theory for what's going on. Ever since r334508^ the pagedaemon
> > > > sends the vm_lowmem event _before_ it scans the inactive page list. If the
> > > > ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
> > > > any. Is that order really correct? For reference, here's the relevant
> > > > code, from vm_pageout_worker:
> > >
> > > That was the case even before r334508. Note that prior to that revision
> > > vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
> > > scanning the inactive queue. During a memory shortage we have pass > 0.
> > > pass == 0 only when the page daemon is scanning the active queue.
> > >
> > > >         shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > >         if (shortage > 0) {
> > > >                 ofree = vmd->vmd_free_count;
> > > >                 if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > >                         shortage -= min(vmd->vmd_free_count - ofree,
> > > >                             (u_int)shortage);
> > > >                 target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > >                     &addl_shortage);
> > > >         } else
> > > >                 addl_shortage = 0;
> > > >
> > > > Raising vfs.zfs.arc_min seems to work around the problem. But ideally that
> > > > wouldn't be necessary.
> > >
> > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > anything about the magnitude of the shortage. At the same time, the VM
> > > doesn't know much about how much memory they are consuming. A better
> > > strategy, at least for the ARC, would be to reclaim memory based on the
> > > relative memory consumption of each subsystem. In your case, when the
> > > page daemon goes to reclaim memory, it should use the inactive queue to
> > > make up ~85% of the shortfall and reclaim the rest from the ARC. Even
> > > better would be if the ARC could use the page cache as a second-level
> > > cache, like the buffer cache does.
> > >
> > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > arbitrary fraction of evictable data. If the ARC is able to quickly
> > > answer the question, "how much memory can I release if asked?", then
> > > the page daemon could use that to determine how much of its reclamation
> > > target should come from the ARC vs. the page cache.
> > >
> >
> > I guess I don't understand why you would ever free from the ARC rather than
> > from the inactive list. When is inactive memory ever useful?
>
> Pages in the inactive queue are either unmapped or haven't had their
> mappings referenced recently. But they may still be frequently accessed
> by file I/O operations like sendfile(2). That's not to say that
> reclaiming from other subsystems first is always the right strategy, but
> note also that the page daemon may scan the inactive queue many times in
> between vm_lowmem calls.
>
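If I follow the proportional idea, here's a back-of-the-envelope sketch of
what the apportioning could look like (my own interpretation, not anything
in the tree today; arc_evictable_bytes() and vm_lowmem_request() are
made-up names):

        /*
         * Hypothetical: split the page daemon's shortage between the
         * inactive queue and the ARC according to their relative sizes.
         */
        inact = vmd->vmd_pagequeues[PQ_INACTIVE].pq_cnt;
        arc = atop(arc_evictable_bytes());      /* made-up helper */
        arc_share = (u_int)(((uint64_t)shortage * arc) / (inact + arc));
        vm_lowmem_request(ptoa(arc_share));     /* made-up hook */
        target_met = vm_pageout_scan_inactive(vmd, shortage - arc_share,
            &addl_shortage);

With 529 GB inactive and 99 GB of ARC, the ARC's share of the shortfall
would be about 16%, consistent with the ~85% figure above.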
So by default ZFS tries to free (arc_target / 128) bytes of memory in
arc_lowmem. That's huge! On this server, pidctrl_daemon typically requests
0-10 MB, and arc_lowmem tries to free 600 MB. It looks like it would be easy
to modify vm_lowmem to include the total amount of memory that it wants
freed, and I could make such a patch.
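Roughly what I have in mind (an untested sketch; the extra byte count,
vm_pageout_lowmem()'s argument, and the arc_lowmem() change are all
hypothetical, not anything that exists today):

        /*
         * Hypothetical: let vm_lowmem carry the page daemon's shortfall
         * so that consumers can size their response to the actual need.
         */
        typedef void (*vm_lowmem_handler_t)(void *arg, int flags,
            size_t bytes_wanted);

        /* vm_pageout_worker() would pass its shortage along ... */
        if (vm_pageout_lowmem(ptoa(shortage)) &&
            vmd->vmd_free_count > ofree)
                shortage -= min(vmd->vmd_free_count - ofree,
                    (u_int)shortage);

        /*
         * ... and arc_lowmem() would free min(bytes_wanted, evictable)
         * instead of the fixed arc_c >> arc_shrink_shift.
         */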
My next question is: what's the fastest way to generate a lot of inactive
memory? My first attempt was "find . | xargs md5", but that isn't terribly
effective, presumably because regular read(2) traffic on ZFS lands in the
ARC rather than the page cache. The production machines are doing a lot of
"zfs recv" and running some busy Go programs, among other things, but I
can't easily replicate that workload on a development system.
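One thing I might try (an untested userland sketch; it relies on my reading
that FreeBSD's madvise(MADV_DONTNEED) deactivates resident pages rather
than freeing them, so they should land in the inactive queue):

        #include <sys/mman.h>
        #include <err.h>
        #include <string.h>
        #include <unistd.h>

        int
        main(void)
        {
                size_t len = 1UL << 30; /* 1 GB per run; loop for more */
                char *p;

                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_ANON | MAP_PRIVATE, -1, 0);
                if (p == MAP_FAILED)
                        err(1, "mmap");
                memset(p, 1, len);      /* make every page resident */
                if (madvise(p, len, MADV_DONTNEED) != 0)
                        err(1, "madvise");
                pause();                /* keep the mapping alive */
                return (0);
        }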
-Alan
