Date: Tue, 18 May 2021 21:55:25 -0600
From: Alan Somers <asomers@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
Message-ID: <CAOtMX2jAE58CtSHk4vJXXz0_0YC5ejmRx4DnArKGzksNFg51hQ@mail.gmail.com>
In-Reply-To: <YKSFCuNAkPH9Du5E@kib.kiev.ua>
References: <CAOtMX2gvkrYS0zYYYtjD+Aaqv62MzFYFhWPHjLDGXA1=H7LfCg@mail.gmail.com>
 <YKQ1biSSGbluuy5f@nuc>
 <CAOtMX2he1YBidG=zF=iUQw+Os7p=gWMk-sab00NVr0nNs=Cwog@mail.gmail.com>
 <YKQ7YhMke7ibse6F@nuc>
 <CAOtMX2hT2XR=fyU6HB11WHbRx4qtNoyPHkX60g3JXXH9JWrObQ@mail.gmail.com>
 <YKSFCuNAkPH9Du5E@kib.kiev.ua>

On Tue, May 18, 2021 at 9:25 PM Konstantin Belousov <kostikbel@gmail.com> wrote:
> On Tue, May 18, 2021 at 05:55:36PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> > > On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > > > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:
> > > > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > > > 12.2-RELEASE.  Sometimes they get into a pathological situation
> > > > > > where most of that RAM sits unused.  For example, right now one
> > > > > > of them has:
> > > > > >
> > > > > > 2 GB   Active
> > > > > > 529 GB Inactive
> > > > > > 16 GB  Free
> > > > > > 99 GB  ARC total
> > > > > > 469 GB ARC max
> > > > > > 86 GB  ARC target
> > > > > >
> > > > > > When a server gets into this situation, it stays there for days,
> > > > > > with the ARC target barely budging.  All that inactive memory
> > > > > > never gets reclaimed and put to good use.  Frequently the server
> > > > > > never recovers until a reboot.
> > > > > >
> > > > > > I have a theory for what's going on.  Ever since r334508^ the
> > > > > > pagedaemon sends the vm_lowmem event _before_ it scans the
> > > > > > inactive page list.  If the ARC frees enough memory, then
> > > > > > vm_pageout_scan_inactive won't need to free any.  Is that order
> > > > > > really correct?  For reference, here's the relevant code, from
> > > > > > vm_pageout_worker:
> > > > >
> > > > > That was the case even before r334508.  Note that prior to that
> > > > > revision vm_pageout_scan_inactive() would trigger vm_lowmem if
> > > > > pass > 0, before scanning the inactive queue.  During a memory
> > > > > shortage we have pass > 0.  pass == 0 only when the page daemon is
> > > > > scanning the active queue.
> > > > >
> > > > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > > > if (shortage > 0) {
> > > > > >         ofree = vmd->vmd_free_count;
> > > > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > > > >                     (u_int)shortage);
> > > > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > > > >             &addl_shortage);
> > > > > > } else
> > > > > >         addl_shortage = 0;
> > > > > >
> > > > > > Raising vfs.zfs.arc_min seems to work around the problem.  But
> > > > > > ideally that wouldn't be necessary.
> > > > >
> > > > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > > > anything about the magnitude of the shortage.  At the same time,
> > > > > the VM doesn't know much about how much memory they are consuming.
> > > > > A better strategy, at least for the ARC, would be to reclaim memory
> > > > > based on the relative memory consumption of each subsystem.  In
> > > > > your case, when the page daemon goes to reclaim memory, it should
> > > > > use the inactive queue to make up ~85% of the shortfall and reclaim
> > > > > the rest from the ARC.  Even better would be if the ARC could use
> > > > > the page cache as a second-level cache, like the buffer cache does.
> > > > >
> > > > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > > > arbitrary fraction of evictable data.  If the ARC is able to
> > > > > quickly answer the question, "how much memory can I release if
> > > > > asked?", then the page daemon could use that to determine how much
> > > > > of its reclamation target should come from the ARC vs. the page
> > > > > cache.
> > > >
> > > > I guess I don't understand why you would ever free from the ARC
> > > > rather than from the inactive list.  When is inactive memory ever
> > > > useful?
> > >
> > > Pages in the inactive queue are either unmapped or haven't had their
> > > mappings referenced recently.  But they may still be frequently
> > > accessed by file I/O operations like sendfile(2).  That's not to say
> > > that reclaiming from other subsystems first is always the right
> > > strategy, but note also that the page daemon may scan the inactive
> > > queue many times in between vm_lowmem calls.
> >
> > So by default ZFS tries to free (arc_target / 128) bytes of memory in
> > arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
> > requests 0-10 MB, and arc_lowmem tries to free 600 MB.  It looks like it
> > would be easy to modify vm_lowmem to include the total amount of memory
> > that it wants freed.  I could make such a patch.  My next question is:
> > what's the fastest way to generate a lot of inactive memory?  My first
> > attempt was "find . | xargs md5", but that isn't terribly effective.
> > The production machines are doing a lot of "zfs recv" and running some
> > busy Go programs, among other things, but I can't easily replicate that
> > workload on
>
> Is your machine ZFS-only?  If yes, then typical sources of inactive
> memory can be of two kinds:

No, there is also FUSE.  But there is typically < 1 GB of Buf memory, so I
didn't mention it.

> - anonymous memory that apps allocate with facilities like malloc(3).
>   If inactive is shrinkable then it is probably not this, because dirty
>   pages from anon objects must go through the laundry->swap route to get
>   evicted, and you did not mention swapping.

No, there's no appreciable amount of swapping going on.  Nor is the
laundry list typically more than a few hundred MB.

> - double-copy pages cached in v_objects of ZFS vnodes, clean or dirty.
>   If unmapped, these are mostly a waste.  Even if mapped, the source of
>   truth for the data is the ARC, AFAIU, so they can be dropped as well,
>   since the inactive state means that their content is not hot.

So if a process mmap()'s a file on ZFS and reads from it but never writes
to it, will those pages show up as inactive?

> You can try to inspect the most outstanding objects adding to the
> inactive queue with 'vmobject -o' to see where most of the inactive
> pages come from.

Wow, that did it!  About 99% of the inactive pages come from just a few
vnodes which are used by the FUSE servers.  But I also see a few large
entries like

1105308 333933 771375   1   0 WB  df

What does that signify?

> If indeed they are double-copy, then perhaps ZFS can react even to the
> current primitive vm_lowmem signal somewhat differently.  First, it could
> do a pass over its vnodes and
> - free clean unmapped pages
> - if some targets are not met after that, launder dirty pages, then
>   return to freeing clean unmapped pages
> all that before ever touching its cache (ARC).
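
For reference, here is the back-of-the-envelope arithmetic behind the
~600 MB figure quoted above.  It is plain userland arithmetic assuming the
default arc_shrink_shift of 7 (so arc_lowmem sheds roughly arc_c / 128),
using this thread's numbers; it is a sketch, not the OpenZFS code itself:

/*
 * Rough check of the arc_lowmem overshoot, assuming the default
 * arc_shrink_shift of 7, i.e. the ARC sheds about target / 128 per
 * vm_lowmem event.  Numbers are the ones reported in this thread.
 */
#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	uint64_t arc_target = 86ULL << 30;	/* 86 GB ARC target */
	int arc_shrink_shift = 7;		/* default: shed target / 128 */
	uint64_t arc_to_free = arc_target >> arc_shrink_shift;
	uint64_t shortage = 10ULL << 20;	/* pidctrl_daemon asks for ~0-10 MB */

	printf("arc_lowmem would try to shed ~%ju MB\n",
	    (uintmax_t)(arc_to_free >> 20));
	printf("the page daemon only asked for ~%ju MB\n",
	    (uintmax_t)(shortage >> 20));
	printf("overshoot: roughly %jux\n",
	    (uintmax_t)(arc_to_free / shortage));
	return (0);
}

With an 86 GB target that works out to roughly 700 MB shed for a shortage
of at most 10 MB, the same order of magnitude as the ~600 MB mentioned
above.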
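And a rough userland mock-up of the proportional strategy Mark described,
splitting the shortfall between the inactive queue and the ARC according
to their relative sizes.  The sizes are just the ones from the top of the
thread; this is illustrative arithmetic, not a kernel patch:

/*
 * Mock-up of a proportional reclaim split: each consumer covers a share
 * of the shortage proportional to its footprint, instead of a fixed
 * target / 128.  Sizes are the ones reported at the top of the thread.
 */
#include <inttypes.h>
#include <stdio.h>

int
main(void)
{
	uint64_t inactive = 529ULL << 30;	/* 529 GB inactive */
	uint64_t arc_size = 99ULL << 30;	/* 99 GB ARC total */
	uint64_t shortage = 10ULL << 20;	/* 10 MB shortfall */
	uint64_t total = inactive + arc_size;
	uint64_t from_arc = shortage * arc_size / total;
	uint64_t from_inactive = shortage - from_arc;

	printf("ARC share ~%ju%%: reclaim ~%ju KB from the ARC\n",
	    (uintmax_t)(100 * arc_size / total),
	    (uintmax_t)(from_arc >> 10));
	printf("inactive share ~%ju%%: reclaim ~%ju KB from the inactive queue\n",
	    (uintmax_t)(100 * inactive / total),
	    (uintmax_t)(from_inactive >> 10));
	return (0);
}

With those sizes the ARC would be asked for only about 15% of a given
shortfall, in line with the ~85%/15% split Mark mentioned.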