From: Alan Somers <asomers@gmail.com>
Date: Tue, 18 May 2021 21:55:25 -0600
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>, FreeBSD Hackers <freebsd-hackers@freebsd.org>
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Tue, May 18, 2021 at 9:25 PM Konstantin Belousov <kostikbel@gmail.com> wrote:

> On Tue, May 18, 2021 at 05:55:36PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > > > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:
> > > >
> > > > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > > > 12.2-RELEASE.  Sometimes they get into a pathological situation where
> > > > > > most of that RAM sits unused.  For example, right now one of them has:
> > > > > >
> > > > > > 2 GB   Active
> > > > > > 529 GB Inactive
> > > > > > 16 GB  Free
> > > > > > 99 GB  ARC total
> > > > > > 469 GB ARC max
> > > > > > 86 GB  ARC target
> > > > > >
> > > > > > When a server gets into this situation, it stays there for days, with
> > > > > > the ARC target barely budging.  All that inactive memory never gets
> > > > > > reclaimed and put to good use.  Frequently the server never recovers
> > > > > > until a reboot.
> > > > > >
> > > > > > I have a theory for what's going on.  Ever since r334508^ the pagedaemon
> > > > > > sends the vm_lowmem event _before_ it scans the inactive page list.  If
> > > > > > the ARC frees enough memory, then vm_pageout_scan_inactive won't need to
> > > > > > free any.  Is that order really correct?  For reference, here's the
> > > > > > relevant code, from vm_pageout_worker:
> > > > >
> > > > > That was the case even before r334508.  Note that prior to that revision
> > > > > vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
> > > > > scanning the inactive queue.  During a memory shortage we have pass > 0.
> > > > > pass == 0 only when the page daemon is scanning the active queue.
> > > > >
> > > > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > > > if (shortage > 0) {
> > > > > >         ofree = vmd->vmd_free_count;
> > > > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > > > >                     (u_int)shortage);
> > > > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > > > >             &addl_shortage);
> > > > > > } else
> > > > > >         addl_shortage = 0;
> > > > > >
> > > > > > Raising vfs.zfs.arc_min seems to work around the problem.  But ideally
> > > > > > that wouldn't be necessary.
> > > > >
> > > > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > > > anything about the magnitude of the shortage.  At the same time, the VM
> > > > > doesn't know much about how much memory they are consuming.  A better
> > > > > strategy, at least for the ARC, would be to reclaim memory based on the
> > > > > relative memory consumption of each subsystem.  In your case, when the
> > > > > page daemon goes to reclaim memory, it should use the inactive queue to
> > > > > make up ~85% of the shortfall and reclaim the rest from the ARC.  Even
> > > > > better would be if the ARC could use the page cache as a second-level
> > > > > cache, like the buffer cache does.
> > > > >
> > > > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > > > arbitrary fraction of evictable data.  If the ARC is able to quickly
> > > > > answer the question, "how much memory can I release if asked?", then
> > > > > the page daemon could use that to determine how much of its reclamation
> > > > > target should come from the ARC vs. the page cache.
> > > >
> > > > I guess I don't understand why you would ever free from the ARC rather
> > > > than from the inactive list.  When is inactive memory ever useful?
> > >
> > > Pages in the inactive queue are either unmapped or haven't had their
> > > mappings referenced recently.  But they may still be frequently accessed
> > > by file I/O operations like sendfile(2).  That's not to say that
> > > reclaiming from other subsystems first is always the right strategy, but
> > > note also that the page daemon may scan the inactive queue many times in
> > > between vm_lowmem calls.
> >
> > So by default ZFS tries to free (arc_target / 128) bytes of memory in
> > arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
> > requests 0-10 MB, and arc_lowmem tries to free 600 MB.  It looks like it
> > would be easy to modify vm_lowmem to include the total amount of memory
> > that it wants freed.  I could make such a patch.  My next question is:
> > what's the fastest way to generate a lot of inactive memory?  My first
> > attempt was "find . | xargs md5", but that isn't terribly effective.  The
> > production machines are doing a lot of "zfs recv" and running some busy Go
> > programs, among other things, but I can't easily replicate that workload on
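(An aside for the archives: the 600 MB figure above is just arc_target / 128, and
it can be eyeballed from userland with a throwaway program like the one below.
This is only an illustration, not part of any proposed patch; the sysctl name is
the one I see on 12.2 systems with ZFS loaded.)

/*
 * Throwaway userland check of the arc_target / 128 arithmetic discussed
 * above.  kstat.zfs.misc.arcstats.c is assumed to be the current ARC
 * target size, as exposed on FreeBSD 12.2 with ZFS.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t arc_c;
	size_t len = sizeof(arc_c);

	/* Current ARC target size, in bytes. */
	if (sysctlbyname("kstat.zfs.misc.arcstats.c", &arc_c, &len, NULL, 0) != 0) {
		perror("sysctlbyname");
		return (1);
	}
	printf("ARC target: %ju MB, arc_lowmem free target (c / 128): %ju MB\n",
	    (uintmax_t)(arc_c >> 20), (uintmax_t)((arc_c / 128) >> 20));
	return (0);
}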
> Is your machine ZFS-only?  If yes, then typical sources of inactive memory
> can be of two kinds:

No, there is also FUSE.  But there is typically < 1 GB of Buf memory, so I
didn't mention it.

> - anonymous memory that apps allocate with facilities like malloc(3).
>   If inactive is shrinkable then it is probably not this, because dirty
>   pages from anon objects must go through the laundry->swap route to get
>   evicted, and you did not mention swapping.

No, there's no appreciable amount of swapping going on.  Nor is the laundry
list typically more than a few hundred MB.

> - double-copy pages cached in v_objects of ZFS vnodes, clean or dirty.
>   If unmapped, these are mostly a waste.  Even if mapped, the source of
>   truth for the data is the ARC, AFAIU, so they can be dropped as well,
>   since the inactive state means that their content is not hot.

So if a process mmap()'s a file on ZFS and reads from it but never writes to
it, will those pages show up as inactive?
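In other words, an access pattern like the following sketch, which is only
meant to illustrate the question; the file path is a placeholder:

/*
 * Sketch of the access pattern in question: map a file on ZFS read-only,
 * touch every page, never write anything.
 */
#include <sys/mman.h>
#include <sys/stat.h>

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *path;
	unsigned char *p;
	struct stat sb;
	off_t off;
	long pagesize;
	volatile unsigned char sink = 0;
	int fd;

	path = (argc > 1) ? argv[1] : "/pool/some/big/file";	/* placeholder */
	if ((fd = open(path, O_RDONLY)) == -1)
		err(1, "open(%s)", path);
	if (fstat(fd, &sb) == -1)
		err(1, "fstat");
	p = mmap(NULL, (size_t)sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	pagesize = sysconf(_SC_PAGESIZE);
	/* Read one byte per page; nothing is ever dirtied. */
	for (off = 0; off < sb.st_size; off += pagesize)
		sink += p[off];
	(void)sink;
	munmap(p, (size_t)sb.st_size);
	close(fd);
	return (0);
}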
> You can try to inspect the most outstanding objects adding to the
> inactive queue with 'vmobject -o' to see where most of the inactive pages
> come from.

Wow, that did it!  About 99% of the inactive pages come from just a few
vnodes which are used by the FUSE servers.  But I also see a few large
entries like

1105308 333933 771375   1   0 WB  df

What does that signify?

> If indeed they are double-copy, then perhaps ZFS can react even to the
> current primitive vm_lowmem signal somewhat differently.  First, it could
> do a pass over its vnodes and
> - free clean unmapped pages
> - if some targets are not met after that, launder dirty pages, then
>   return to freeing clean unmapped pages
> all that before ever touching its cache (ARC).
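For reference, while experimenting with any of these ideas I plan to watch the
page-queue counters with a small helper along these lines (illustration only;
the sysctl names are the ones I see on 12.2, so adjust if they differ):

/*
 * Illustrative helper: snapshot the page-queue counters relevant to this
 * thread.  Sysctl names assumed from a FreeBSD 12.2 system.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdint.h>
#include <stdio.h>

static uint64_t
rd32(const char *name)
{
	uint32_t v = 0;
	size_t len = sizeof(v);

	(void)sysctlbyname(name, &v, &len, NULL, 0);
	return (v);
}

int
main(void)
{
	uint64_t pagesz = rd32("vm.stats.vm.v_page_size");

	printf("active:   %ju MB\n",
	    (uintmax_t)(rd32("vm.stats.vm.v_active_count") * pagesz >> 20));
	printf("inactive: %ju MB\n",
	    (uintmax_t)(rd32("vm.stats.vm.v_inactive_count") * pagesz >> 20));
	printf("laundry:  %ju MB\n",
	    (uintmax_t)(rd32("vm.stats.vm.v_laundry_count") * pagesz >> 20));
	printf("free:     %ju MB\n",
	    (uintmax_t)(rd32("vm.stats.vm.v_free_count") * pagesz >> 20));
	return (0);
}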