From: Alan Somers <asomers@gmail.com>
Date: Tue, 18 May 2021 17:55:36 -0600
To: Mark Johnston
Cc: FreeBSD Hackers
Subject: Re: The pagedaemon evicts ARC before scanning the inactive page list
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Tue, May 18, 2021 at 4:10 PM Mark Johnston <markj@freebsd.org> wrote:
> On Tue, May 18, 2021 at 04:00:14PM -0600, Alan Somers wrote:
> > On Tue, May 18, 2021 at 3:45 PM Mark Johnston <markj@freebsd.org> wrote:
> >
> > > On Tue, May 18, 2021 at 03:07:44PM -0600, Alan Somers wrote:
> > > > I'm using ZFS on servers with tons of RAM and running FreeBSD
> > > > 12.2-RELEASE.  Sometimes they get into a pathological situation where
> > > > most
> > > > of that RAM sits unused.  For example, right now one of them has:
> > > >
> > > > 2 GB   Active
> > > > 529 GB Inactive
> > > > 16 GB  Free
> > > > 99 GB  ARC total
> > > > 469 GB ARC max
> > > > 86 GB  ARC target
> > > >
> > > > When a server gets into this situation, it stays there for days, with the
> > > > ARC target barely budging.  All that inactive memory never gets reclaimed
> > > > and put to a good use.  Frequently the server never recovers until a
> > > > reboot.
> > > >
> > > > I have a theory for what's going on.  Ever since r334508^ the pagedaemon
> > > > sends the vm_lowmem event _before_ it scans the inactive page list.  If
> > > > the
> > > > ARC frees enough memory, then vm_pageout_scan_inactive won't need to free
> > > > any.  Is that order really correct?  For reference, here's the relevant
> > > > code, from vm_pageout_worker:
> > >
> > > That was the case even before r334508.  Note that prior to that revision
> > > vm_pageout_scan_inactive() would trigger vm_lowmem if pass > 0, before
> > > scanning the inactive queue.  During a memory shortage we have pass > 0.
> > > pass == 0 only when the page daemon is scanning the active queue.
> > >
> > > > shortage = pidctrl_daemon(&vmd->vmd_pid, vmd->vmd_free_count);
> > > > if (shortage > 0) {
> > > >         ofree = vmd->vmd_free_count;
> > > >         if (vm_pageout_lowmem() && vmd->vmd_free_count > ofree)
> > > >                 shortage -= min(vmd->vmd_free_count - ofree,
> > > >                     (u_int)shortage);
> > > >         target_met = vm_pageout_scan_inactive(vmd, shortage,
> > > >             &addl_shortage);
> > > > } else
> > > >         addl_shortage = 0;
> > > >
> > > > Raising vfs.zfs.arc_min seems to work around the problem.  But ideally
> > > > that
> > > > wouldn't be necessary.
> > >
> > > vm_lowmem is too primitive: it doesn't tell subscribing subsystems
> > > anything about the magnitude of the shortage.  At the same time, the VM
> > > doesn't know much about how much memory they are consuming.  A better
> > > strategy, at least for the ARC, would be to reclaim memory based on the
> > > relative memory consumption of each subsystem.  In your case, when the
> > > page daemon goes to reclaim memory, it should use the inactive queue to
> > > make up ~85% of the shortfall and reclaim the rest from the ARC.  Even
> > > better would be if the ARC could use the page cache as a second-level
> > > cache, like the buffer cache does.
> > >
> > > Today I believe the ARC treats vm_lowmem as a signal to shed some
> > > arbitrary fraction of evictable data.  If the ARC is able to quickly
> > > answer the question, "how much memory can I release if asked?", then
> > > the page daemon could use that to determine how much of its reclamation
> > > target should come from the ARC vs. the page cache.
> > >
> >
> > I guess I don't understand why you would ever free from the ARC rather than
> > from the inactive list.  When is inactive memory ever useful?
>
> Pages in the inactive queue are either unmapped or haven't had their
> mappings referenced recently.  But they may still be frequently accessed
> by file I/O operations like sendfile(2).  That's not to say that
> reclaiming from other subsystems first is always the right strategy, but
> note also that the page daemon may scan the inactive queue many times in
> between vm_lowmem calls.
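
To make the proportional idea above concrete, here is a rough userland
sketch (illustrative only, not existing kernel code; the idea that the
page daemon could query the ARC for its evictable size is an assumption
for the example, and the sizes are just the ones from this thread):

/*
 * Split a page-daemon shortage between the inactive queue and the ARC
 * in proportion to how much reclaimable memory each currently holds.
 * With ~529 GB inactive and ~86 GB of ARC, the ARC ends up covering
 * roughly 14% of the shortfall, close to the ~85%/15% split mentioned
 * above.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t inactive_bytes = 529ULL << 30;	/* inactive queue */
	uint64_t arc_evictable = 86ULL << 30;	/* evictable ARC data */
	uint64_t shortage = 10ULL << 20;	/* 10 MB shortfall */
	uint64_t total, from_arc, from_inactive;

	total = inactive_bytes + arc_evictable;
	from_arc = shortage * arc_evictable / total;
	from_inactive = shortage - from_arc;
	printf("reclaim %ju bytes from the ARC, %ju from the inactive queue\n",
	    (uintmax_t)from_arc, (uintmax_t)from_inactive);
	return (0);
}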

So by default ZFS tries to free (arc_target / 128) bytes of memory in
arc_lowmem.  That's huge!  On this server, pidctrl_daemon typically
requests 0-10MB, and arc_lowmem tries to free 600 MB.  It looks like it
would be easy to modify vm_lowmem to include the total amount of memory
that it wants freed.  I could make such a patch.

My next question is: what's the fastest way to generate a lot of
inactive memory?  My first attempt was "find . | xargs md5", but that
isn't terribly effective.  The production machines are doing a lot of
"zfs recv" and running some busy Go programs, among other things, but I
can't easily replicate that workload on a development system.
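
For scale, with an 86 GB ARC target, arc_target / 128 is roughly 670 MB,
the same order of magnitude as the ~600 MB figure above.  As a rough
illustration of the proposed change (a hypothetical interface, not the
actual patch; the handler and variable names are made up), a lowmem
callback that carries the page daemon's byte target would let the ARC
evict only what is actually needed:

/*
 * Hypothetical sketch: a lowmem handler that receives the number of
 * bytes the page daemon wants freed, instead of being told only "we
 * are low on memory" and falling back to evicting arc_size / 128.
 */
#include <stdint.h>
#include <stdio.h>

typedef void (*lowmem_handler_t)(void *arg, uint64_t bytes_wanted);

static uint64_t arc_size = 86ULL << 30;	/* pretend ARC size */

static void
arc_lowmem_sketch(void *arg, uint64_t bytes_wanted)
{
	uint64_t fallback, to_evict;

	(void)arg;
	fallback = arc_size / 128;
	to_evict = bytes_wanted < fallback ? bytes_wanted : fallback;
	printf("page daemon wants %ju bytes; ARC would evict %ju\n",
	    (uintmax_t)bytes_wanted, (uintmax_t)to_evict);
}

int
main(void)
{
	lowmem_handler_t handler = arc_lowmem_sketch;

	/* A 10 MB shortage no longer forces a ~700 MB (arc_size / 128) eviction. */
	handler(NULL, 10ULL << 20);
	return (0);
}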
-Alan
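
One possible way to generate a lot of inactive memory for testing, as an
untested sketch: if I understand correctly, read(2) on ZFS is served from
the ARC rather than the page cache, which may be why the md5 sweep was
ineffective, while mmap'd file data does go through the page cache.
Mapping large files, touching every page, and then dropping the mappings
should leave those pages as reclaim candidates; whether they land on the
inactive queue right away is an assumption to verify with top(1) or
vmstat(8).

/*
 * Untested sketch: fault in the pages of each file via mmap(2), hint
 * that they are no longer needed, and unmap.  File and program names
 * are placeholders.
 *
 * Usage: ./fill-inactive <file> [<file> ...]
 */
#include <sys/mman.h>
#include <sys/stat.h>

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	long pagesize = sysconf(_SC_PAGESIZE);

	for (int i = 1; i < argc; i++) {
		int fd = open(argv[i], O_RDONLY);
		if (fd == -1) {
			perror(argv[i]);
			continue;
		}
		struct stat sb;
		if (fstat(fd, &sb) == -1 || sb.st_size == 0) {
			close(fd);
			continue;
		}
		char *p = mmap(NULL, (size_t)sb.st_size, PROT_READ,
		    MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			close(fd);
			continue;
		}
		/* Touch one byte per page so everything is faulted in. */
		volatile char sink = 0;
		for (off_t off = 0; off < sb.st_size; off += pagesize)
			sink += p[off];
		/* Hint that the pages can be deactivated, then unmap. */
		(void)madvise(p, (size_t)sb.st_size, MADV_DONTNEED);
		munmap(p, (size_t)sb.st_size);
		close(fd);
	}
	return (0);
}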