Date:      Fri, 08 Dec 2023 04:50:55 +0000
From:      bugzilla-noreply@freebsd.org
To:        fs@FreeBSD.org
Subject:   [Bug 275594] High CPU usage by arc_prune; analysis and fix
Message-ID:  <bug-275594-3630-Qw9Q4fMorA@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>
References:  <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #5 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Mark Johnston from comment #3)

The build has completed.

Build time: 07:40:56 (278 pkgs / hr)

arc_prune stopped shortly after poudriere finished.  The pileup of arc_prune
threads has indeed been fixed by FreeBSD-EN-23:18.openzfs, but the essential
problem lies somewhere else.

Right now, I am testing with the following setup after the reboot:

- vfs.vnode.vnlru.max_free_per_call: 10000 (out-of-box)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)
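
(A minimal sketch, in case anyone wants to check the same knobs
programmatically via sysctlbyname(3).  The integer widths below are my
assumptions, and vfs.zfs.arc.prune_interval exists only on a kernel with my
fix applied.)

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	u_long max_free;	/* vfs.vnode.vnlru.max_free_per_call */
	int prune_interval;	/* vfs.zfs.arc.prune_interval (patched kernel only) */
	size_t len;

	len = sizeof(max_free);
	if (sysctlbyname("vfs.vnode.vnlru.max_free_per_call",
	    &max_free, &len, NULL, 0) == 0)
		printf("max_free_per_call: %lu\n", max_free);

	len = sizeof(prune_interval);
	if (sysctlbyname("vfs.zfs.arc.prune_interval",
	    &prune_interval, &len, NULL, 0) == 0)
		printf("prune_interval: %d ms\n", prune_interval);

	return (0);
}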

About 2 hours after the start, the CPU usage of arc_prune was at 20-25%, with
occasional drops.  Poudriere was working on lang/rust and lang/gcc12 at that
time.

A correction to the description:

> * Test Environment: VM & OS
>   - RAM: 20 GB (not 16 GB)

A note on the ZFS configuration:

> vfs.zfs.arc_max=4294967296 (4GiB)

This limit has been added because this host is a build server, not a file
server.  AFAIK, ZFS tends to take up to about 1/4 of the available RAM for the
ARC.  While that may be fair for a file server, an application server wants
more RAM in general.

Under the limit above, demand for ARC pruning is expected, and the OS must be
ready to deal with it.


> arc_prune_async() is rather dumb on FreeBSD, as you point out: it tries to
> reclaim vnodes from the global free list, but doing so might not alleviate
> pressure.  Really we want some way to shrink a per-mountpoint or
> per-filesystem cache.
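
(For context, my understanding of what that path boils down to today; this is
a from-memory sketch, not the actual arc_os.c source, and the exact
vnlru_free_vfsops() name and signature may differ across branches.)

/*
 * Rough sketch of the behaviour being discussed: the ARC hands a vnode
 * count to the global vnlru free routine and hopes the freed vnodes
 * release enough dnodes/dbufs.  From memory; not the real source.
 */
static void
arc_prune_task_sketch(void *arg)
{
	int count = (int)(uintptr_t)arg;

	/*
	 * Walks the global vnode free list; the vnodes it frees may or
	 * may not be the ones holding the ARC buffers we want evicted.
	 */
	vnlru_free_vfsops(count, &zfs_vfsops);
}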

I thought you would say that; I had much the same thought more than 20 years
ago while implementing the initial version of vnlru along with Matt Dillon :)

The per-mountpoint / per-filesystem vnode design has at least two challenges:

A) Balancing the vnodes across the mountpoints / filesystems, and
B) Splitting the name cache.

I suspect B) is the more difficult one.  As of now, the global name cache
allows a vnode lookup to be resolved in a single place with just one pass.
The behaviour and performance of a per-mountpoint / per-filesystem name cache
would depend on the interaction across multiple filesystems, and hence be very
complicated to analyse and tune.

Throttling the interval between the ARC pruning executions is much simpler and
still effective, given my key findings from the first test in the description
(a rough sketch of the idea follows the list):

- The ARC pruning indeed works as long as it is a one-shot run.
- The modern hardware is fast enough to walk through all the vnodes, again as
  long as it is a one-shot run.
- The ARC pruning and vnlru are the vnode maintainers, not the users.  They
  must guarantee fairness of vnode use to the true vnode users, namely the
  user processes and threads (and maybe the NFS server threads on a network
  file server).
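
(A rough sketch of the rate-limiting idea behind vfs.zfs.arc.prune_interval;
illustrative only, not the actual patch, and the variable and helper names
here are made up.)

/*
 * Illustrative sketch of the rate-limiting idea only; hypothetical names,
 * not the actual patch.  A prune request is dropped if the previous one
 * ran less than the configured interval (in milliseconds) ago.
 */
#include <stdbool.h>
#include <stdint.h>

static int arc_prune_interval_ms = 1000;	/* vfs.zfs.arc.prune_interval */
static uint64_t arc_prune_last_ms;

static bool
arc_prune_should_run(uint64_t now_ms)
{
	if (now_ms - arc_prune_last_ms < (uint64_t)arc_prune_interval_ms)
		return (false);		/* too soon; skip this request */
	arc_prune_last_ms = now_ms;
	return (true);
}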

After the current build, I will try vfs.vnode.vnlru.max_free_per_call=4000000.
This value is the same as vfs.vnode.param.limit, so there will be no limit upon
the ARC pruning workload except for the giveup condition.

--
You are receiving this mail because:
You are the assignee for the bug.


