Date:      Mon, 11 Dec 2023 08:45:26 +0000
From:      bugzilla-noreply@freebsd.org
To:        fs@FreeBSD.org
Subject:   [Bug 275594] High CPU usage by arc_prune; analysis and fix
Message-ID:  <bug-275594-3630-hq1UZRLc2l@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>
References:  <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #10 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c, but
this "free" means "not opened by any user process," ie vp->v_usecount == 0.

Besides the user processes, the kernel may use a "free" vnode for its own
purposes.  In such a case, the kernel "holds" the vnode by vhold(9), making
vp->v_holdcnt > 0.  A vnode held by the kernel in this way cannot be recycled
even if it is not opened by any user process.

vnlru_free_impl() checks whether the vnode in question is held, and skips
recycling if so.  In the tests so far, I have seen that vnlru_free_impl()
tends to skip many vnodes, especially during the late phase of "poudriere
bulk".  The results and findings are shown at the end of this comment.

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be a rather
minor problem.  Please refer to my findings below.

Also, there are some easier workarounds that can be tried first, if this is
really the issue (a sketch of both follows the list):

- Perform the test of vp->v_mount->mnt_op before the test of vp->v_holdcnt.
This should work for now because ZFS is the only filesystem that calls
vnlru_free_vfsops() with a valid mnt_op.
- After a preconfigured number of consecutive skips, move the marker vnode to
the restart point, release vnode_list_mtx and yield the CPU.  This already
happens when a vnode is recycled, which may block.
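
A sketch of both workarounds, on the same made-up model as above.
MAX_CONSECUTIVE_SKIPS and model_yield() are hypothetical names; the real code
would also have to move the marker vnode and drop vnode_list_mtx before
yielding.

#define MAX_CONSECUTIVE_SKIPS   1024    /* hypothetical tunable */

void    model_yield(void);              /* stands in for "release vnode_list_mtx and yield" */

static int
model_vnlru_free_workarounds(struct fake_vnode *head, int count, const void *mnt_op)
{
        struct fake_vnode *vp;
        int freed = 0, skips = 0;

        for (vp = head; vp != NULL && freed < count; vp = vp->v_next) {
                /* Workaround 1: do the cheap filesystem test before the holdcnt test. */
                if (mnt_op != NULL && vp->v_mnt_op != mnt_op)
                        goto skip;
                if (vp->v_holdcnt > 0)
                        goto skip;
                /* The real code would vtryrecycle() here. */
                freed++;
                skips = 0;
                continue;
skip:
                /* Workaround 2: back off after too many consecutive skips. */
                if (++skips >= MAX_CONSECUTIVE_SKIPS) {
                        model_yield();
                        skips = 0;
                }
        }
        return (freed);
}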

> Suppose that arc_prune is disabled outright.  How does your test fare?

Difficult to tell.  I am sure the ARC size would keep increasing at first,
but I cannot tell whether it would eventually reach an equilibrium point
thanks to the builder cleanup or keep rising.

-----

In order to investigate the details of the held vnodes found in
vnlru_free_impl(), I have conducted another test with some additional
counters.
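
For reference, the sketch below shows roughly how such a counter can be
exported with counter(9) and sysctl(9).  It is a minimal sketch with
illustrative names, not the actual patch on the branch below; the counter is
assumed to be allocated with counter_u64_alloc(M_WAITOK) during
initialization.

#include <sys/param.h>
#include <sys/counter.h>
#include <sys/sysctl.h>

SYSCTL_DECL(_vfs_vnode);                /* the existing vfs.vnode tree */
static SYSCTL_NODE(_vfs_vnode, OID_AUTO, free, CTLFLAG_RD | CTLFLAG_MPSAFE, 0,
    "vnode free statistics");

static counter_u64_t free_phase2_retry;
SYSCTL_COUNTER_U64(_vfs_vnode_free, OID_AUTO, free_phase2_retry, CTLFLAG_RD,
    &free_phase2_retry, "Iteration skips due to a held vnode");

/* In the phase 2 skip path of vnlru_free_impl(): */
/* counter_u64_add(free_phase2_retry, 1); */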

Source on GitHub:
- Repo:
https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch:
release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.

- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time:
06:32:57 (325 pkgs / hr)

Counters after completing the build, with some remarks:
# The iteration attempts in vnlru_free_impl().
# This includes the retry from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809

# The number of the vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 30841748

# The number of the iteration skips due to a held vnode. ("phase 2" hereafter)
vfs.vnode.free.free_phase2_retry: 11909948307

# The number of the phase 2 skips upon the VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761

# The number of the phase 2 skips upon the VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010

# The number of the phase 2 skips upon the VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296

# The number of the phase 2 skips upon the VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379

# The number of the phase 2 skips upon the doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196

# The number of the iteration skips due to the filesystem mismatch. ("phase 3" hereafter)
vfs.vnode.free.free_phase3_retry: 17755077891

Analysis and Findings:
Out of ~30G iteration attempts in vnlru_free_impl(), ~12G failed in phase 2.
(The phase 3 failures amount to ~18G, but there are some workaround ideas
shown above.)

Among the phase 2 failures, the most dominant vnode type is VREG.  For this
type, I suspect the resident VM pages alive in the kernel; a VM object holds
the backing vnode if the object has at least one page allocated out of it.
Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the
implementation.
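
A simplified sketch of that relation, again on the made-up model from earlier
in this comment and not the actual vm_page.c code: the object takes a hold on
its backing vnode when its first page becomes resident, and that hold is
exactly what makes vnlru_free_impl() skip the vnode in phase 2.

struct fake_vm_object {
        int                     resident_page_count;
        struct fake_vnode       *backing_vnode; /* NULL unless vnode-backed */
};

static void
model_page_insert(struct fake_vm_object *obj)
{
        /* First resident page: hold the backing vnode, as vhold(9) would. */
        if (obj->resident_page_count++ == 0 && obj->backing_vnode != NULL)
                obj->backing_vnode->v_holdcnt++;
}

static void
model_page_remove(struct fake_vm_object *obj)
{
        /* Last resident page gone: the hold can be dropped again. */
        if (--obj->resident_page_count == 0 && obj->backing_vnode != NULL)
                obj->backing_vnode->v_holdcnt--;
}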

Technically, such vnodes can be recycled as long as the prerequisites checked
in vtryrecycle() are met under sufficient locks, and those prerequisites do
not include the resident VM pages.  vnode_destroy_vobject(), called in
vgonel(), takes care of those pages.  I suppose we have to do this if more
work is required on vnlru_free_impl(), maybe during the retry after reaching
the end of vnode_list.

The further fix above assumes that ZFS does the appropriate work to reduce
the ARC size upon reclaiming a ZFS vnode.

The rest of the cases are either difficult or impossible for any further
work.

A VDIR vnode is held by the name cache to improve the path resolution
performance, both forward and backward.  While the vnodes of this kind can be
reclaimed somehow, a significant performance penalty upon the path resolution
is expected.

VBAD and VNON are actually states rather than types of the vnodes.  Neither
state is eligible for recycling, by design.

--
You are receiving this mail because:
You are the assignee for the bug.


