Date: Mon, 11 Dec 2023 08:45:26 +0000
From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Message-ID: <bug-275594-3630-hq1UZRLc2l@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>
References: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #10 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list
> appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c,
but this "free" means "not opened by any user process," i.e.
vp->v_usecount == 0.

Besides the user processes, the kernel may use a "free" vnode for its own
purposes.  In such a case, the kernel "holds" the vnode via vhold(9),
making vp->v_holdcnt > 0.  A vnode held by the kernel in this way cannot
be recycled even if it is not opened by any user process.

vnlru_free_impl() checks whether the vnode in question is held, and skips
the recycling if so.  In the tests so far, I have seen that
vnlru_free_impl() tends to skip many vnodes, especially during the late
phase of "poudriere bulk".  The results and findings are shown at the end
of this comment.

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes,
> then it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be a
rather minor problem.  Please refer to my findings below.

Also, there are some easier workarounds that can be tried first, if this
is really the issue:

- Perform the test of vp->v_mount->mnt_op before vp->v_holdcnt.  This
  should work for now because ZFS is the only filesystem that calls
  vnlru_free_vfsops() with a valid mnt_op.  (A sketch of this reordering
  appears at the end of this comment.)
- After a preconfigured number of consecutive skips, move the marker
  vnode to the restart point, release vnode_list_mtx and yield the CPU.
  This is what already happens when a vnode is recycled, which may block.

> Suppose that arc_prune is disabled outright.  How does your test fare?

Difficult to tell.  I am sure the ARC size would keep increasing at
first, but I cannot tell whether it would eventually settle at an
equilibrium point thanks to the builder cleanup, or keep rising.

-----

In order to investigate the details of the held vnodes found in
vnlru_free_impl(), I have conducted another test with some additional
counters.

Source on GitHub:
- Repo:
  https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch:
  release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.
- vfs.vnode.vnlru.max_free_per_call: 4000000
  (== vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time: 06:32:57 (325 pkgs / hr)
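For reference, the additional counters can be wired up roughly as in the
sketch below, using counter(9) and sysctl(9).  This is a minimal sketch
and not the actual patch on the branch above: only two of the counters
are shown, and vnlru_free_count_phase2() is a hypothetical hook point
rather than a function that exists in sys/kern/vfs_subr.c.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/sysctl.h>
#include <sys/vnode.h>

/* Sketch: assumes the existing vfs.vnode sysctl node is visible here. */
SYSCTL_DECL(_vfs_vnode);
static SYSCTL_NODE(_vfs_vnode, OID_AUTO, free, CTLFLAG_RD | CTLFLAG_MPSAFE,
    NULL, "vnlru_free_impl() statistics");

static COUNTER_U64_DEFINE_EARLY(free_attempt);
SYSCTL_COUNTER_U64(_vfs_vnode_free, OID_AUTO, free_attempt, CTLFLAG_RD,
    &free_attempt, "Iteration attempts in vnlru_free_impl()");

static COUNTER_U64_DEFINE_EARLY(free_phase2_retry);
SYSCTL_COUNTER_U64(_vfs_vnode_free, OID_AUTO, free_phase2_retry, CTLFLAG_RD,
    &free_phase2_retry, "Iterations skipped due to a held vnode (phase 2)");

/*
 * Hypothetical hook called for every vnode visited by the
 * vnlru_free_impl() loop: count the attempt, and count a phase 2 skip
 * when the vnode is held and therefore cannot be recycled.
 */
static bool
vnlru_free_count_phase2(struct vnode *vp)
{
	counter_u64_add(free_attempt, 1);
	if (vp->v_holdcnt > 0) {
		counter_u64_add(free_phase2_retry, 1);
		return (true);		/* skip this vnode */
	}
	return (false);
}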
Counters after completing the build, with some remarks:

# The iteration attempts in vnlru_free_impl().
# This includes the retries from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809
# The number of vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 30841748
# The number of iteration skips due to a held vnode. ("phase 2" hereafter)
vfs.vnode.free.free_phase2_retry: 11909948307
# The number of phase 2 skips upon VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761
# The number of phase 2 skips upon VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010
# The number of phase 2 skips upon VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296
# The number of phase 2 skips upon VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379
# The number of phase 2 skips upon doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196
# The number of iteration skips due to a filesystem mismatch.
# ("phase 3" hereafter)
vfs.vnode.free.free_phase3_retry: 17755077891

Analysis and Findings:

Out of ~30G iteration attempts in vnlru_free_impl(), ~12G failed in
phase 2.  (The phase 3 failures amount to ~18G, but there are some
workaround ideas shown above.)

Among the phase 2 failures, the most dominant vnode type is VREG.  For
this type, I suspect the resident VM pages alive in the kernel; a VM
object holds its backing vnode as long as the object has at least one
page allocated out of it.  Please refer to vm_page_insert_after() and
vm_page_insert_radixdone() for the implementation.

Technically, such vnodes can be recycled as long as the prerequisites
checked in vtryrecycle() are met under sufficient locks, and those
prerequisites do not include the resident VM pages.
vnode_destroy_vobject(), called in vgonel(), takes care of those pages.
I suppose we would have to do this if more work is required in
vnlru_free_impl(), maybe during the retry after reaching the end of
vnode_list.

The further fix above assumes that ZFS does the appropriate work to
reduce the ARC size upon reclaiming a ZFS vnode.

The rest of the cases are either difficult or impossible to improve any
further.  A VDIR vnode is held by the name cache to improve the path
resolution performance, both forward and backward.  While the vnodes of
this kind can be reclaimed somehow, a significant performance penalty is
expected upon path resolution.

VBAD and VNON are actually states rather than types of vnodes.  Neither
state is eligible for recycling by design.
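As for the first workaround above (testing vp->v_mount->mnt_op before
vp->v_holdcnt), a minimal sketch of the reordered tests follows.
vnlru_free_should_try() is a hypothetical helper and heavily simplified;
the marker handling, vnode_list_mtx and the recycling itself in
vnlru_free_impl() are all omitted, and only the ordering of the phase 3
(filesystem) and phase 2 (hold count) tests is the point here.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mount.h>
#include <sys/vnode.h>

static bool
vnlru_free_should_try(struct vnode *vp, struct vfsops *mnt_op)
{
	/*
	 * Phase 3 first: with a non-NULL mnt_op (ZFS via
	 * vnlru_free_vfsops()), vnodes of other filesystems are skipped
	 * cheaply before the hold count is even looked at.
	 */
	if (mnt_op != NULL &&
	    (vp->v_mount == NULL || vp->v_mount->mnt_op != mnt_op))
		return (false);

	/* Phase 2: a held vnode cannot be recycled. */
	if (vp->v_holdcnt > 0)
		return (false);

	return (true);
}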
--
You are receiving this mail because:
You are the assignee for the bug.