Date: Wed, 27 Dec 2023 10:31:56 +0000
From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Message-ID: <bug-275594-3630-Lbf73IbBEE@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>
References: <bug-275594-3630@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #18 from Seigo Tanimura <seigo.tanimura@gmail.com> ---
(In reply to Seigo Tanimura from comment #16)

* The results of the comparison between the estimated ZFS open files and
kern.openfiles

Test Summary:
- Date: 26 Dec 2023 00:50Z - 26 Dec 2023 06:42Z
- Build time: 06:41:18 (319 pkgs / hr)
- Failed ports: 4
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000
    (== vfs.vnode.vnlru.max_free_per_call)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for the arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit=2684354560
    (2.5G, larger than the max actual value observed so far)

Results:

* Estimated ZFS open files

          | (A)                        | (B)                          | (C)
          |                            | Phase 2 regular file retries |
          |                            | (Estimated ZFS open files    | ZFS open files
 UTC Time | Vnode free call period [s] | seen by vnlru_free_impl())   | estimated by kern.openfiles
==========+============================+==============================+=============================
 02:00Z   | 1.27                       | 354                          | 491
----------+----------------------------+------------------------------+-----------------------------
 03:00Z   | 1.32                       | 411                          | 439
----------+----------------------------+------------------------------+-----------------------------
 04:00Z   | 1.35                       | 477                          | 425
----------+----------------------------+------------------------------+-----------------------------
 05:00Z   | 1.69                       | 193                          | 242
----------+----------------------------+------------------------------+-----------------------------
 06:00Z   | 1.88                       | 702                          | 232
----------+----------------------------+------------------------------+-----------------------------
 07:00Z   | 1.54                       | 299                          | 237

where
(A): 1 / ((vnode free calls) / (5 * 60))
     (5 * 60) is the time granularity on the chart in seconds.  This applies
     to (B) as well.
(B): (number of retries) / (5 * 60) * (A)
(C): 0.7 * (kern.openfiles value)
     0.7 is the observed general ratio of the ZFS vnodes in the kernel
     (bug #275594 comment #16).

* Chart archive: poudriere-bulk-2023-12-26_09h50m17s.7z
* Charts: zfs-vnode-free-calls.png, zfs-vnode-recycle-phase2-reg-retries.png,
  kernel-open-files.png.

(B) and (C) sometimes match on the most significant figure, and do not at
other times.  Out of these results, I understand that the unrecyclable ZFS
vnodes are caused by opening them in an indirect way.  The detail of the
"indirect" way is discussed next.
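For reference, the arithmetic behind (A) - (C) can be written down as a small
userland C sketch.  The helper names and the sample values in main() are
made-up placeholders of mine, not numbers taken from the charts above.

/*
 * A minimal sketch of the estimation formulas (A) - (C).
 * The sample values in main() are placeholders, not chart data.
 */
#include <stdio.h>

#define SAMPLE_PERIOD_SEC	(5 * 60)	/* time granularity of the charts */
#define ZFS_VNODE_RATIO		0.7		/* observed ratio of ZFS vnodes */

/* (A): average period of the vnode free calls in one sample, in seconds. */
static double
free_call_period(double free_calls)
{
	return (1.0 / (free_calls / SAMPLE_PERIOD_SEC));
}

/*
 * (B): retries per vnode free call, i.e. the ZFS open files estimated from
 * the phase 2 regular file retries seen by vnlru_free_impl().
 */
static double
est_open_files_vnlru(double retries, double period)
{
	return (retries / SAMPLE_PERIOD_SEC * period);
}

/* (C): ZFS open files estimated out of kern.openfiles. */
static double
est_open_files_kern(double openfiles)
{
	return (ZFS_VNODE_RATIO * openfiles);
}

int
main(void)
{
	/* Placeholder sample, not taken from the charts. */
	double calls = 240.0, retries = 80000.0, openfiles = 700.0;
	double a = free_call_period(calls);

	printf("(A) = %.2f s, (B) = %.0f, (C) = %.0f\n",
	    a, est_open_files_vnlru(retries, a), est_open_files_kern(openfiles));
	return (0);
}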
-----

* The ZFS vnodes in use by nullfs(5)

nullfs(5), involved in my poudriere jail setup, is now suspected as the
source of the unrecyclable ZFS vnodes.

My poudriere setup uses "-m null" on the poudriere jails:

> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # poudriere jail -l
> release-13_2_0 13.2-RELEASE amd64 null 2023-04-13 03:14:26 /home/poudriere.jailroot/release-13.2.0
> release-14_0_0 14.0-RELEASE amd64 null 2023-11-23 15:14:17 /home/poudriere.jailroot/release-14.0.0
> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head #

Under this setup, poudriere-bulk(8) mounts the jail filesystems onto each
builder by nullfs(5).

A nullfs(5) vnode adds one v_usecount to the lower vnode (asserted in
null_nodeget()) so that the pointer to the lower vnode does not dangle.  This
reference lasts even after the nullfs(5) vnode is inactivated and put onto
the free list, until the nullfs(5) vnode gets reclaimed.

The nullfs(5) design above explains the results of the estimation upon the
unrecyclable ZFS vnodes: the more files the builders open in ZFS via
nullfs(5), the more unrecyclable ZFS vnodes are made.  The estimation still
carries some error, however, because multiple builders can open the same
ZFS file.

The massive free of the vnodes after the build is also explained by the
nullfs(5) design.  The cleanup of the builder filesystems dismisses a lot of
nullfs(5) vnodes, which, in turn, drops the v_usecount of the lower ZFS
vnodes so that they can be evicted.

-----

The finding above introduces a new question: should the ZFS vnodes used by
nullfs(5) be recycled?

My answer is no.  The major hurdle is the search of the vnode stacking
links.  They essentially form a tree with the ZFS (or any non-nullfs(5))
vnode as the root, spanning multiple nullfs(5) vnode leaves and depth
levels.  The search is likely to be even more complex than the linear scan
of the vnode list.

In addition, all vnodes in the tree must be recyclable for the ZFS vnode at
the tree root to be recyclable as well.  This is likely to put a complex
dependency on the ZFS vnode recycling.

-----

My investigation so far, including this one, has proven that it costs too
much to scan over all vnodes without any positive estimation in advance.  We
need a way to check whether an ARC pruning pass will yield a fruitful
result, in a way much cheaper than the vnode scan.

It may be good to account for the number of the ZFS vnodes not in use.
Before starting an ARC pruning, we can check that count and defer the
pruning if it is too low.  This has already been implemented in
arc_evict_impl() for the eviction of the ARC data and metadata by checking
the evictable size; the ARC data and metadata eviction is skipped if there
are zero evictable bytes.
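To make the accounting idea concrete, here is a minimal userland-style C
sketch.  The counter, the tunable and the function names (zfs_vnodes_unused,
zfs_prune_min_unused, zfs_vnode_unused_inc()/zfs_vnode_unused_dec(),
zfs_prune_worthwhile()) are hypothetical names of mine, not existing OpenZFS
code; a real implementation would hook the ZFS vnode lifecycle and the
arc_prune dispatch path and use the kernel atomic or counter primitives
instead of C11 atomics.

/*
 * Hypothetical sketch: account the unused ZFS vnodes and gate arc_prune.
 * Userland C11 atomics stand in for the kernel primitives.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* ZFS vnodes with v_usecount == 0, i.e. what a prune pass could reclaim. */
static _Atomic unsigned long zfs_vnodes_unused;

/* Hypothetical tunable: do not prune for fewer unused vnodes than this. */
static unsigned long zfs_prune_min_unused = 64;

/* Hook: a ZFS vnode becomes unused (e.g. upon inactivation). */
static void
zfs_vnode_unused_inc(void)
{
	atomic_fetch_add_explicit(&zfs_vnodes_unused, 1, memory_order_relaxed);
}

/* Hook: an unused ZFS vnode gets reused or reclaimed. */
static void
zfs_vnode_unused_dec(void)
{
	atomic_fetch_sub_explicit(&zfs_vnodes_unused, 1, memory_order_relaxed);
}

/*
 * Gate before dispatching an ARC prune, analogous to arc_evict_impl()
 * skipping an eviction when the evictable size is zero.
 */
static bool
zfs_prune_worthwhile(void)
{
	return (atomic_load_explicit(&zfs_vnodes_unused,
	    memory_order_relaxed) >= zfs_prune_min_unused);
}

int
main(void)
{
	/* Pretend 100 ZFS vnodes went unused, then 20 were reclaimed. */
	for (int i = 0; i < 100; i++)
		zfs_vnode_unused_inc();
	for (int i = 0; i < 20; i++)
		zfs_vnode_unused_dec();
	/* 80 left unused, above the threshold of 64. */
	printf("prune worthwhile: %s\n", zfs_prune_worthwhile() ? "yes" : "no");
	return (0);
}

The point of the gate is that it is an O(1) check, unlike the linear scan
over the vnode list.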
* My next work

Figure out the requirement and design of the accounting above.

--
You are receiving this mail because:
You are the assignee for the bug.