From nobody Mon Dec 11 08:45:26 2023
From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Mon, 11 Dec 2023 08:45:26 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #10 from Seigo Tanimura ---
(In reply to Mark Johnston from comment #9)

> vnodes live on a global list, chained by v_vnodelist, and this list appears to be used purely for reclamation.

The free vnodes are indeed chained to vnode_list in sys/kern/vfs_subr.c, but "free" here means "not opened by any user process," i.e. a vnode whose vp->v_usecount has dropped to zero.

Apart from user processes, the kernel may use a "free" vnode for its own purposes.  In that case, the kernel "holds" the vnode via vhold(9), making vp->v_holdcnt > 0.  A vnode held by the kernel in this way cannot be recycled even if no user process has it open.

vnlru_free_impl() checks whether the vnode in question is held and skips recycling if so.  In the tests so far, I have seen that vnlru_free_impl() tends to skip many vnodes, especially during the late phase of "poudriere bulk".  The results and findings are shown at the end of this comment.

-----

> If arc_prune() is spending most of its time reclaiming tmpfs vnodes, then it does nothing to address its targets; it may as well do nothing.

Again, the mixed use of tmpfs and ZFS has actually turned out to be a rather minor problem.  Please refer to my findings.

Also, there are some easier workarounds that can be tried first, if this is really the issue (a rough sketch of both ideas follows this list):

- Test vp->v_mount->mnt_op before vp->v_holdcnt.  This should work for now because ZFS is the only filesystem that calls vnlru_free_vfsops() with a valid mnt_op.
- After a preconfigured number of consecutive skips, move the marker vnode to the restart point, release vnode_list_mtx and yield the CPU.  This already happens when a vnode is recycled, which may block.
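
A rough user-space sketch of the two ideas above.  This is NOT the actual
vnlru_free_impl() from sys/kern/vfs_subr.c; toy_vnode, toy_mount,
toy_vnlru_free and SKIP_YIELD_LIMIT are made-up names, and the marker-vnode
and vnode_list_mtx handling is omitted.

/*
 * Sketch only: iterate a simplified vnode list, doing the cheap
 * filesystem check before the hold-count check, and yielding the CPU
 * after a run of consecutive skips.
 */
#include <sched.h>
#include <stddef.h>

struct toy_mount {
    const void *mnt_op;             /* filesystem ops table */
};

struct toy_vnode {
    struct toy_vnode *v_next;       /* vnode_list analogue */
    struct toy_mount *v_mount;
    int v_holdcnt;
};

#define SKIP_YIELD_LIMIT 128        /* made-up tuning knob */

/* Try to recycle up to "count" vnodes belonging to "mnt_op" (NULL = any). */
int
toy_vnlru_free(struct toy_vnode *head, int count, const void *mnt_op)
{
    struct toy_vnode *vp;
    int freed = 0, skipped = 0;

    for (vp = head; vp != NULL && freed < count; vp = vp->v_next) {
        /*
         * Workaround 1: do the cheap filesystem match first, so most
         * candidates are rejected before the hold-count test.
         */
        if (mnt_op != NULL &&
            (vp->v_mount == NULL || vp->v_mount->mnt_op != mnt_op))
            goto skip;
        /* Phase 2: a held vnode cannot be recycled. */
        if (vp->v_holdcnt > 0)
            goto skip;
        /* The recycling itself is omitted in this sketch. */
        freed++;
        skipped = 0;
        continue;
skip:
        /*
         * Workaround 2: after too many consecutive skips, yield the CPU
         * (the real code would also park the marker vnode and release
         * vnode_list_mtx here).
         */
        if (++skipped >= SKIP_YIELD_LIMIT) {
            sched_yield();
            skipped = 0;
        }
    }
    return (freed);
}

The point is simply the ordering and the back-off; both should cut down the
time spent spinning over vnodes that can never be recycled on this pass.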

> Suppose that arc_prune is disabled outright.  How does your test fare?

Difficult to tell.  I am sure the ARC size would keep increasing at first, but I cannot tell whether it would eventually reach an equilibrium point thanks to the builder cleanup, or keep rising.

-----

In order to investigate the details of the held vnodes found in vnlru_free_impl(), I have conducted another test with some additional counters.

Source on GitHub:
- Repo: https://github.com/altimeter-130ft/freebsd-freebsd-src/tree/release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters
- Branch: release/14.0.0/release-14_0_0-p2-topic-openzfs-arc_prune-interval-counters

Test setup:
The same as "Ongoing test" in bug #275594, comment #6.
- vfs.vnode.vnlru.max_free_per_call: 4000000 (== vfs.vnode.vnlru.max_free_per_call)
- vfs.zfs.arc.prune_interval: 1000 (my fix enabled)

Build time: 06:32:57 (325 pkgs / hr)

Counters after completing the build, with some remarks:

# The iteration attempts in vnlru_free_impl().
# This includes the retries from the head of vnode_list.
vfs.vnode.free.free_attempt: 29695926809
# The number of vnodes recycled successfully, including vtryrecycle().
vfs.vnode.free.free_success: 30841748
# The number of iteration skips due to a held vnode ("phase 2" hereafter).
vfs.vnode.free.free_phase2_retry: 11909948307
# The number of phase 2 skips upon VREG (regular file) vnodes.
vfs.vnode.free.free_phase2_retry_reg: 7877197761
# The number of phase 2 skips upon VBAD (being recycled) vnodes.
vfs.vnode.free.free_phase2_retry_bad: 3101137010
# The number of phase 2 skips upon VDIR (directory) vnodes.
vfs.vnode.free.free_phase2_retry_dir: 899106296
# The number of phase 2 skips upon VNON (being created) vnodes.
vfs.vnode.free.free_phase2_retry_non: 2046379
# The number of phase 2 skips upon doomed (being destroyed) vnodes.
vfs.vnode.free.free_phase2_retry_doomed: 3101137196
# The number of iteration skips due to a filesystem mismatch ("phase 3" hereafter).
vfs.vnode.free.free_phase3_retry: 17755077891

Analysis and Findings:

Out of the ~30G iteration attempts in vnlru_free_impl(), ~12G failed in phase 2.  (The phase 3 failures amount to ~18G, but there are some workaround ideas shown above.)

Among the phase 2 failures, the most dominant vnode type is VREG.  For this type, I suspect the resident VM pages alive in the kernel: a VM object holds the backing vnode if the object has at least one page allocated out of it.  Please refer to vm_page_insert_after() and vm_page_insert_radixdone() for the implementation.  (A toy model of this relationship is sketched at the end of this comment.)

Technically, such vnodes can be recycled as long as the prerequisites checked in vtryrecycle() are met under the sufficient locks, and those prerequisites do not include the resident VM pages.  vnode_destroy_vobject(), called in vgonel(), takes care of those pages.  I suppose we have to do this if more work is required in vnlru_free_impl(), maybe during the retry after reaching the end of vnode_list.

The further fix above assumes that ZFS does the appropriate work to reduce the ARC size upon reclaiming a ZFS vnode.

The rest of the cases are either difficult or impossible to improve any further.  A VDIR vnode is held by the name cache to improve the path resolution performance, both forward and backward.  While the vnodes of this kind can be reclaimed somehow, a significant performance penalty is expected upon path resolution.

VBAD and VNON are actually states rather than types of vnodes.  Neither state is eligible for recycling by design.
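
To make the VREG case above more concrete, here is a minimal user-space toy
model of "a VM object holds its backing vnode while it has at least one
resident page".  toy_vnode, toy_vm_object, toy_page_insert and
toy_page_remove are made-up names; the real bookkeeping lives in
vm_page_insert_after(), vm_page_insert_radixdone() and
vnode_destroy_vobject().

/*
 * Toy model only: the hold on the backing vnode follows the first and
 * the last resident page of the VM object.
 */
#include <assert.h>
#include <stdio.h>

struct toy_vnode {
    int v_holdcnt;
};

struct toy_vm_object {
    struct toy_vnode *backing_vnode;
    int resident_page_count;
};

/* Analogue of the vhold(9) taken when the first page becomes resident. */
static void
toy_page_insert(struct toy_vm_object *obj)
{
    if (obj->resident_page_count++ == 0)
        obj->backing_vnode->v_holdcnt++;
}

/* In this model, the hold goes away only when the last page is removed. */
static void
toy_page_remove(struct toy_vm_object *obj)
{
    assert(obj->resident_page_count > 0);
    if (--obj->resident_page_count == 0)
        obj->backing_vnode->v_holdcnt--;
}

int
main(void)
{
    struct toy_vnode vp = { .v_holdcnt = 0 };
    struct toy_vm_object obj = { .backing_vnode = &vp, .resident_page_count = 0 };

    toy_page_insert(&obj);
    printf("holdcnt with one resident page: %d\n", vp.v_holdcnt);      /* 1 */
    toy_page_remove(&obj);
    printf("holdcnt after the last page is gone: %d\n", vp.v_holdcnt); /* 0 */
    return (0);
}

While v_holdcnt stays above zero, phase 2 in vnlru_free_impl() keeps skipping
the vnode, which is consistent with the large free_phase2_retry_reg count
above.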