From: bugzilla-noreply@freebsd.org
To: fs@FreeBSD.org
Subject: [Bug 275594] High CPU usage by arc_prune; analysis and fix
Date: Wed, 27 Dec 2023 10:31:56 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275594

--- Comment #18 from Seigo Tanimura ---

(In reply to Seigo Tanimura from comment #16)

* The results of the comparison between the estimated ZFS open files and
kern.openfiles

Test Summary:
- Date: 26 Dec 2023 00:50Z - 26 Dec 2023 06:42Z
- Build time: 06:41:18 (319 pkgs / hr)
- Failed ports: 4
- Setup
  - vfs.vnode.vnlru.max_free_per_call: 4000000
    (== vfs.vnode.vnlru.max_free_per_call)
  - vfs.zfs.arc.prune_interval: 1000 (my fix for arc_prune interval enabled)
  - vfs.vnode.vnlru.extra_recycle: 1 (extra vnode recycle fix enabled)
  - vfs.zfs.arc.dnode_limit=2684354560 (2.5G, larger than the max actual
    value observed so far)

Results:

* Estimated ZFS open files

         | (A)                        | (B)                          | (C)
         |                            | Phase 2 regular file retries |
         |                            | (Estimated ZFS open files    | ZFS open files
UTC Time | Vnode free call period [s] | seen by vnlru_free_impl())   | estimated by kern.openfiles
=========+============================+==============================+=============================
02:00Z   | 1.27                       | 354                          | 491
---------+----------------------------+------------------------------+-----------------------------
03:00Z   | 1.32                       | 411                          | 439
---------+----------------------------+------------------------------+-----------------------------
04:00Z   | 1.35                       | 477                          | 425
---------+----------------------------+------------------------------+-----------------------------
05:00Z   | 1.69                       | 193                          | 242
---------+----------------------------+------------------------------+-----------------------------
06:00Z   | 1.88                       | 702                          | 232
---------+----------------------------+------------------------------+-----------------------------
07:00Z   | 1.54                       | 299                          | 237

where
(A): 1 / ((vnode free calls) / (5 * 60))
     (5 * 60) is the time granularity on the chart in seconds.  This applies
     to (B) as well.
(B): (number of retries) / (5 * 60) * (A)
(C): 0.7 * (kern.openfiles value)
     0.7 is the observed general ratio of the ZFS vnodes in the kernel.
     (bug #275594 comment #16)
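As a side note, the arithmetic behind (A) - (C) can be illustrated with a
small userland snippet.  This is only a sketch: GRANULARITY, free_calls,
retries and openfiles are names of mine, not kernel counters, and the input
values are hypothetical, merely chosen to roughly reproduce the 02:00Z row
above.

  #include <stdio.h>

  #define GRANULARITY (5 * 60)	/* chart time granularity in seconds */

  int
  main(void)
  {
  	/* Hypothetical per-5-minute chart readings (roughly 02:00Z). */
  	double free_calls = 236.0;	/* vnode free calls */
  	double retries = 83600.0;	/* phase 2 regular file retries */
  	double openfiles = 701.0;	/* kern.openfiles snapshot */

  	double a = 1.0 / (free_calls / GRANULARITY);	/* (A) period [s] */
  	double b = retries / GRANULARITY * a;	/* (B) est. ZFS open files */
  	double c = 0.7 * openfiles;		/* (C) from kern.openfiles */

  	printf("(A) %.2f s  (B) %.0f files  (C) %.0f files\n", a, b, c);
  	return (0);
  }

With the values above this prints (A) 1.27 s, (B) 354 files, (C) 491 files,
matching the first table row.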
* Chart archive: poudriere-bulk-2023-12-26_09h50m17s.7z
* Charts: zfs-vnode-free-calls.png, zfs-vnode-recycle-phase2-reg-retries.png,
kernel-open-files.png.

(B) and (C) sometimes match on the most significant figure, and sometimes
do not.

From these results, I understand that the unrecyclable ZFS vnodes are held
open in an indirect way.  The details of that "indirect" way are discussed
next.

-----

* The ZFS vnodes in use by nullfs(5)

The nullfs(5) mounts in my poudriere jail setup are now suspected of holding
the unrecyclable ZFS vnodes.  My poudriere setup uses "-m null" on the
poudriere jails.

> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head # poudriere jail -l
> release-13_2_0 13.2-RELEASE amd64 null 2023-04-13 03:14:26 /home/poudriere.jailroot/release-13.2.0
> release-14_0_0 14.0-RELEASE amd64 null 2023-11-23 15:14:17 /home/poudriere.jailroot/release-14.0.0
> root@pkgfactory2:/home/pkgfactory2/tanimura/work/freebsd-git/ports-head #

Under this setup, poudriere-bulk(8) mounts the jail filesystems onto each
builder via nullfs(5).  A nullfs(5) vnode adds one v_usecount to the lower
vnode (asserted in null_nodeget()) so that the pointer to the lower vnode
does not dangle.  This reference lasts even after the nullfs(5) vnode is
inactivated and put onto the free list, until the nullfs(5) vnode gets
reclaimed.

The nullfs(5) design above explains the estimation results for the
unrecyclable ZFS vnodes: the more ZFS files the builders open via nullfs(5),
the more unrecyclable ZFS vnodes are created.  The estimation has some
error, however, because multiple builders can open the same ZFS file.

The massive freeing of the vnodes after the build is also explained by the
nullfs(5) design.  The cleanup of the builder filesystems reclaims many
nullfs(5) vnodes, which, in turn, drops the v_usecount of the lower ZFS
vnodes so that they can be evicted.

-----

The finding above introduces a new question: should the ZFS vnodes used by
nullfs(5) be recycled?

My answer is no.  The major hurdle is the search of the vnode stacking
links.  They essentially form a tree with the ZFS (or any non-nullfs(5))
vnode as the root, spanning multiple nullfs(5) vnode leaves and depth
levels.  The search is likely to be even more complex than the linear scan
of the vnode list.  In addition, all vnodes in the tree must be recyclable
for the ZFS vnode at the tree root to be recyclable as well.  This is
likely to impose a complex dependency on the ZFS vnode recycling.

-----

My investigation so far, including this one, has shown that it costs too
much to scan all vnodes without any positive estimate in advance.  We need
a way, much cheaper than the vnode scan, to check whether an ARC prune will
yield a fruitful result.

It may be good to account for the number of the ZFS vnodes not in use.
Before starting an ARC prune, we can check that count and defer the pruning
if it is too low.  A similar check has already been implemented in
arc_evict_impl() for the eviction of the ARC data and metadata: the
eviction is skipped if there are zero evictable bytes.  A sketch of the
idea follows.
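For illustration only, here is a minimal sketch of such a deferral check,
modeled on the zero-evictable-bytes skip in arc_evict_impl().  The counter
zfs_vnodes_unused, the tunable zfs_arc_prune_min_vnodes and the function
arc_prune_worthwhile() are hypothetical names of mine, not existing OpenZFS
code; the real accounting would have to be maintained wherever the ZFS
vnode use counts change.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Hypothetical counter: ZFS vnodes currently not in use. */
  static uint64_t zfs_vnodes_unused;

  /* Hypothetical tunable: the minimum count worth an ARC prune pass. */
  static uint64_t zfs_arc_prune_min_vnodes = 64;

  /*
   * Return true if an ARC prune is likely to recycle enough vnodes to
   * be worth the scan, mirroring the way arc_evict_impl() skips the
   * eviction when the evictable size is zero.
   */
  static bool
  arc_prune_worthwhile(void)
  {
  	return (zfs_vnodes_unused >= zfs_arc_prune_min_vnodes);
  }

  int
  main(void)
  {
  	zfs_vnodes_unused = 10;	/* pretend only 10 unused ZFS vnodes */
  	printf("prune worthwhile: %s\n",
  	    arc_prune_worthwhile() ? "yes" : "no");
  	return (0);
  }

The prune dispatch path would then simply return early when the check
fails, instead of queueing a prune that scans the whole vnode list for
nothing.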
* My next work

Figure out the requirement and design of the accounting above.

-- 
You are receiving this mail because:
You are the assignee for the bug.