Date: Sat, 14 Oct 2023 01:08:53 GMT
Message-Id: <202310140108.39E18rDC052449@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org, dev-commits-src-branches@FreeBSD.org
From: Mateusz Guzik
Subject: git: 899b59500d24 - releng/14.0 - vfs: prefix regular vnlru with a special case for free vnodes
List-Id: Commit messages for all branches of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-all
Sender: owner-dev-commits-src-all@freebsd.org
X-BeenThere: dev-commits-src-all@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: mjg
X-Git-Repository: src
X-Git-Refname: refs/heads/releng/14.0
X-Git-Reftype: branch
X-Git-Commit: 899b59500d24f81a8b84751df225ecdb323a68a0
Auto-Submitted: auto-generated

The branch releng/14.0 has been updated by mjg:

URL: https://cgit.FreeBSD.org/src/commit/?id=899b59500d24f81a8b84751df225ecdb323a68a0

commit 899b59500d24f81a8b84751df225ecdb323a68a0
Author:     Mateusz Guzik
AuthorDate: 2023-09-14 19:08:40 +0000
Commit:     Mateusz Guzik
CommitDate: 2023-10-14 01:07:34 +0000

    vfs: prefix regular vnlru with a special case for free vnodes

    Works around severe performance problems in certain corner cases, see
    the commentary added.

    Modifying vnlru logic has proven rather error prone in the past and a
    release is near, thus take the easy way out and fix it without having to
    dig into the current machinery.

    (cherry picked from commit 90a008e94bb205e5b8f3c41d57e155b59a6be95d)
    (cherry picked from commit cfbc3927613a8455498db5ff3e67046a4dab15a1)

    Approved by: re (gjb)
---
 sys/kern/vfs_subr.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 113 insertions(+), 4 deletions(-)

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 5e39a149ef36..5834feff080c 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -1427,6 +1427,14 @@ vnlru_free_locked(int count)
 	return (ret);
 }
 
+static int
+vnlru_free(int count)
+{
+
+	mtx_lock(&vnode_list_mtx);
+	return (vnlru_free_locked(count));
+}
+
 void
 vnlru_free_vfsops(int count, struct vfsops *mnt_op, struct vnode *mvp)
 {
@@ -1594,6 +1602,106 @@ vnlru_kick_cond(void)
 	mtx_unlock(&vnode_list_mtx);
 }
 
+static void
+vnlru_proc_sleep(void)
+{
+
+	if (vnlruproc_sig) {
+		vnlruproc_sig = 0;
+		wakeup(&vnlruproc_sig);
+	}
+	msleep(vnlruproc, &vnode_list_mtx, PVFS|PDROP, "vlruwt", hz);
+}
+
+/*
+ * A lighter version of the machinery below.
+ *
+ * Tries to reach goals only by recycling free vnodes and does not invoke
+ * uma_reclaim(UMA_RECLAIM_DRAIN).
+ *
+ * This works around pathological behavior in vnlru in presence of tons of free
+ * vnodes, but without having to rewrite the machinery at this time. Said
+ * behavior boils down to continuously trying to reclaim all kinds of vnodes
+ * (cycling through all levels of "force") when the count is transiently above
+ * limit. This happens a lot when all vnodes are used up and vn_alloc
+ * speculatively increments the counter.
+ *
+ * Sample testcase: vnode limit 8388608, 20 separate directory trees each with
+ * 1 million files in total and 20 find(1) processes stating them in parallel
+ * (one per each tree).
+ *
+ * On a kernel with only stock machinery this needs anywhere between 60 and 120
+ * seconds to execute (time varies *wildly* between runs). With the workaround
+ * it consistently stays around 20 seconds.
+ *
+ * That is to say the entire thing needs a fundamental redesign (most notably
+ * to accommodate faster recycling), the above only tries to get it ouf the way.
+ *
+ * Return values are:
+ * -1 -- fallback to regular vnlru loop
+ * 0 -- do nothing, go to sleep
+ * >0 -- recycle this many vnodes
+ */
+static long
+vnlru_proc_light_pick(void)
+{
+	u_long rnumvnodes, rfreevnodes;
+
+	if (vstir || vnlruproc_sig == 1)
+		return (-1);
+
+	rnumvnodes = atomic_load_long(&numvnodes);
+	rfreevnodes = vnlru_read_freevnodes();
+
+	/*
+	 * vnode limit might have changed and now we may be at a significant
+	 * excess. Bail if we can't sort it out with free vnodes.
+	 */
+	if (rnumvnodes > desiredvnodes) {
+		if (rnumvnodes - rfreevnodes >= desiredvnodes ||
+		    rfreevnodes <= wantfreevnodes) {
+			return (-1);
+		}
+
+		return (rnumvnodes - desiredvnodes);
+	}
+
+	/*
+	 * Don't try to reach wantfreevnodes target if there are too few vnodes
+	 * to begin with.
+	 */
+	if (rnumvnodes < wantfreevnodes) {
+		return (0);
+	}
+
+	if (rfreevnodes < wantfreevnodes) {
+		return (-1);
+	}
+
+	return (0);
+}
+
+static bool
+vnlru_proc_light(void)
+{
+	long freecount;
+
+	mtx_assert(&vnode_list_mtx, MA_NOTOWNED);
+
+	freecount = vnlru_proc_light_pick();
+	if (freecount == -1)
+		return (false);
+
+	if (freecount != 0) {
+		vnlru_free(freecount);
+	}
+
+	mtx_lock(&vnode_list_mtx);
+	vnlru_proc_sleep();
+	mtx_assert(&vnode_list_mtx, MA_NOTOWNED);
+	return (true);
+}
+
 static void
 vnlru_proc(void)
 {
@@ -1609,6 +1717,10 @@ vnlru_proc(void)
 	want_reread = false;
 	for (;;) {
 		kproc_suspend_check(vnlruproc);
+
+		if (force == 0 && vnlru_proc_light())
+			continue;
+
 		mtx_lock(&vnode_list_mtx);
 		rnumvnodes = atomic_load_long(&numvnodes);
 
@@ -1639,10 +1751,7 @@ vnlru_proc(void)
 			vstir = false;
 		}
 		if (force == 0 && !vnlru_under(rnumvnodes, vlowat)) {
-			vnlruproc_sig = 0;
-			wakeup(&vnlruproc_sig);
-			msleep(vnlruproc, &vnode_list_mtx,
-			    PVFS|PDROP, "vlruwt", hz);
+			vnlru_proc_sleep();
 			continue;
 		}
 		rfreevnodes = vnlru_read_freevnodes();
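
Illustrative note (not part of the commit): the three-way return convention of vnlru_proc_light_pick() may be easier to follow as a pure function. The sketch below is a userspace model only. The name light_pick and its parameter list are hypothetical; the kernel function takes no arguments and reads numvnodes, desiredvnodes, wantfreevnodes, vstir and vnlruproc_sig from global state. Only the branch structure is taken from the diff above.

```c
#include <stdbool.h>

/*
 * Userspace model of the decision made by vnlru_proc_light_pick().
 * Hypothetical signature: the kernel globals are passed in as parameters
 * so the branches can be exercised in isolation.
 *
 * Returns:
 *   -1  fall back to the regular vnlru loop
 *    0  nothing to do, go to sleep
 *   >0  number of free vnodes to recycle
 */
static long
light_pick(unsigned long numvnodes, unsigned long freevnodes,
    unsigned long desiredvnodes, unsigned long wantfreevnodes,
    bool vstir, int vnlruproc_sig)
{
	/* Someone asked for the full machinery explicitly. */
	if (vstir || vnlruproc_sig == 1)
		return (-1);

	if (numvnodes > desiredvnodes) {
		/*
		 * Over the limit: the light path only helps when freeing
		 * free vnodes alone gets back under it and enough free
		 * vnodes exist to begin with.
		 */
		if (numvnodes - freevnodes >= desiredvnodes ||
		    freevnodes <= wantfreevnodes)
			return (-1);
		return ((long)(numvnodes - desiredvnodes));
	}

	/* Too few vnodes overall to bother with the free target. */
	if (numvnodes < wantfreevnodes)
		return (0);

	/* Free target missed: needs the regular loop. */
	if (freevnodes < wantfreevnodes)
		return (-1);

	return (0);
}
```

With the vnode limit from the sample testcase (8388608) and an assumed, purely illustrative wantfreevnodes of 2097152, a transient excess of 100 vnodes with 3000000 free ones yields 100, so the light path recycles exactly the excess and the regular loop is never entered.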