From nobody Thu Aug 24 22:13:53 2023
X-Original-To: dev-commits-src-main@mlmmj.nyi.freebsd.org
Date: Thu, 24 Aug 2023 22:13:53 GMT
Message-Id: <202308242213.37OMDrns074449@gitrepo.freebsd.org>
To: src-committers@FreeBSD.org, dev-commits-src-all@FreeBSD.org,
    dev-commits-src-main@FreeBSD.org
From: Mateusz Guzik
Subject: git: c1d85ac3df82 - main - vfs: try harder to find free vnodes when recycling
List-Id: Commit messages for the main branch of the src repository
List-Archive: https://lists.freebsd.org/archives/dev-commits-src-main
Sender: owner-dev-commits-src-main@freebsd.org
X-BeenThere: dev-commits-src-main@freebsd.org
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-Git-Committer: mjg
X-Git-Repository: src
X-Git-Refname: refs/heads/main
X-Git-Reftype: branch
X-Git-Commit: c1d85ac3df82df721e3d33b292579c4de491488e
Auto-Submitted: auto-generated

The branch main has been updated by mjg:

URL: https://cgit.FreeBSD.org/src/commit/?id=c1d85ac3df82df721e3d33b292579c4de491488e

commit c1d85ac3df82df721e3d33b292579c4de491488e
Author:     Mateusz Guzik
AuthorDate: 2023-08-24 05:34:08 +0000
Commit:     Mateusz Guzik
CommitDate: 2023-08-24 22:12:40 +0000

    vfs: try harder to find free vnodes when recycling

    The free vnode marker can slide past eligible entries.

    Artificially reducing vnode limit to 300k and spawning 104 workers each
    creating a million files results in all of them trying to recycle, which
    often fails when it should not have to. Because of the excessive traffic
    in this scenario, the trylock to requeue is virtually guaranteed to fail,
    meaning nothing gets pushed forward.

    Since no vnodes were found, the most unfortunate sleep for 1 second is
    induced (see vn_alloc_hard, the "vlruwk" msleep). Without the fix the
    machine is mostly idle with almost everyone stuck off CPU waiting for the
    sleep to finish. With the fix it is busy creating files.

    Unrelated to the above problem, the marker could have landed in a
    similarly problematic spot because of any failure in vtryrecycle.

    Originally reported as poudriere builders stalling in a vnode-count
    restricted setup.

    Fixes:          138a5dafba31 ("vfs: trylock vnode requeue")
    Reported by:    Mark Millard
---
 sys/kern/vfs_subr.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 0f3f00abfd4a..f1e1d1e3a0ca 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -196,6 +196,10 @@ static counter_u64_t recycles_free_count;
 SYSCTL_COUNTER_U64(_vfs, OID_AUTO, recycles_free, CTLFLAG_RD, &recycles_free_count,
     "Number of free vnodes recycled to meet vnode cache targets");
 
+static counter_u64_t vnode_skipped_requeues;
+SYSCTL_COUNTER_U64(_vfs, OID_AUTO, vnode_skipped_requeues, CTLFLAG_RD, &vnode_skipped_requeues,
+    "Number of times LRU requeue was skipped due to lock contention");
+
 static u_long deferred_inact;
 SYSCTL_ULONG(_vfs, OID_AUTO, deferred_inact, CTLFLAG_RD,
     &deferred_inact, 0, "Number of times inactive processing was deferred");
@@ -732,6 +736,7 @@ vntblinit(void *dummy __unused)
 	vnodes_created = counter_u64_alloc(M_WAITOK);
 	recycles_count = counter_u64_alloc(M_WAITOK);
 	recycles_free_count = counter_u64_alloc(M_WAITOK);
+	vnode_skipped_requeues = counter_u64_alloc(M_WAITOK);
 
 	/*
 	 * Initialize the filesystem syncer.
@@ -1280,11 +1285,13 @@ vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
 	struct vnode *vp;
 	struct mount *mp;
 	int ocount;
+	bool retried;
 
 	mtx_assert(&vnode_list_mtx, MA_OWNED);
 	if (count > max_vnlru_free)
 		count = max_vnlru_free;
 	ocount = count;
+	retried = false;
 	vp = mvp;
 	for (;;) {
 		if (count == 0) {
@@ -1292,6 +1299,24 @@ vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
 		}
 		vp = TAILQ_NEXT(vp, v_vnodelist);
 		if (__predict_false(vp == NULL)) {
+			/*
+			 * The free vnode marker can be past eligible vnodes:
+			 * 1. if vdbatch_process trylock failed
+			 * 2. if vtryrecycle failed
+			 *
+			 * If so, start the scan from scratch.
+			 */
+			if (!retried && vnlru_read_freevnodes() > 0) {
+				TAILQ_REMOVE(&vnode_list, mvp, v_vnodelist);
+				TAILQ_INSERT_HEAD(&vnode_list, mvp, v_vnodelist);
+				vp = mvp;
+				retried++;
+				continue;
+			}
+
+			/*
+			 * Give up
+			 */
 			TAILQ_REMOVE(&vnode_list, mvp, v_vnodelist);
 			TAILQ_INSERT_TAIL(&vnode_list, mvp, v_vnodelist);
 			break;
@@ -3528,6 +3553,17 @@ vdbatch_process(struct vdbatch *vd)
 	MPASS(curthread->td_pinned > 0);
 	MPASS(vd->index == VDBATCH_SIZE);
 
+	/*
+	 * Attempt to requeue the passed batch, but give up easily.
+	 *
+	 * Despite batching the mechanism is prone to transient *significant*
+	 * lock contention, where vnode_list_mtx becomes the primary bottleneck
+	 * if multiple CPUs get here (one real-world example is highly parallel
+	 * do-nothing make , which will stat *tons* of vnodes). Since it is
+	 * quasi-LRU (read: not that great even if fully honoured) just dodge
+	 * the problem. Parties which don't like it are welcome to implement
+	 * something better.
+	 */
 	critical_enter();
 	if (mtx_trylock(&vnode_list_mtx)) {
 		for (i = 0; i < VDBATCH_SIZE; i++) {
@@ -3540,6 +3576,8 @@ vdbatch_process(struct vdbatch *vd)
 		}
 		mtx_unlock(&vnode_list_mtx);
 	} else {
+		counter_u64_add(vnode_skipped_requeues, 1);
+
 		for (i = 0; i < VDBATCH_SIZE; i++) {
 			vp = vd->tab[i];
 			vd->tab[i] = NULL;
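
For readers without a kernel tree at hand, the retry added to vnlru_free_impl can be approximated in user space. The following is only a sketch under stated assumptions, not the kernel code: struct node, scan_free and the nfree counter are hypothetical stand-ins for struct vnode, vnlru_free_impl and vnlru_read_freevnodes(), and locking, mount filtering and vtryrecycle are left out. It shows the one-shot rescan from the list head when the marker reaches the tail while free entries are still known to exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vnode; not the kernel structure. */
struct node {
	TAILQ_ENTRY(node) link;
	bool is_marker;		/* true only for the scan marker */
	bool is_free;		/* eligible for recycling */
};

TAILQ_HEAD(node_list, node);

/*
 * Scan forward from the marker and reclaim up to `count` free nodes.
 * If the tail is reached while *nfree says free nodes still exist,
 * move the marker to the head and rescan, at most once per call.
 * Returns the number of nodes reclaimed.
 */
static int
scan_free(struct node_list *list, struct node *marker, int count, int *nfree)
{
	struct node *np;
	int ocount;
	bool retried;

	assert(marker->is_marker);
	ocount = count;
	retried = false;
	np = marker;
	while (count > 0) {
		np = TAILQ_NEXT(np, link);
		if (np == NULL) {
			if (!retried && *nfree > 0) {
				/* Marker slid past eligible entries: restart. */
				TAILQ_REMOVE(list, marker, link);
				TAILQ_INSERT_HEAD(list, marker, link);
				np = marker;
				retried = true;
				continue;
			}
			/* Give up: park the marker at the tail. */
			TAILQ_REMOVE(list, marker, link);
			TAILQ_INSERT_TAIL(list, marker, link);
			break;
		}
		if (np->is_marker || !np->is_free)
			continue;
		/* Park the marker right after the node being reclaimed. */
		TAILQ_REMOVE(list, marker, link);
		TAILQ_INSERT_AFTER(list, np, marker, link);
		np->is_free = false;
		(*nfree)--;
		count--;
		np = marker;
	}
	return (ocount - count);
}
```

With a list whose only free node sits before the marker, a single forward pass finds nothing; the rescan from the head is what lets the call succeed instead of reporting failure, which in the kernel is what leads to the one-second "vlruwk" sleep described in the commit message.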