From: Mateusz Guzik <mjg@FreeBSD.org>
Date: Tue, 27 Oct 2020 18:12:07 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r367089 - head/sys/kern
Message-Id: <202010271812.09RIC7pn040814@repo.freebsd.org>

Author: mjg
Date: Tue Oct 27 18:12:07 2020
New Revision: 367089
URL: https://svnweb.freebsd.org/changeset/base/367089

Log:
  vfs: fix vnode reclaim races against getnewvnode

  All vnodes allocated by UMA are present on the global list used by
  vnlru. getnewvnode modifies the state of the vnode (most notably
  altering v_holdcnt) but never locks it. Moreover, filesystems also
  modify it in arbitrary ways, sometimes before taking the vnode lock
  or adding any other indicator that the vnode can be used.

  Picking up such a vnode by vnlru would be problematic.

  To that end there are 2 fixes:
  - vlrureclaim, not recycling v_holdcnt == 0 vnodes, takes the
    interlock and verifies that v_mount has been set. It is an
    invariant that the vnode lock is held by that point, providing
    the necessary serialisation against locking after vhold.
  - vnlru_free_locked, only wanting to free v_holdcnt == 0 vnodes,
    now makes sure to only transition the count 0->1. Newly allocated
    vnodes start with v_holdcnt == VHOLD_NO_SMR, and getnewvnode is
    the only place transitioning VHOLD_NO_SMR->1, once more making
    the hold fail.

  Tested by:	pho
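The VHOLD_NO_SMR scheme described in the log is compact enough to condense
into a standalone userspace sketch. This is an illustration only, not the
kernel code: the flag value, the sketch_* function names, and the plain
atomic_store in the getnewvnode stand-in are all simplifications of what the
kernel sources actually do.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define HOLD_NO_SMR	(1 << 30)	/* stand-in for VHOLD_NO_SMR; value illustrative */

static atomic_int holdcnt;

/* vnode_init() stand-in: fresh vnodes are marked not yet usable. */
static void
sketch_vnode_init(void)
{
	atomic_store(&holdcnt, HOLD_NO_SMR);
}

/* getnewvnode() stand-in: the only HOLD_NO_SMR -> 1 transition. */
static void
sketch_getnewvnode(void)
{
	atomic_store(&holdcnt, 1);
}

/* vnlru-side hold: fails while the flag is set. */
static bool
sketch_vhold(void)
{
	int count;

	count = atomic_load(&holdcnt);
	for (;;) {
		if (count & HOLD_NO_SMR)
			return (false);
		/* on failure, count is reloaded, like atomic_fcmpset_int */
		if (atomic_compare_exchange_weak(&holdcnt, &count, count + 1))
			return (true);
	}
}

int
main(void)
{
	sketch_vnode_init();
	printf("hold before getnewvnode: %d\n", sketch_vhold());	/* prints 0 */
	sketch_getnewvnode();
	printf("hold after getnewvnode: %d\n", sketch_vhold());	/* prints 1 */
	return (0);
}

The point the sketch makes is that the flag and the reference count share one
word, so a single compare-exchange both rejects never-handed-out vnodes and
takes the reference, with no window in between.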
Modified:
  head/sys/kern/vfs_subr.c

Modified: head/sys/kern/vfs_subr.c
==============================================================================
--- head/sys/kern/vfs_subr.c	Tue Oct 27 18:11:11 2020	(r367088)
+++ head/sys/kern/vfs_subr.c	Tue Oct 27 18:12:07 2020	(r367089)
@@ -109,7 +109,7 @@ static void syncer_shutdown(void *arg, int howto);
 static int	vtryrecycle(struct vnode *vp);
 static void	v_init_counters(struct vnode *);
 static void	vgonel(struct vnode *);
-static bool	vhold_recycle(struct vnode *);
+static bool	vhold_recycle_free(struct vnode *);
 static void	vfs_knllock(void *arg);
 static void	vfs_knlunlock(void *arg);
 static void	vfs_knl_assert_locked(void *arg);
@@ -561,6 +561,11 @@ vnode_init(void *mem, int size, int flags)
 	vp->v_dbatchcpu = NOCPU;
 
+	/*
+	 * Check vhold_recycle_free for an explanation.
+	 */
+	vp->v_holdcnt = VHOLD_NO_SMR;
+
 	vp->v_type = VNON;
 	mtx_lock(&vnode_list_mtx);
 	TAILQ_INSERT_BEFORE(vnode_list_free_marker, vp, v_vnodelist);
 	mtx_unlock(&vnode_list_mtx);
@@ -1127,8 +1132,25 @@ restart:
 			goto next_iter;
 		}
 
-		if (!vhold_recycle(vp))
+		/*
+		 * Handle races against vnode allocation. Filesystems lock the
+		 * vnode some time after it gets returned from getnewvnode,
+		 * despite type and hold count being manipulated earlier.
+		 * Resorting to checking v_mount restores guarantees present
+		 * before the global list was reworked to contain all vnodes.
+		 */
+		if (!VI_TRYLOCK(vp))
 			goto next_iter;
+		if (__predict_false(vp->v_type == VBAD || vp->v_type == VNON)) {
+			VI_UNLOCK(vp);
+			goto next_iter;
+		}
+		if (vp->v_mount == NULL) {
+			VI_UNLOCK(vp);
+			goto next_iter;
+		}
+		vholdl(vp);
+		VI_UNLOCK(vp);
 		TAILQ_REMOVE(&vnode_list, mvp, v_vnodelist);
 		TAILQ_INSERT_AFTER(&vnode_list, vp, mvp, v_vnodelist);
 		mtx_unlock(&vnode_list_mtx);
@@ -1228,13 +1250,13 @@ restart:
 		    mp->mnt_op != mnt_op)) {
 			continue;
 		}
-		TAILQ_REMOVE(&vnode_list, mvp, v_vnodelist);
-		TAILQ_INSERT_AFTER(&vnode_list, vp, mvp, v_vnodelist);
 		if (__predict_false(vp->v_type == VBAD || vp->v_type == VNON)) {
 			continue;
 		}
-		if (!vhold_recycle(vp))
+		if (!vhold_recycle_free(vp))
 			continue;
+		TAILQ_REMOVE(&vnode_list, mvp, v_vnodelist);
+		TAILQ_INSERT_AFTER(&vnode_list, vp, mvp, v_vnodelist);
 		count--;
 		mtx_unlock(&vnode_list_mtx);
 		vtryrecycle(vp);
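The vlrureclaim hunk above is an instance of a trylock-then-validate pattern:
skip the vnode if its interlock is contended, otherwise confirm under that
lock that initialization has completed before taking a hold. A reduced
userspace sketch, with a pthread mutex standing in for the vnode interlock;
the struct and field names here are invented for the example:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct candidate {
	pthread_mutex_t	interlock;
	void		*mount;	/* set late, only once the object is usable */
	int		type;	/* 0 stands in for VNON/VBAD */
};

/*
 * Skip a candidate whose lock is contended; otherwise validate its
 * state under the lock so a half-initialized object is never picked.
 */
static bool
try_pick(struct candidate *c)
{
	if (pthread_mutex_trylock(&c->interlock) != 0)
		return (false);
	if (c->type == 0 || c->mount == NULL) {
		pthread_mutex_unlock(&c->interlock);
		return (false);
	}
	/* the kernel takes its hold reference here, before unlocking */
	pthread_mutex_unlock(&c->interlock);
	return (true);
}

int
main(void)
{
	static struct candidate c = {
		.interlock = PTHREAD_MUTEX_INITIALIZER,
	};

	return (try_pick(&c) ? 1 : 0);	/* returns 0: half-initialized, skipped */
}

The remaining hunks of the diff follow.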
@@ -3251,11 +3273,13 @@ vholdnz(struct vnode *vp)
  * However, while this is more performant, it hinders debugging by eliminating
  * the previously mentioned invariant.
  */
-static bool __always_inline
-_vhold_cond(struct vnode *vp)
+bool
+vhold_smr(struct vnode *vp)
 {
 	int count;
 
+	VFS_SMR_ASSERT_ENTERED();
+
 	count = atomic_load_int(&vp->v_holdcnt);
 	for (;;) {
 		if (count & VHOLD_NO_SMR) {
@@ -3263,7 +3287,6 @@ _vhold_cond(struct vnode *vp)
 			    ("non-zero hold count with flags %d\n", count));
 			return (false);
 		}
-		VNASSERT(count >= 0, vp, ("invalid hold count %d\n", count));
 		if (atomic_fcmpset_int(&vp->v_holdcnt, &count,
 		    count + 1)) {
 			if (count == 0)
@@ -3273,26 +3296,45 @@ _vhold_cond(struct vnode *vp)
 	}
 }
 
-bool
-vhold_smr(struct vnode *vp)
-{
-
-	VFS_SMR_ASSERT_ENTERED();
-	return (_vhold_cond(vp));
-}
-
 /*
- * Special case for vnode recycling.
+ * Hold a free vnode for recycling.
  *
- * Vnodes are present on the global list until UMA takes them out.
- * Attempts to recycle only need the relevant lock and have no use for SMR.
+ * Note: vnode_init references this comment.
+ *
+ * Attempts to recycle only need the global vnode list lock and have no use for
+ * SMR.
+ *
+ * However, vnodes get inserted into the global list before they get fully
+ * initialized and stay there until UMA decides to free the memory. This in
+ * particular means the target can be found before it becomes usable and after
+ * it becomes recycled. Picking up such vnodes is guarded with v_holdcnt set to
+ * VHOLD_NO_SMR.
+ *
+ * Note: the vnode may gain more references after we transition the count 0->1.
  */
 static bool
-vhold_recycle(struct vnode *vp)
+vhold_recycle_free(struct vnode *vp)
 {
+	int count;
 
 	mtx_assert(&vnode_list_mtx, MA_OWNED);
-	return (_vhold_cond(vp));
+
+	count = atomic_load_int(&vp->v_holdcnt);
+	for (;;) {
+		if (count & VHOLD_NO_SMR) {
+			VNASSERT((count & ~VHOLD_NO_SMR) == 0, vp,
+			    ("non-zero hold count with flags %d\n", count));
+			return (false);
+		}
+		VNASSERT(count >= 0, vp, ("invalid hold count %d\n", count));
+		if (count > 0) {
+			return (false);
+		}
+		if (atomic_fcmpset_int(&vp->v_holdcnt, &count, count + 1)) {
+			vn_freevnodes_dec();
+			return (true);
+		}
+	}
 }
 
 static void __noinline
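For contrast with the earlier sketch: vhold_recycle_free differs from
vhold_smr in one respect, it refuses any vnode whose count is already
positive, so the free-vnode path only ever performs the 0 -> 1 transition.
The same loop reduced to C11 atomics; as before, the flag value and the
hold_if_free name are illustrative, and atomic_compare_exchange_weak reloads
the expected value on failure the same way atomic_fcmpset_int does:

#include <stdatomic.h>
#include <stdbool.h>

#define HOLD_NO_SMR	(1 << 30)	/* stand-in for VHOLD_NO_SMR; value illustrative */

/* Take a reference only when the object is free: count must be exactly 0. */
static bool
hold_if_free(atomic_int *holdcnt)
{
	int count;

	count = atomic_load(holdcnt);
	for (;;) {
		if (count & HOLD_NO_SMR)
			return (false);	/* never handed out by the allocator */
		if (count > 0)
			return (false);	/* already held; not free */
		if (atomic_compare_exchange_weak(holdcnt, &count, count + 1))
			return (true);	/* the 0 -> 1 transition succeeded */
	}
}

int
main(void)
{
	atomic_int cnt = 0;

	return (hold_if_free(&cnt) ? 0 : 1);	/* succeeds: count was 0 */
}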