Date: Thu, 25 Jun 2015 14:31:57 +0200
From: Mateusz Guzik
To: Konstantin Belousov
Cc: freebsd-fs@freebsd.org
Subject: Re: atomic v_usecount and v_holdcnt
Message-ID: <20150625123156.GA29667@dft-labs.eu>
In-Reply-To: <20150318104442.GS2379@kib.kiev.ua>

On Wed, Mar 18, 2015 at 12:44:42PM +0200, Konstantin Belousov wrote:
> On Tue, Mar 17, 2015 at 02:44:12AM +0100, Mateusz Guzik wrote:
> > On Mon, Mar 16, 2015 at 11:46:43AM +0200, Konstantin Belousov wrote:
> > > On Sat, Mar 14, 2015 at 11:52:26PM +0100, Mateusz Guzik wrote:
> > > > On Mon, Nov 24, 2014 at 11:52:52AM +0200, Konstantin Belousov wrote:
> > > > > On Sat, Nov 22, 2014 at 10:11:47PM +0100, Mateusz Guzik wrote:
> > > > > > On Sat, Nov 22, 2014 at 11:25:27AM +0200, Konstantin Belousov wrote:
> > > > > > > On Sat, Nov 22, 2014 at 01:28:12AM +0100, Mateusz Guzik wrote:
> > > > > > > > The idea is that we don't need an interlock as long as we don't
> > > > > > > > transition either counter 1->0 or 0->1.
> > > > > > > I already said that something along the lines of the patch should work.
> > > > > > > In fact, you need the vnode lock when the hold count changes between
> > > > > > > 0 and 1, and probably the same for the use count.
> > > > > >
> > > > > > I don't see why this would be required (not that I'm a VFS expert).
> > > > > > vnode recycling seems to be protected with the interlock.
> > > > > >
> > > > > > In fact I would argue that if this is really needed, the current
> > > > > > code is buggy.
> > > > > Yes, it is already (somewhat) buggy.
> > > > >
> > > > > Most of the need for the lock is for the case of counts coming from
> > > > > 1 to 0.  The reason is the handling of the active vnode list, which
> > > > > is used to limit the amount of vnode list walking in the syncer.
> > > > > When the hold count is decremented to 0, the vnode is removed from
> > > > > the active list.  When the use count is decremented to 0, the vnode
> > > > > is supposedly inactivated, and vinactive() cleans the cached pages
> > > > > belonging to the vnode.  In other words, VI_OWEINACT for a dirty
> > > > > vnode is sort of a bug.
> > > >
> > > > Modified the patch to no longer have the usecount + interlock dropped +
> > > > VI_OWEINACT set window.
> > > >
> > > > The extended 0->1 hold count + vnode not locked window remains.  I can
> > > > fix that, if it is really necessary, by having _vhold return with the
> > > > interlock held if it did such a transition.
> > >
> > > In v_upgrade_usecount(), you call v_incr_devcount() without the interlock
> > > held.  What prevents the devfs vnode from being recycled, in particular,
> > > from invalidation of the v_rdev pointer?
> >
> > Right, that was buggy.  Fixed in the patch below.
>
> Why is the non-atomicity of updates to several counters safe?  This at
> least requires an explanation in the comment; I mean the holdcnt/usecnt
> pair.

The patch below was tested with make -j 40 buildworld in a loop for 7
hours and it survived.

I started a comment above vget(); it is still unfinished.

Further playing around revealed that zfs will vref a vnode with no
usecount (zfs_lookup -> zfs_dirlook -> zfs_dirent_lock -> zfs_zget ->
VN_HOLD) and it is possible that such a vnode will have VI_OWEINACT set
(tested on a kernel without my patch).  VN_HOLD is defined as vref().

The code can sleep, so with some shuffling around vinactive() could be
called when the vnode happens to be exclusively locked (but most of the
time it is locked shared).  However, it seems that vputx deals with such
consumers:

	if (vp->v_usecount > 0)
		vp->v_iflag &= ~VI_OWEINACT;

Given that there are possibly more consumers like zfs, how about the
following: in vputx, assert that the flag is unset if the usecount went
above 0; in vref and vget, clear the flag when transitioning 0->1 and
assert that it is unset otherwise.

The way I read it is that in the stock kernel, with properly timed vref,
the flag would be cleared anyway, with vinactive() only called if the
clearing was done by vget and only with the vnode exclusively locked.
With the aforementioned change the likelihood of vinactive() remains the
same, but now the flag state can be asserted.

> Assume the thread increased the v_usecount, but did not manage to
> acquire dev_mtx.  Another thread performs vrele() and progresses to
> v_decr_devcount().  It decreases the si_usecount, which might allow yet
> another thread to see the si_usecount as too low and start an unwanted
> action.  I think that the tests for VCHR must be done at the very
> start of the functions, and devfs vnodes must hold the vnode interlock
> unconditionally.

Inserted v_type != VCHR checks in the relevant places; the si_usecount
manipulation functions now assert that the interlock is held.
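As an aside, the 0 <-> 1 boundary rule may be easier to see in isolation.
Below is a userspace analogue of the conditional helpers the patch adds to
sys/refcount.h (a sketch only: C11 atomics stand in for atomic_cmpset_int,
and the function names and demo harness are made up for illustration):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Take a reference only if the count is already non-zero; refuse to do
 * the 0->1 transition so the caller falls back to the interlocked path.
 */
static bool
acquire_if_not_zero(atomic_uint *count)
{
	unsigned int old;

	old = atomic_load(count);
	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return (true);
	}
	return (false);
}

/*
 * Drop a reference only if it is not the last one; refuse to do the
 * 1->0 transition so the caller falls back to the interlocked path.
 * The caller must own a reference, so the count is at least 1 on entry.
 */
static bool
release_if_not_last(atomic_uint *count)
{
	unsigned int old;

	old = atomic_load(count);
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return (true);
	}
	return (false);
}

int
main(void)
{
	atomic_uint cnt = 1;

	printf("%d\n", acquire_if_not_zero(&cnt));	/* 1; cnt is now 2 */
	printf("%d\n", release_if_not_last(&cnt));	/* 1; cnt is back to 1 */
	printf("%d\n", release_if_not_last(&cnt));	/* 0; left to the locked path */
	return (0);
}

Either helper failing means the caller is about to cross the 0/1 boundary
and has to take the interlock, which is what serializes against threads
going the other way.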
> > > I think that the refcount_acquire_if_greater() KPI is excessive.  You
> > > always call acquire with val == 0, and release with val == 1.
> >
> > Yeah, I noted in my previous e-mail that it should be changed (see below).
> >
> > I replaced them with refcount_acquire_if_not_zero and
> > refcount_release_if_not_last.
>
> I dislike the length of the names.  Can you propose something shorter?

Unfortunately the original API is already quite verbose and I don't have
anything readable which would retain "refcount_acquire" (as opposed to a
"ref_get" or "ref_acq").  Adding "_nz" as a suffix does not look good
("refcount_acquire_if_nz").

> The type for the local variable old in both functions should be u_int.

Done.

> > > WRT _refcount_release_lock, why can't the lock_object->lc_lock/lc_unlock
> > > KPI be used?  This would allow making refcount_release_lock() a function
> > > instead of a gcc extension macro.  Not to mention that the macro is
> > > unused.
> >
> > These were supposed to be used by other code; I forgot to remove them
> > from the patch I sent here.
> >
> > We can discuss this in another thread.
> >
> > Strictly speaking we could use it here for the vnode interlock, but I
> > did not want to bypass the VI_LOCK macro (which right now is just a
> > mtx_lock, but this may change).
> >
> > Updated patch is below:
>
> Do not introduce ASSERT_VI_LOCK; the name difference between
> ASSERT_VI_LOCKED and ASSERT_VI_LOCK is only in the broken grammar.
> I do not see anything wrong with explicit if() statements where needed,
> in all four places.

Done.

> In vputx(), wrap the long line (if (refcount_release() || VI_DOINGINACT)).

Done.

diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/vnode.c b/sys/cddl/contrib/opensolaris/uts/common/fs/vnode.c
index 83f29c1..b587ebd 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/vnode.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/vnode.c
@@ -99,6 +99,6 @@ vn_rele_async(vnode_t *vp, taskq_t *taskq)
 	    (task_func_t *)vn_rele_inactive, vp, TQ_SLEEP) != 0);
 		return;
 	}
-	vp->v_usecount--;
+	refcount_release(&vp->v_usecount);
 	vdropl(vp);
 }
diff --git a/sys/kern/vfs_cache.c b/sys/kern/vfs_cache.c
index 19ef783..cb4ea94 100644
--- a/sys/kern/vfs_cache.c
+++ b/sys/kern/vfs_cache.c
@@ -661,12 +661,12 @@ success:
 		ltype = VOP_ISLOCKED(dvp);
 		VOP_UNLOCK(dvp, 0);
 	}
-	VI_LOCK(*vpp);
+	vhold(*vpp);
 	if (wlocked)
 		CACHE_WUNLOCK();
 	else
 		CACHE_RUNLOCK();
-	error = vget(*vpp, cnp->cn_lkflags | LK_INTERLOCK, cnp->cn_thread);
+	error = vget(*vpp, cnp->cn_lkflags | LK_VNHELD, cnp->cn_thread);
 	if (cnp->cn_flags & ISDOTDOT) {
 		vn_lock(dvp, ltype | LK_RETRY);
 		if (dvp->v_iflag & VI_DOOMED) {
@@ -1366,9 +1366,9 @@ vn_dir_dd_ino(struct vnode *vp)
 		if ((ncp->nc_flag & NCF_ISDOTDOT) != 0)
 			continue;
 		ddvp = ncp->nc_dvp;
-		VI_LOCK(ddvp);
+		vhold(ddvp);
 		CACHE_RUNLOCK();
-		if (vget(ddvp, LK_INTERLOCK | LK_SHARED | LK_NOWAIT, curthread))
+		if (vget(ddvp, LK_SHARED | LK_NOWAIT | LK_VNHELD, curthread))
 			return (NULL);
 		return (ddvp);
 	}
diff --git a/sys/kern/vfs_hash.c b/sys/kern/vfs_hash.c
index 930fca1..48601e7 100644
--- a/sys/kern/vfs_hash.c
+++ b/sys/kern/vfs_hash.c
@@ -84,9 +84,9 @@ vfs_hash_get(const struct mount *mp, u_int hash, int flags, struct thread *td, s
 			continue;
 		if (fn != NULL && fn(vp, arg))
 			continue;
-		VI_LOCK(vp);
+		vhold(vp);
 		rw_runlock(&vfs_hash_lock);
-		error = vget(vp, flags | LK_INTERLOCK, td);
+		error = vget(vp, flags | LK_VNHELD, td);
 		if (error == ENOENT && (flags & LK_NOWAIT) == 0)
 			break;
 		if (error)
@@ -128,9 +128,9 @@ vfs_hash_insert(struct vnode *vp, u_int hash, int flags, struct thread *td, stru
 			continue;
 		if (fn != NULL && fn(vp2, arg))
 			continue;
-		VI_LOCK(vp2);
+		vhold(vp2);
 		rw_wunlock(&vfs_hash_lock);
-		error = vget(vp2, flags | LK_INTERLOCK, td);
+		error = vget(vp2, flags | LK_VNHELD, td);
 		if (error == ENOENT && (flags & LK_NOWAIT) == 0)
 			break;
 		rw_wlock(&vfs_hash_lock);
diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 1f1a7b6..a8cd2cb 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -68,6 +68,7 @@ __FBSDID("$FreeBSD$");
 #include
 #include
 #include
+#include <sys/refcount.h>
 #include
 #include
 #include
@@ -102,9 +103,8 @@ static int flushbuflist(struct bufv *bufv, int flags, struct bufobj *bo,
 static void	syncer_shutdown(void *arg, int howto);
 static int	vtryrecycle(struct vnode *vp);
 static void	v_incr_usecount(struct vnode *);
-static void	v_decr_usecount(struct vnode *);
-static void	v_decr_useonly(struct vnode *);
-static void	v_upgrade_usecount(struct vnode *);
+static void	v_incr_devcount(struct vnode *);
+static void	v_decr_devcount(struct vnode *);
 static void	vnlru_free(int);
 static void	vgonel(struct vnode *);
 static void	vfs_knllock(void *arg);
@@ -868,7 +868,7 @@ vnlru_free(int count)
 		 */
 		freevnodes--;
 		vp->v_iflag &= ~VI_FREE;
-		vp->v_holdcnt++;
+		refcount_acquire(&vp->v_holdcnt);
 		mtx_unlock(&vnode_free_list_mtx);
 		VI_UNLOCK(vp);
 
@@ -2079,78 +2079,68 @@ reassignbuf(struct buf *bp)
 
 /*
  * Increment the use and hold counts on the vnode, taking care to reference
- * the driver's usecount if this is a chardev.  The vholdl() will remove
- * the vnode from the free list if it is presently free.  Requires the
- * vnode interlock and returns with it held.
+ * the driver's usecount if this is a chardev.  The _vhold() will remove
+ * the vnode from the free list if it is presently free.
  */
 static void
 v_incr_usecount(struct vnode *vp)
 {
 
+	ASSERT_VI_UNLOCKED(vp, __func__);
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	vholdl(vp);
-	vp->v_usecount++;
-	if (vp->v_type == VCHR && vp->v_rdev != NULL) {
-		dev_lock();
-		vp->v_rdev->si_usecount++;
-		dev_unlock();
-	}
-}
-
-/*
- * Turn a holdcnt into a use+holdcnt such that only one call to
- * v_decr_usecount is needed.
- */
-static void
-v_upgrade_usecount(struct vnode *vp)
-{
+	if (vp->v_type == VCHR) {
+		VI_LOCK(vp);
+		_vhold(vp, true);
+		if (vp->v_iflag & VI_OWEINACT) {
+			VNASSERT(vp->v_usecount == 0, vp,
+			    ("vnode with usecount and VI_OWEINACT set"));
+			vp->v_iflag &= ~VI_OWEINACT;
+		}
+		refcount_acquire(&vp->v_usecount);
+		v_incr_devcount(vp);
+		VI_UNLOCK(vp);
+		return;
+	}
 
-	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	vp->v_usecount++;
-	if (vp->v_type == VCHR && vp->v_rdev != NULL) {
-		dev_lock();
-		vp->v_rdev->si_usecount++;
-		dev_unlock();
+	_vhold(vp, false);
+	if (refcount_acquire_if_not_zero(&vp->v_usecount)) {
+		VNASSERT((vp->v_iflag & VI_OWEINACT) == 0, vp,
+		    ("vnode with usecount and VI_OWEINACT set"));
+	} else {
+		VI_LOCK(vp);
+		if (vp->v_iflag & VI_OWEINACT)
+			vp->v_iflag &= ~VI_OWEINACT;
+		refcount_acquire(&vp->v_usecount);
+		VI_UNLOCK(vp);
 	}
 }
 
 /*
- * Decrement the vnode use and hold count along with the driver's usecount
- * if this is a chardev.  The vdropl() below releases the vnode interlock
- * as it may free the vnode.
+ * Increment si_usecount of the associated device, if any.
  */
 static void
-v_decr_usecount(struct vnode *vp)
+v_incr_devcount(struct vnode *vp)
 {
 
-	ASSERT_VI_LOCKED(vp, __FUNCTION__);
-	VNASSERT(vp->v_usecount > 0, vp,
-	    ("v_decr_usecount: negative usecount"));
-	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	vp->v_usecount--;
+	ASSERT_VI_LOCKED(vp, __func__);
+
 	if (vp->v_type == VCHR && vp->v_rdev != NULL) {
 		dev_lock();
-		vp->v_rdev->si_usecount--;
+		vp->v_rdev->si_usecount++;
 		dev_unlock();
 	}
-	vdropl(vp);
 }
 
 /*
- * Decrement only the use count and driver use count.  This is intended to
- * be paired with a follow on vdropl() to release the remaining hold count.
- * In this way we may vgone() a vnode with a 0 usecount without risk of
- * having it end up on a free list because the hold count is kept above 0.
+ * Decrement si_usecount of the associated device, if any.
  */
 static void
-v_decr_useonly(struct vnode *vp)
+v_decr_devcount(struct vnode *vp)
 {
 
-	ASSERT_VI_LOCKED(vp, __FUNCTION__);
-	VNASSERT(vp->v_usecount > 0, vp,
-	    ("v_decr_useonly: negative usecount"));
-	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	vp->v_usecount--;
+	ASSERT_VI_LOCKED(vp, __func__);
+
 	if (vp->v_type == VCHR && vp->v_rdev != NULL) {
 		dev_lock();
 		vp->v_rdev->si_usecount--;
@@ -2164,21 +2154,38 @@ v_decr_useonly(struct vnode *vp)
  * is being destroyed.  Only callers who specify LK_RETRY will
  * see doomed vnodes.  If inactive processing was delayed in
  * vput try to do it here.
+ *
+ * Notes on lockless counter manipulation:
+ * The hold count prevents the vnode from being freed, while the
+ * use count prevents it from being recycled.
+ *
+ * Only 1->0 and 0->1 transitions require atomicity with respect to
+ * other operations (e.g. taking the vnode off of a free list).
+ * In such a case the interlock is taken, which provides mutual
+ * exclusion against threads transitioning the other way.
  */
 int
 vget(struct vnode *vp, int flags, struct thread *td)
 {
-	int error;
+	int error, oweinact;
 
-	error = 0;
 	VNASSERT((flags & LK_TYPE_MASK) != 0, vp,
 	    ("vget: invalid lock operation"));
+
+	if ((flags & LK_INTERLOCK) != 0)
+		ASSERT_VI_LOCKED(vp, __func__);
+	else
+		ASSERT_VI_UNLOCKED(vp, __func__);
+	if ((flags & LK_VNHELD) != 0)
+		VNASSERT((vp->v_holdcnt > 0), vp,
+		    ("vget: LK_VNHELD passed but vnode not held"));
+
 	CTR3(KTR_VFS, "%s: vp %p with flags %d", __func__, vp, flags);
 
-	if ((flags & LK_INTERLOCK) == 0)
-		VI_LOCK(vp);
-	vholdl(vp);
-	if ((error = vn_lock(vp, flags | LK_INTERLOCK)) != 0) {
+	if ((flags & LK_VNHELD) == 0)
+		_vhold(vp, (flags & LK_INTERLOCK) != 0);
+
+	if ((error = vn_lock(vp, flags)) != 0) {
 		vdrop(vp);
 		CTR2(KTR_VFS, "%s: impossible to lock vnode %p", __func__,
 		    vp);
@@ -2186,22 +2193,34 @@ vget(struct vnode *vp, int flags, struct thread *td)
 	}
 	if (vp->v_iflag & VI_DOOMED && (flags & LK_RETRY) == 0)
 		panic("vget: vn_lock failed to return ENOENT\n");
-	VI_LOCK(vp);
-	/* Upgrade our holdcnt to a usecount. */
-	v_upgrade_usecount(vp);
+
 	/*
 	 * We don't guarantee that any particular close will
 	 * trigger inactive processing so just make a best effort
 	 * here at preventing a reference to a removed file.  If
 	 * we don't succeed no harm is done.
+	 *
+	 * Upgrade our holdcnt to a usecount.
 	 */
-	if (vp->v_iflag & VI_OWEINACT) {
-		if (VOP_ISLOCKED(vp) == LK_EXCLUSIVE &&
+	if (vp->v_type != VCHR &&
+	    refcount_acquire_if_not_zero(&vp->v_usecount)) {
+		VNASSERT((vp->v_iflag & VI_OWEINACT) == 0, vp,
+		    ("vnode with usecount and VI_OWEINACT set"));
+	} else {
+		VI_LOCK(vp);
+		if ((vp->v_iflag & VI_OWEINACT) == 0) {
+			oweinact = 0;
+		} else {
+			oweinact = 1;
+			vp->v_iflag &= ~VI_OWEINACT;
+		}
+		refcount_acquire(&vp->v_usecount);
+		v_incr_devcount(vp);
+		if (oweinact && VOP_ISLOCKED(vp) == LK_EXCLUSIVE &&
 		    (flags & LK_NOWAIT) == 0)
 			vinactive(vp, td);
-		vp->v_iflag &= ~VI_OWEINACT;
+		VI_UNLOCK(vp);
 	}
-	VI_UNLOCK(vp);
 	return (0);
 }
 
@@ -2213,36 +2232,34 @@ vref(struct vnode *vp)
 {
 
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	VI_LOCK(vp);
 	v_incr_usecount(vp);
-	VI_UNLOCK(vp);
 }
 
 /*
  * Return reference count of a vnode.
  *
- * The results of this call are only guaranteed when some mechanism other
- * than the VI lock is used to stop other processes from gaining references
- * to the vnode.  This may be the case if the caller holds the only reference.
- * This is also useful when stale data is acceptable as race conditions may
- * be accounted for by some other means.
+ * The results of this call are only guaranteed when some mechanism is used to
+ * stop other processes from gaining references to the vnode.  This may be the
+ * case if the caller holds the only reference.  This is also useful when stale
+ * data is acceptable as race conditions may be accounted for by some other
+ * means.
  */
 int
 vrefcnt(struct vnode *vp)
 {
-	int usecnt;
 
-	VI_LOCK(vp);
-	usecnt = vp->v_usecount;
-	VI_UNLOCK(vp);
-
-	return (usecnt);
+	return (vp->v_usecount);
 }
 
 #define	VPUTX_VRELE	1
 #define	VPUTX_VPUT	2
 #define	VPUTX_VUNREF	3
 
+/*
+ * Decrement the use and hold counts for a vnode.
+ *
+ * See an explanation near vget() as to why atomic operation is safe.
+ */
 static void
 vputx(struct vnode *vp, int func)
 {
@@ -2255,33 +2272,44 @@ vputx(struct vnode *vp, int func)
 		ASSERT_VOP_LOCKED(vp, "vput");
 	else
 		KASSERT(func == VPUTX_VRELE, ("vputx: wrong func"));
+	ASSERT_VI_UNLOCKED(vp, __func__);
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	VI_LOCK(vp);
-
-	/* Skip this v_writecount check if we're going to panic below. */
-	VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp,
-	    ("vputx: missed vn_close"));
-	error = 0;
 
-	if (vp->v_usecount > 1 || ((vp->v_iflag & VI_DOINGINACT) &&
-	    vp->v_usecount == 1)) {
+	if (vp->v_type != VCHR &&
+	    refcount_release_if_not_last(&vp->v_usecount)) {
 		if (func == VPUTX_VPUT)
 			VOP_UNLOCK(vp, 0);
-		v_decr_usecount(vp);
+		vdrop(vp);
 		return;
 	}
 
-	if (vp->v_usecount != 1) {
-		vprint("vputx: negative ref count", vp);
-		panic("vputx: negative ref cnt");
-	}
-	CTR2(KTR_VFS, "%s: return vnode %p to the freelist", __func__, vp);
+	VI_LOCK(vp);
+
 	/*
 	 * We want to hold the vnode until the inactive finishes to
 	 * prevent vgone() races.  We drop the use count here and the
 	 * hold count below when we're done.
 	 */
-	v_decr_useonly(vp);
+	if (!refcount_release(&vp->v_usecount) ||
+	    (vp->v_iflag & VI_DOINGINACT)) {
+		if (func == VPUTX_VPUT)
+			VOP_UNLOCK(vp, 0);
+		v_decr_devcount(vp);
+		vdropl(vp);
+		return;
+	}
+
+	v_decr_devcount(vp);
+
+	error = 0;
+
+	if (vp->v_usecount != 0) {
+		vprint("vputx: usecount not zero", vp);
+		panic("vputx: usecount not zero");
+	}
+
+	CTR2(KTR_VFS, "%s: return vnode %p to the freelist", __func__, vp);
+
 	/*
 	 * We must call VOP_INACTIVE with the node locked.  Mark
 	 * as VI_DOINGINACT to avoid recursion.
@@ -2307,7 +2335,8 @@ vputx(struct vnode *vp, int func)
 		break;
 	}
 	if (vp->v_usecount > 0)
-		vp->v_iflag &= ~VI_OWEINACT;
+		VNASSERT((vp->v_iflag & VI_OWEINACT) == 0, vp,
+		    ("vnode with usecount and VI_OWEINACT set"));
 	if (error == 0) {
 		if (vp->v_iflag & VI_OWEINACT)
 			vinactive(vp, curthread);
@@ -2351,36 +2380,36 @@ vunref(struct vnode *vp)
 }
 
 /*
- * Somebody doesn't want the vnode recycled.
- */
-void
-vhold(struct vnode *vp)
-{
-
-	VI_LOCK(vp);
-	vholdl(vp);
-	VI_UNLOCK(vp);
-}
-
-/*
  * Increase the hold count and activate if this is the first reference.
  */
 void
-vholdl(struct vnode *vp)
+_vhold(struct vnode *vp, bool locked)
 {
 	struct mount *mp;
 
+	if (locked)
+		ASSERT_VI_LOCKED(vp, __func__);
+	else
+		ASSERT_VI_UNLOCKED(vp, __func__);
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-#ifdef INVARIANTS
-	/* getnewvnode() calls v_incr_usecount() without holding interlock. */
-	if (vp->v_type != VNON || vp->v_data != NULL)
-		ASSERT_VI_LOCKED(vp, "vholdl");
-#endif
-	vp->v_holdcnt++;
-	if ((vp->v_iflag & VI_FREE) == 0)
+	if (!locked && refcount_acquire_if_not_zero(&vp->v_holdcnt)) {
+		VNASSERT((vp->v_iflag & VI_FREE) == 0, vp,
+		    ("_vhold: vnode with holdcnt is free"));
 		return;
-	VNASSERT(vp->v_holdcnt == 1, vp, ("vholdl: wrong hold count"));
-	VNASSERT(vp->v_op != NULL, vp, ("vholdl: vnode already reclaimed."));
+	}
+
+	if (!locked)
+		VI_LOCK(vp);
+	if ((vp->v_iflag & VI_FREE) == 0) {
+		refcount_acquire(&vp->v_holdcnt);
+		if (!locked)
+			VI_UNLOCK(vp);
+		return;
+	}
+	VNASSERT(vp->v_holdcnt == 0, vp,
+	    ("%s: wrong hold count", __func__));
+	VNASSERT(vp->v_op != NULL, vp,
+	    ("%s: vnode already reclaimed.", __func__));
 	/*
 	 * Remove a vnode from the free list, mark it as in use,
 	 * and put it on the active list.
@@ -2396,18 +2425,9 @@ vholdl(struct vnode *vp)
 	TAILQ_INSERT_HEAD(&mp->mnt_activevnodelist, vp, v_actfreelist);
 	mp->mnt_activevnodelistsize++;
 	mtx_unlock(&vnode_free_list_mtx);
-}
-
-/*
- * Note that there is one less who cares about this vnode.
- * vdrop() is the opposite of vhold().
- */
-void
-vdrop(struct vnode *vp)
-{
-
-	VI_LOCK(vp);
-	vdropl(vp);
+	refcount_acquire(&vp->v_holdcnt);
+	if (!locked)
+		VI_UNLOCK(vp);
 }
 
 /*
@@ -2416,20 +2436,28 @@ vdrop(struct vnode *vp)
 * (marked VI_DOOMED) in which case we will free it.
 */
 void
-vdropl(struct vnode *vp)
+_vdrop(struct vnode *vp, bool locked)
 {
 	struct bufobj *bo;
 	struct mount *mp;
 	int active;
 
-	ASSERT_VI_LOCKED(vp, "vdropl");
+	if (locked)
+		ASSERT_VI_LOCKED(vp, __func__);
+	else
+		ASSERT_VI_UNLOCKED(vp, __func__);
 	CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-	if (vp->v_holdcnt <= 0)
+	if ((int)vp->v_holdcnt <= 0)
 		panic("vdrop: holdcnt %d", vp->v_holdcnt);
-	vp->v_holdcnt--;
-	VNASSERT(vp->v_holdcnt >= vp->v_usecount, vp,
-	    ("hold count less than use count"));
-	if (vp->v_holdcnt > 0) {
+	if (refcount_release_if_not_last(&vp->v_holdcnt)) {
+		if (locked)
+			VI_UNLOCK(vp);
+		return;
+	}
+
+	if (!locked)
+		VI_LOCK(vp);
+	if (refcount_release(&vp->v_holdcnt) == 0) {
 		VI_UNLOCK(vp);
 		return;
 	}
diff --git a/sys/sys/lockmgr.h b/sys/sys/lockmgr.h
index ff0473d..a74d5f5 100644
--- a/sys/sys/lockmgr.h
+++ b/sys/sys/lockmgr.h
@@ -159,6 +159,7 @@ _lockmgr_args_rw(struct lock *lk, u_int flags, struct rwlock *ilk,
 #define	LK_SLEEPFAIL	0x000800
 #define	LK_TIMELOCK	0x001000
 #define	LK_NODDLKTREAT	0x002000
+#define	LK_VNHELD	0x004000
 
 /*
  * Operations for lockmgr().
diff --git a/sys/sys/refcount.h b/sys/sys/refcount.h
index 4611664..d3f817c 100644
--- a/sys/sys/refcount.h
+++ b/sys/sys/refcount.h
@@ -64,4 +64,32 @@ refcount_release(volatile u_int *count)
 	return (old == 1);
 }
 
+static __inline int
+refcount_acquire_if_not_zero(volatile u_int *count)
+{
+	u_int old;
+
+	for (;;) {
+		old = *count;
+		if (old == 0)
+			return (0);
+		if (atomic_cmpset_int(count, old, old + 1))
+			return (1);
+	}
+}
+
+static __inline int
+refcount_release_if_not_last(volatile u_int *count)
+{
+	u_int old;
+
+	for (;;) {
+		old = *count;
+		if (old == 1)
+			return (0);
+		if (atomic_cmpset_int(count, old, old - 1))
+			return (1);
+	}
+}
+
 #endif	/* ! __SYS_REFCOUNT_H__ */
diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h
index 36ef8af..9286a4e 100644
--- a/sys/sys/vnode.h
+++ b/sys/sys/vnode.h
@@ -162,8 +162,8 @@ struct vnode {
 	daddr_t	v_lastw;			/* v last write */
 	int	v_clen;				/* v length of cur. cluster */
 
-	int	v_holdcnt;			/* i prevents recycling. */
-	int	v_usecount;			/* i ref count of users */
+	u_int	v_holdcnt;			/* i prevents recycling. */
+	u_int	v_usecount;			/* i ref count of users */
 	u_int	v_iflag;			/* i vnode flags (see below) */
 	u_int	v_vflag;			/* v vnode flags */
 	int	v_writecount;			/* v ref count of writers */
@@ -652,13 +652,15 @@ int	vaccess_acl_posix1e(enum vtype type, uid_t file_uid,
 	    struct ucred *cred, int *privused);
 void	vattr_null(struct vattr *vap);
 int	vcount(struct vnode *vp);
-void	vdrop(struct vnode *);
-void	vdropl(struct vnode *);
+#define	vdrop(vp)	_vdrop((vp), 0)
+#define	vdropl(vp)	_vdrop((vp), 1)
+void	_vdrop(struct vnode *, bool);
 int	vflush(struct mount *mp, int rootrefs, int flags, struct thread *td);
 int	vget(struct vnode *vp, int lockflag, struct thread *td);
 void	vgone(struct vnode *vp);
-void	vhold(struct vnode *);
-void	vholdl(struct vnode *);
+#define	vhold(vp)	_vhold((vp), 0)
+#define	vholdl(vp)	_vhold((vp), 1)
+void	_vhold(struct vnode *, bool);
 void	vinactive(struct vnode *, struct thread *);
 int	vinvalbuf(struct vnode *vp, int save, int slpflag, int slptimeo);
 int	vtruncbuf(struct vnode *vp, struct ucred *cred, off_t length,
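To recap the consumer-visible part of the change, using the pattern from
the vfs_hash.c hunk above: instead of handing the interlock to vget(),
consumers take a hold and pass LK_VNHELD, so the interlock is only touched
on the 0/1 boundary:

	/* before */
	VI_LOCK(vp);
	rw_runlock(&vfs_hash_lock);
	error = vget(vp, flags | LK_INTERLOCK, td);

	/* after */
	vhold(vp);
	rw_runlock(&vfs_hash_lock);
	error = vget(vp, flags | LK_VNHELD, td);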