From owner-freebsd-current@freebsd.org  Fri Nov  6 00:55:29 2015
Date: Fri, 6 Nov 2015 01:55:25 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Ian Lepore
Cc: John Baldwin, Adrian Chadd, freebsd-current, Konstantin Belousov
Subject: Re: [PATCH] microoptimize by trying to avoid locking a locked mutex
Message-ID: <20151106005524.GA4877@dft-labs.eu>
In-Reply-To: <1446766522.91534.412.camel@freebsd.org>
References: <20151104233218.GA27709@dft-labs.eu>
 <20151105192623.GB27709@dft-labs.eu>
 <1563180.x0Z3Ou4xid@ralph.baldwin.cx>
 <1446766522.91534.412.camel@freebsd.org>

On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
> > On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
> > > On 5 November 2015 at 11:26, Mateusz Guzik wrote:
> > > > On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
> > > > > On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov wrote:
> > > > > > On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik wrote:
> > > > > > > mtx_lock will unconditionally try to grab the lock and, if that
> > > > > > > fails, will call __mtx_lock_sleep, which will immediately try to
> > > > > > > do the same atomic op again.
> > > > > > >
> > > > > > > So, the obvious microoptimization is to check the state in
> > > > > > > __mtx_lock_sleep and avoid the operation if the lock is not free.
> > > > > > >
> > > > > > > This gives me ~40% speedup in a microbenchmark of 40 find
> > > > > > > processes traversing tmpfs and contending on the mount mtx (only
> > > > > > > used as an easy benchmark, I have WIP patches to get rid of it).
> > > > > > >
> > > > > > > The second part of the patch is optional and just checks the
> > > > > > > state of the lock prior to doing any atomic operations, but it
> > > > > > > gives a very modest speedup when applied on top of the
> > > > > > > __mtx_lock_sleep change. As such, I'm not going to defend this
> > > > > > > part.
> > > > > > Shouldn't the same consideration be applied to all spinning loops,
> > > > > > i.e. also to the spin/thread mutexes, and to the spinning parts of
> > > > > > sx and lockmgr?
> > > > >
> > > > > I agree.  I think both changes are good and worth doing in our other
> > > > > primitives.
> > > >
> > > > I glanced over e.g. rw_rlock and it did not have the issue; now that I
> > > > see _sx_xlock_hard, it would indeed use fixing.
> > > >
> > > > Expect a patch in a few hours for all primitives I'll find. I'll
> > > > stress test the kernel, but it is unlikely I'll do microbenchmarks for
> > > > the remaining primitives.
> > >
> > > Is this stuff you're proposing still valid for non-x86 platforms?
> >
> > Yes.  It just does a read before trying the atomic compare and swap, and
> > falls through to the hard case, as if the atomic op had failed, when the
> > result of the read would cause the compare to fail.
> >
>
> The atomic ops include barriers; the new pre-read of the variable
> doesn't.  Will that cause problems, especially for code inside a loop
> where the compiler may decide to shuffle things around?
>

The lock value is volatile, so the compiler should not screw us up. I
removed the store to the variable due to a different concern: in the worst
case we would have slept instead of spinning (see below).

> I suspect the performance gain will be biggest on the platforms where
> atomic ops are expensive (I gather from various code comments that's
> the case on x86).
>

I don't know about other architectures; on x86, atomic op performance on
contended cachelines deteriorates drastically.

=======================================================

For the most part the patch below makes two simple kinds of changes:

1. in macros for lock/unlock primitives, the lock value is fetched and
compared against the relevant _UNLOCKED macro prior to the atomic op

2. in functions:

before:
	while (!_mtx_obtain_lock(m, tid)) {
		v = m->mtx_lock;
		if (v != MTX_UNOWNED) {
			....
		}
		.....
	}

after:
	for (;;) {
		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
			break;
		v = m->mtx_lock;
		if (v != MTX_UNOWNED) {
			....
		}
		.....
	}

The original patch preloaded the 'v' variable. If the value was MTX_UNOWNED
and _mtx_obtain_lock failed, we just lost the race. In order to spin waiting
for the other thread to release the lock, we have to re-read the variable.
Thus, for simplicity, it is no longer stored in 'v'.

Note this is contrary to kib's earlier suggestion, which would put such
threads directly to sleep. I don't have proper measurements to form a strong
opinion here; I went the aforementioned route to minimise changes.
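Purely as an illustration (this is not part of the patch, and the names are
made up): the change amounts to the classic test-and-test-and-set pattern.
A minimal userland sketch of the same idea, written with C11 atomics rather
than the kernel's atomic(9) primitives, could look like this:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock: 0 means unlocked, otherwise the value is the owner's tid. */
struct toy_lock {
	atomic_uintptr_t state;
};

static bool
toy_trylock(struct toy_lock *l, uintptr_t tid)
{
	uintptr_t expected = 0;

	/* Single CAS with acquire semantics, analogous to _mtx_obtain_lock(). */
	return (atomic_compare_exchange_strong_explicit(&l->state, &expected,
	    tid, memory_order_acquire, memory_order_relaxed));
}

static void
toy_lock_acquire(struct toy_lock *l, uintptr_t tid)
{
	for (;;) {
		/*
		 * Plain read first: while the lock is owned, waiters spin on
		 * a cacheline they can all hold in shared state instead of
		 * hammering it with atomic read-modify-write attempts.
		 */
		if (atomic_load_explicit(&l->state, memory_order_relaxed) == 0 &&
		    toy_trylock(l, tid))
			break;
		/* Still owned; keep spinning (the kernel would block/adapt here). */
	}
}

The point of the plain load is that it only needs the cacheline in shared
state, so contending CPUs stop invalidating each other's copies; the
expensive atomic op is attempted only once the lock looks free.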
Note this is a trivial patch to improve the situation a little; I have no
interest in trying to polish these primitives at this time.

For overall results, on a machine with 32GB of RAM and the following:

CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads

this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.

=======================================================

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			break;
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
-		    tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			    __func__, iwmesg, file, line);
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
+
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
 	all_time -= lockstat_nsecs(&m->lock_object);
 #endif
 
-	while (!_mtx_obtain_lock(m, tid)) {
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
 #ifdef KDTRACE_HOOKS
 	spin_time -= lockstat_nsecs(&m->lock_object);
 #endif
-	while (!_mtx_obtain_lock(m, tid)) {
-
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 		/* Give interrupts a chance while we spin. */
 		spinlock_exit();
 		while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
 			    m->lock_object.lo_name, file, line));
 		WITNESS_CHECKORDER(&m->lock_object,
 		    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
-		while (!_mtx_obtain_lock(m, tid)) {
+		for (;;) {
+			if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+				break;
 			if (m->mtx_lock == tid) {
 				m->mtx_recurse++;
 				break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
 	all_time -= lockstat_nsecs(&rw->lock_object);
 	state = rw->rw_lock;
 #endif
-	while (!_rw_write_lock(rw, tid)) {
+	for (;;) {
+		if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
 	all_time -= lockstat_nsecs(&sx->lock_object);
 	state = sx->sx_lock;
 #endif
-	while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+	for (;;) {
+		if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+		    atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 #define __mtx_lock(mp, tid, opts, file, line) do {			\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_mtx_obtain_lock((mp), _tid))				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
 		_mtx_lock_sleep((mp), _tid, (opts), (file), (line));	\
 	else								\
 		LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire,	\
@@ -203,7 +203,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
 	spinlock_enter();						\
-	if (!_mtx_obtain_lock((mp), _tid)) {				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
 		if ((mp)->mtx_lock == _tid)				\
 			(mp)->mtx_recurse++;				\
 		else							\
@@ -232,7 +232,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 									\
 	if ((mp)->mtx_recurse == 0)					\
 		LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);	\
-	if (!_mtx_release_lock((mp), _tid))				\
+	if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))	\
 		_mtx_unlock_sleep((mp), (opts), (file), (line));	\
 } while (0)
diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
 #define __rw_wlock(rw, tid, file, line) do {				\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_rw_write_lock((rw), _tid))				\
+	if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
 		_rw_wlock_hard((rw), _tid, (file), (line));		\
 	else								\
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw,	\
@@ -112,7 +112,7 @@
 	else {								\
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,	\
 		    LOCKSTAT_WRITER);					\
-		if (!_rw_write_unlock((rw), _tid))			\
+		if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
 			_rw_wunlock_hard((rw), _tid, (file), (line));	\
 	}								\
 } while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
 	uintptr_t tid = (uintptr_t)td;
 	int error = 0;
 
-	if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+	if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+	    !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
 		error = _sx_xlock_hard(sx, tid, opts, file, line);
 	else
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
 	if (sx->sx_recurse == 0)
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
 		    LOCKSTAT_WRITER);
-	if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+	if (sx->sx_lock != tid ||
+	    !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
 		_sx_xunlock_hard(sx, tid, file, line);
 }