From owner-freebsd-current@freebsd.org  Fri Nov  6 00:55:29 2015
Date: Fri, 6 Nov 2015 01:55:25 +0100
From: Mateusz Guzik <mjguzik@gmail.com>
To: Ian Lepore
Cc: John Baldwin, Adrian Chadd, freebsd-current, Konstantin Belousov
Subject: Re: [PATCH] microoptimize by trying to avoid locking a locked mutex
Message-ID: <20151106005524.GA4877@dft-labs.eu>
In-Reply-To: <1446766522.91534.412.camel@freebsd.org>
References: <20151104233218.GA27709@dft-labs.eu>
 <20151105192623.GB27709@dft-labs.eu>
 <1563180.x0Z3Ou4xid@ralph.baldwin.cx>
 <1446766522.91534.412.camel@freebsd.org>

On Thu, Nov 05, 2015 at 04:35:22PM -0700, Ian Lepore wrote:
> On Thu, 2015-11-05 at 14:19 -0800, John Baldwin wrote:
> > On Thursday, November 05, 2015 01:45:19 PM Adrian Chadd wrote:
> > > On 5 November 2015 at 11:26, Mateusz Guzik wrote:
> > > > On Thu, Nov 05, 2015 at 11:04:13AM -0800, John Baldwin wrote:
> > > > > On Thursday, November 05, 2015 04:26:28 PM Konstantin Belousov wrote:
> > > > > > On Thu, Nov 05, 2015 at 12:32:18AM +0100, Mateusz Guzik wrote:
> > > > > > > mtx_lock will unconditionally try to grab the lock and, if that
> > > > > > > fails, will call __mtx_lock_sleep, which will immediately try to
> > > > > > > do the same atomic op again.
> > > > > > >
> > > > > > > So, the obvious microoptimization is to check the state in
> > > > > > > __mtx_lock_sleep and avoid the operation if the lock is not free.
> > > > > > >
> > > > > > > This gives me ~40% speedup in a microbenchmark of 40 find
> > > > > > > processes traversing tmpfs and contending on the mount mtx (only
> > > > > > > used as an easy benchmark, I have WIP patches to get rid of it).
> > > > > > >
> > > > > > > The second part of the patch is optional and just checks the
> > > > > > > state of the lock prior to doing any atomic operations, but it
> > > > > > > gives a very modest speedup when applied on top of the
> > > > > > > __mtx_lock_sleep change. As such, I'm not going to defend this
> > > > > > > part.
> > > > > > Shouldn't the same consideration be applied to all spinning loops,
> > > > > > i.e. also to the spin/thread mutexes, and to the spinning parts of
> > > > > > sx and lockmgr?
> > > > >
> > > > > I agree.  I think both changes are good and worth doing in our other
> > > > > primitives.
> > > >
> > > > I glanced over e.g. rw_rlock and it did not have the issue; now that I
> > > > see _sx_xlock_hard, it would indeed use fixing.
> > > >
> > > > Expect a patch in a few hours for all primitives I'll find. I'll
> > > > stress test the kernel, but it is unlikely I'll do microbenchmarks for
> > > > the remaining primitives.
> > >
> > > Is this stuff you're proposing still valid for non-x86 platforms?
> >
> > Yes.  It just does a read before trying the atomic compare and swap, and
> > falls through to the hard case, as if the atomic op had failed, when the
> > result of the read would cause the compare to fail.
> >
>
> The atomic ops include barriers; the new pre-read of the variable
> doesn't.  Will that cause problems, especially for code inside a loop
> where the compiler may decide to shuffle things around?
>

The lock value is volatile, so the compiler should not screw us up. I
removed the store to the variable due to a different concern: in the worst
case we would have slept instead of spinning (see below).

> I suspect the performance gain will be biggest on the platforms where
> atomic ops are expensive (I gather from various code comments that's
> the case on x86).
>

I don't know about other architectures; on x86, atomic op performance on
contended cachelines deteriorates drastically.

=======================================================

For the most part the patch below makes two simple kinds of changes:

1. in macros for lock/unlock primitives, the lock value is fetched and
compared against the relevant _UNLOCKED macro prior to the atomic op

2. in functions:

before:
	while (!_mtx_obtain_lock(m, tid)) {
		v = m->mtx_lock;
		if (v != MTX_UNOWNED) {
			....
		}
		.....
	}

after:
	for (;;) {
		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
			break;
		v = m->mtx_lock;
		if (v != MTX_UNOWNED) {
			....
		}
		.....
	}

The original patch preloaded the 'v' variable. If the value was MTX_UNOWNED
and _mtx_obtain_lock failed, we just lost the race. In order to spin waiting
for the other thread to release the lock, we have to re-read the variable.
Thus, for simplicity, it is no longer stored in 'v'.

Note this is contrary to kib's earlier suggestion, which would put such
threads directly to sleep. I don't have proper measurements to form a strong
opinion here; I went the aforementioned route to minimise changes.
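Purely as an illustration (this is not part of the patch, and the names are
made up): the change amounts to the classic test-and-test-and-set pattern.
A minimal userland sketch of the same idea, written with C11 atomics rather
than the kernel's atomic(9) primitives, could look like this:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock: 0 means unlocked, otherwise the value is the owner's tid. */
struct toy_lock {
	atomic_uintptr_t state;
};

static bool
toy_trylock(struct toy_lock *l, uintptr_t tid)
{
	uintptr_t expected = 0;

	/* Single CAS with acquire semantics, analogous to _mtx_obtain_lock(). */
	return (atomic_compare_exchange_strong_explicit(&l->state, &expected,
	    tid, memory_order_acquire, memory_order_relaxed));
}

static void
toy_lock_acquire(struct toy_lock *l, uintptr_t tid)
{
	for (;;) {
		/*
		 * Plain read first: while the lock is owned, waiters spin on
		 * a cacheline they can all hold in shared state instead of
		 * hammering it with atomic read-modify-write attempts.
		 */
		if (atomic_load_explicit(&l->state, memory_order_relaxed) == 0 &&
		    toy_trylock(l, tid))
			break;
		/* Still owned; keep spinning (the kernel would block/adapt here). */
	}
}

The point of the plain load is that it only needs the cacheline in shared
state, so contending CPUs stop invalidating each other's copies; the
expensive atomic op is attempted only once the lock looks free.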
Note this is a trivial patch to improve the situation a little; I have no
interest in trying to polish these primitives at this time.

For overall results, on a machine with 32GB of RAM and the following:

CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz (2800.06-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 10 core(s) x 2 SMT threads

this reduced make -j 40 kernel in a 10.2 jail from ~2:15 to ~2:05.

=======================================================

diff --git a/sys/kern/kern_lock.c b/sys/kern/kern_lock.c
index aa67180..198e93e 100644
--- a/sys/kern/kern_lock.c
+++ b/sys/kern/kern_lock.c
@@ -787,8 +787,10 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			break;
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED,
-		    tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
@@ -1124,7 +1126,11 @@ __lockmgr_args(struct lock *lk, u_int flags, struct lock_object *ilk,
 			    __func__, iwmesg, file, line);
 		}
 
-		while (!atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid)) {
+		for (;;) {
+			if (lk->lk_lock == LK_UNLOCKED &&
+			    atomic_cmpset_acq_ptr(&lk->lk_lock, LK_UNLOCKED, tid))
+				break;
+
 #ifdef HWPMC_HOOKS
 			PMC_SOFT_CALL( , , lock, failed);
 #endif
diff --git a/sys/kern/kern_mutex.c b/sys/kern/kern_mutex.c
index bec8f6b..51e82a3 100644
--- a/sys/kern/kern_mutex.c
+++ b/sys/kern/kern_mutex.c
@@ -419,7 +419,9 @@ __mtx_lock_sleep(volatile uintptr_t *c, uintptr_t tid, int opts,
 	all_time -= lockstat_nsecs(&m->lock_object);
 #endif
 
-	while (!_mtx_obtain_lock(m, tid)) {
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
@@ -602,8 +604,9 @@ _mtx_lock_spin_cookie(volatile uintptr_t *c, uintptr_t tid, int opts,
 #ifdef KDTRACE_HOOKS
 	spin_time -= lockstat_nsecs(&m->lock_object);
 #endif
-	while (!_mtx_obtain_lock(m, tid)) {
-
+	for (;;) {
+		if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+			break;
 		/* Give interrupts a chance while we spin. */
 		spinlock_exit();
 		while (m->mtx_lock != MTX_UNOWNED) {
@@ -675,7 +678,9 @@ retry:
 			    m->lock_object.lo_name, file, line));
 		WITNESS_CHECKORDER(&m->lock_object,
 		    opts | LOP_NEWORDER | LOP_EXCLUSIVE, file, line, NULL);
-		while (!_mtx_obtain_lock(m, tid)) {
+		for (;;) {
+			if (m->mtx_lock == MTX_UNOWNED && _mtx_obtain_lock(m, tid))
+				break;
 			if (m->mtx_lock == tid) {
 				m->mtx_recurse++;
 				break;
diff --git a/sys/kern/kern_rwlock.c b/sys/kern/kern_rwlock.c
index 6541724..6a904d2 100644
--- a/sys/kern/kern_rwlock.c
+++ b/sys/kern/kern_rwlock.c
@@ -771,7 +771,9 @@ __rw_wlock_hard(volatile uintptr_t *c, uintptr_t tid, const char *file,
 	all_time -= lockstat_nsecs(&rw->lock_object);
 	state = rw->rw_lock;
 #endif
-	while (!_rw_write_lock(rw, tid)) {
+	for (;;) {
+		if (rw->rw_lock == RW_UNLOCKED && _rw_write_lock(rw, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/kern/kern_sx.c b/sys/kern/kern_sx.c
index 96e117b..2a81c04 100644
--- a/sys/kern/kern_sx.c
+++ b/sys/kern/kern_sx.c
@@ -544,7 +544,10 @@ _sx_xlock_hard(struct sx *sx, uintptr_t tid, int opts, const char *file,
 	all_time -= lockstat_nsecs(&sx->lock_object);
 	state = sx->sx_lock;
 #endif
-	while (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid)) {
+	for (;;) {
+		if (sx->sx_lock == SX_LOCK_UNLOCKED &&
+		    atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+			break;
 #ifdef KDTRACE_HOOKS
 		spin_cnt++;
 #endif
diff --git a/sys/sys/mutex.h b/sys/sys/mutex.h
index a9ec072..0443922 100644
--- a/sys/sys/mutex.h
+++ b/sys/sys/mutex.h
@@ -185,7 +185,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 #define __mtx_lock(mp, tid, opts, file, line) do {			\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_mtx_obtain_lock((mp), _tid))				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid)))\
 		_mtx_lock_sleep((mp), _tid, (opts), (file), (line));	\
 	else								\
 		LOCKSTAT_PROFILE_OBTAIN_LOCK_SUCCESS(adaptive__acquire,	\
@@ -203,7 +203,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
 	spinlock_enter();						\
-	if (!_mtx_obtain_lock((mp), _tid)) {				\
+	if (((mp)->mtx_lock != MTX_UNOWNED || !_mtx_obtain_lock((mp), _tid))) {\
 		if ((mp)->mtx_lock == _tid)				\
 			(mp)->mtx_recurse++;				\
 		else							\
@@ -232,7 +232,7 @@ void thread_lock_flags_(struct thread *, int, const char *, int);
 									\
 	if ((mp)->mtx_recurse == 0)					\
 		LOCKSTAT_PROFILE_RELEASE_LOCK(adaptive__release, mp);	\
-	if (!_mtx_release_lock((mp), _tid))				\
+	if ((mp)->mtx_lock != _tid || !_mtx_release_lock((mp), _tid))	\
 		_mtx_unlock_sleep((mp), (opts), (file), (line));	\
 } while (0)
diff --git a/sys/sys/rwlock.h b/sys/sys/rwlock.h
index f8947c5..6b4f505 100644
--- a/sys/sys/rwlock.h
+++ b/sys/sys/rwlock.h
@@ -96,7 +96,7 @@
 #define __rw_wlock(rw, tid, file, line) do {				\
 	uintptr_t _tid = (uintptr_t)(tid);				\
 									\
-	if (!_rw_write_lock((rw), _tid))				\
+	if ((rw)->rw_lock != RW_UNLOCKED || !_rw_write_lock((rw), _tid))\
 		_rw_wlock_hard((rw), _tid, (file), (line));		\
 	else								\
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(rw__acquire, rw,	\
@@ -112,7 +112,7 @@
 	else {								\
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(rw__release, rw,	\
 		    LOCKSTAT_WRITER);					\
-		if (!_rw_write_unlock((rw), _tid))			\
+		if ((rw)->rw_lock != _tid || !_rw_write_unlock((rw), _tid))\
 			_rw_wunlock_hard((rw), _tid, (file), (line));	\
 	}								\
 } while (0)
diff --git a/sys/sys/sx.h b/sys/sys/sx.h
index 96a664f..144cab5 100644
--- a/sys/sys/sx.h
+++ b/sys/sys/sx.h
@@ -150,7 +150,8 @@ __sx_xlock(struct sx *sx, struct thread *td, int opts, const char *file,
 	uintptr_t tid = (uintptr_t)td;
 	int error = 0;
 
-	if (!atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
+	if (sx->sx_lock != SX_LOCK_UNLOCKED ||
+	    !atomic_cmpset_acq_ptr(&sx->sx_lock, SX_LOCK_UNLOCKED, tid))
 		error = _sx_xlock_hard(sx, tid, opts, file, line);
 	else
 		LOCKSTAT_PROFILE_OBTAIN_RWLOCK_SUCCESS(sx__acquire, sx,
@@ -168,7 +169,8 @@ __sx_xunlock(struct sx *sx, struct thread *td, const char *file, int line)
 	if (sx->sx_recurse == 0)
 		LOCKSTAT_PROFILE_RELEASE_RWLOCK(sx__release, sx,
 		    LOCKSTAT_WRITER);
-	if (!atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
+	if (sx->sx_lock != tid ||
+	    !atomic_cmpset_rel_ptr(&sx->sx_lock, tid, SX_LOCK_UNLOCKED))
 		_sx_xunlock_hard(sx, tid, file, line);
 }