From: John Baldwin <jhb@freebsd.org>
To: Jeff Roberson
Cc: Attilio Rao, freebsd-smp@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: rwlocks: poor performance with adaptive spinning
Date: Tue, 25 Sep 2007 13:14:42 -0400
Message-Id: <200709251314.42707.jhb@freebsd.org>
In-Reply-To: <20070924135554.F547@10.0.0.1>

On Monday 24 September 2007 04:57:06 pm Jeff Roberson wrote:
> On Mon, 24 Sep 2007, John Baldwin wrote:
>
> > On Saturday 22 September 2007 10:32:06 pm Attilio Rao wrote:
> >> Recently several people have reported starvation problems with
> >> rwlocks.  In particular, users who tried rwlocks in big SMP
> >> environments (16+ CPUs) found them quite subject to poor performance
> >> and to starvation of waiters.
> >>
> >> Inspecting the code, something strange about adaptive spinning popped
> >> up: basically, for rwlocks, the adaptive spinning stubs seem to be
> >> placed too deep in the decision loop.  The placement of the stubs
> >> means that a thread which is about to spin adaptively first sets the
> >> respective (read or write) waiters flag, which sends the owner of the
> >> lock down the hard path of the locking functions and makes it perform
> >> a full wakeup even if the waiter queues turn out to be empty.  This
> >> is a big penalty for adaptive spinning and can make it completely
> >> useless.  In addition to this, adaptive spinning only runs in the
> >> turnstile spinlock path, which is not ideal.
> >>
> >> This patch ports the approach already used for adaptive spinning in
> >> sx locks over to rwlocks:
> >> http://users.gufi.org/~rookie/works/patches/kern_rwlock.diff
> >>
> >> For sx locks it is unlikely to show big benefits because they are
> >> held for long periods, but for rwlocks the situation is rather
> >> different.  I would like to see if people can benchmark this patch
> >> (maybe in private environments?) as I'm not able to do so myself in
> >> the short term.
> >>
> >> Adaptive spinning in rwlocks can be improved further with other
> >> tricks (adding a backoff counter, for example, or trying to spin
> >> with the lock held in read mode too), but we should first be sure to
> >> start from a solid base.
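
To make the ordering problem concrete, here is a rough sketch of the fix
being proposed (simplified pseudocode in the style of the loops shown
below, not the actual kern_rwlock.c code): the adaptive spin is attempted
before the waiters flag is set, so a spinning thread never forces the
owner down the hard unlock path:

	while (!_obtain_lock) {
		if (owner is a writer running on another CPU) {
			/* Spin without touching the lock word, so the
			   owner can still release via the fast path. */
			while (lock owner unchanged and still running)
				cpu_spinwait();
			continue;
		}
		/* Only a thread that is actually going to block sets
		   the waiters flag and goes through the turnstile. */
		if (failed to set waiters flag)
			continue;
		block on turnstile;
	}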
> >
> > I did this for mutexes and rwlocks over a year ago and Kris found it
> > was slower in benchmarks.  www.freebsd.org/~jhb/patches/lock_adapt.patch
> > is the last thing I sent kris@ to test (it only has the mutex changes).
> > This might be a bigger win post-thread_lock, since thread_lock seems
> > to have heavily pessimized adaptive spinning: it now enqueues the
> > thread and then dequeues it again before doing the adaptive spin.  I
> > liked the approach originally because it simplifies the code a lot.  A
> > separate issue is that writers don't spin at all if a reader holds the
> > lock, and I think one thing to test for that would be an adaptive spin
> > with a static timeout.
>
> We don't enqueue the thread until the same place; we just acquire an
> extra spinlock.  The thread is not enqueued until turnstile_wait(), as
> before.

So I looked at this some more and now I don't understand why the trywait()
and cancel() changes were done.  Some background: when the initial
turnstile code was split out of the mutex code, the main loop in
_mtx_lock_sleep() looked like this:

	while (!_obtain_lock) {
		ts = turnstile_lookup();	/* (A) locked ts chain and did
						   O(n) lookup to find
						   existing ts */
		if (lock unlocked) {		/* (B) */
			turnstile_release();
			continue;
		}
		if (failed to set contested flag) {	/* (C) */
			turnstile_release();
			continue;
		}
		if (owner is running) {		/* (D) */
			turnstile_release();
			... spin ...
			continue;
		}
		turnstile_wait(ts, ...);	/* (E) */
	}

So one thing I noticed after a while is that every iteration of this loop
was doing the O(n) lookup to find the turnstile even though the turnstile
is only used in (E).  The lookup was wasted overhead for the (B), (C), and
(D) cases.  Hence this commit a while later:

  jhb         2004-10-12 18:36:20 UTC

    FreeBSD src repository

    Modified files:
      sys/kern             kern_condvar.c kern_mutex.c kern_synch.c
                           subr_sleepqueue.c subr_turnstile.c
      sys/sys              sleepqueue.h turnstile.h
    Log:
    Refine the turnstile and sleep queue interfaces just a bit:
    - Add a new _lock() call to each API that locks the associated chain
      lock for a lock_object pointer or wait channel.  The _lookup()
      functions now require that the chain lock be locked via _lock()
      when they are called.
    - Change sleepq_add(), turnstile_wait() and turnstile_claim() to
      lookup the associated queue structure internally via _lookup()
      rather than accepting a pointer from the caller.  For turnstiles,
      this means that the actual lookup of the turnstile in the hash
      table is only done when the thread actually blocks rather than
      being done on each loop iteration in _mtx_lock_sleep().  For sleep
      queues, this means that sleepq_lookup() is no longer used outside
      of the sleep queue code except to implement an assertion in
      cv_destroy().
    - Change sleepq_broadcast() and sleepq_signal() to require that the
      chain lock is already held when they are called.  For condition
      variables, this lets the cv_broadcast() and cv_signal() functions
      lock the sleep queue chain lock while testing the waiters count.
      This means that the waiters count internal to condition variables
      is no longer protected by the interlock mutex, and cv_broadcast()
      and cv_signal() now no longer require that the interlock be held
      when they are called.  This lets consumers of condition variables
      drop the lock before waking other threads, which can result in
      fewer context switches.

    MFC after:  1 month

    Revision  Changes    Path
    1.52      +16 -15    src/sys/kern/kern_condvar.c
    1.151     +4 -5      src/sys/kern/kern_mutex.c
    1.263     +4 -3      src/sys/kern/kern_synch.c
    1.13      +31 -14    src/sys/kern/subr_sleepqueue.c
    1.150     +34 -12    src/sys/kern/subr_turnstile.c
    1.5       +11 -12    src/sys/sys/sleepqueue.h
    1.5       +17 -16    src/sys/sys/turnstile.h

After that commit the mutex loop looked like this:

	while (!_obtain_lock) {
		turnstile_lock();		/* (A) locked ts chain only */
		if (lock unlocked) {		/* (B) */
			turnstile_release();
			continue;
		}
		if (failed to set contested flag) {	/* (C) */
			turnstile_release();
			continue;
		}
		if (owner is running) {		/* (D) */
			turnstile_release();
			... spin ...
			continue;
		}
		turnstile_wait(...);		/* (E) does the O(n) lookup */
	}

That is, after the above commit, we only did the O(n) lookup when we
needed the actual turnstile object in (E).  This decreased the overhead
for the (B), (C), and (D) cases.

The changes that added trywait() and cancel() basically seem to have
reverted the above commit by going back to doing the O(n) lookup in (A),
thus re-adding the overhead to the (B), (C), and (D) cases.  In addition,
trywait() / cancel() actually maintain some additional state and acquire
another lock, so (B), (C), and (D) now have even more overhead than the
first cut of the turnstile code: cancel() has to back out some of those
changes rather than doing just a single spinlock unlock.
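
For comparison, my reading of the current loop shape is roughly the
following (the same kind of sketch as the loops above, not the literal
code, writing turnstile_trywait() / turnstile_cancel() for the trywait()
and cancel() calls discussed here):

	while (!_obtain_lock) {
		ts = turnstile_trywait();	/* (A) locks ts chain, does
						   the O(n) lookup, and takes
						   an extra spinlock */
		if (lock unlocked) {		/* (B) */
			turnstile_cancel(ts);	/* backs state out, not just
						   one spinlock unlock */
			continue;
		}
		if (failed to set contested flag) {	/* (C) */
			turnstile_cancel(ts);
			continue;
		}
		if (owner is running) {		/* (D) */
			turnstile_cancel(ts);
			... spin ...
			continue;
		}
		turnstile_wait(ts, ...);	/* (E) */
	}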
This is actually why I had assumed you had enqueued the turnstile in
trywait(), as I thought you had dropped the chain lock in trywait()
(which would require enqueueing the thread in trywait(), but would also
greatly complicate cancel()).

Otherwise, the current trywait() / cancel() changes seem like a step
backwards to me.  Do you really think it is necessary to do the O(n)
lookup even when we don't need it (B, C, D) rather than just doing it in
turnstile_wait() (E)?

-- 
John Baldwin