Date: Mon, 14 Jan 2013 13:58:32 -0500
From: John Baldwin <jhb@freebsd.org>
To: David Chisnall <theraven@freebsd.org>
Cc: toolchain@freebsd.org, Jilles Tjoelker <jilles@stack.nl>, freebsd-arch@freebsd.org
Subject: Re: Fast sigblock (AKA rtld speedup)
Message-ID: <201301141358.33216.jhb@freebsd.org>
In-Reply-To: <D6772A0E-FBA4-4168-B152-7E7694720A16@FreeBSD.org>
References: <20130107182235.GA65279@kib.kiev.ua> <20130114174703.GB88220@stack.nl> <D6772A0E-FBA4-4168-B152-7E7694720A16@FreeBSD.org>

On Monday, January 14, 2013 1:24:04 pm David Chisnall wrote:
> On 14 Jan 2013, at 17:47, Jilles Tjoelker wrote:
>
> > The code which does that check is actually under contrib/gcc. Problem
> > is, they designed __gthread_active_p() to distinguish threaded and
> > unthreaded programming environments -- it must be known in advance and
> > cannot be changed later. The code for the unthreaded environment then
> > takes advantage of this by not even allocating memory for mutexes in
> > some cases.
>
> It's worth taking a step back and asking why this code exists at all, and the main reason is that acquiring a mutex used to be really expensive. It still is on some fruit-flavoured operating systems, but elsewhere it's a single atomic operation in the uncontended case, and in that case the cache line will already be exclusively owned by the calling core in single-threaded code.
>
> I would much rather that we followed the example of Solaris and made the multithreaded case fast and the default than keep piling on hacks that allow code to shave off a few clock cycles in the single-threaded case. In particular, the popularity of multicore systems means that it is increasingly rare for code to be both single threaded and performance critical, so this seems like misplaced optimisation.

We have single-threaded performance critical applications that run on multicore systems (we just run several copies) and if we link in libthr, then pthread_mutex operations (even on uncontested locks) show up as one of the top consumers of CPU time when we profile our applications.

> I strongly suspect that making it possible to inline the uncontended lock case for a pthread mutex and eliminating all of the branches on __isthreaded would give us a net speedup in both single and multithreaded cases.

I'm less certain. Note that you can't inline mutex ops until you expose the mutexes themselves to userland (that is, making pthread_mutex_t not be opaque).

--
John Baldwin
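
For readers following the thread, the point about inlining can be illustrated with a rough sketch. It assumes a hypothetical mutex type whose lock word is visible to the caller, which is exactly the non-opaque layout the reply says would be required. The names `sketch_mutex`, `sketch_mutex_lock`, `sketch_mutex_trylock`, `sketch_mutex_unlock` and `sketch_mutex_lock_slow` are invented for illustration and are not libthr's API or layout; the slow path here simply yields and retries, where a real implementation would sleep in the kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical non-opaque mutex: the lock word is exposed to callers so
 * the uncontended path can be inlined. This is NOT libthr's layout. */
struct sketch_mutex {
    atomic_int state;               /* 0 = unlocked, 1 = locked */
};

#define SKETCH_MUTEX_INITIALIZER { 0 }

/* Out-of-line slow path. A real implementation would block in the
 * kernel; this stand-in just yields and retries. */
static void sketch_mutex_lock_slow(struct sketch_mutex *m)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&m->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;               /* the failed CAS overwrote it */
        sched_yield();
    }
}

/* Inlined fast path: a single compare-and-swap when uncontended. */
static inline void sketch_mutex_lock(struct sketch_mutex *m)
{
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&m->state, &expected, 1,
            memory_order_acquire, memory_order_relaxed))
        return;
    sketch_mutex_lock_slow(m);
}

/* Returns true if the lock was acquired, false if it was already held. */
static inline bool sketch_mutex_trylock(struct sketch_mutex *m)
{
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(&m->state, &expected, 1,
               memory_order_acquire, memory_order_relaxed);
}

/* Release the lock; the assert catches unlocking an unheld mutex. */
static inline void sketch_mutex_unlock(struct sketch_mutex *m)
{
    int prev = atomic_exchange_explicit(&m->state, 0, memory_order_release);
    assert(prev == 1);
    (void)prev;
}
```

The fast path can only be inlined because the caller's compiler can see `struct sketch_mutex`; with an opaque `pthread_mutex_t` every lock and unlock remains a function call into the threading library. Whether this yields a net speedup is the open question in the exchange above, and the sketch does not settle it.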