Date: Tue, 23 Sep 2014 23:20:00 +0200
From: Jilles Tjoelker
To: John Baldwin
Cc: adrian@freebsd.org, freebsd-threads@freebsd.org
Subject: Re: sem_post() performance
Message-ID: <20140923212000.GA78110@stack.nl>
References: <20140921213742.GA46868@stack.nl> <1531724.MPBlj40xOW@ralph.baldwin.cx>
In-Reply-To: <1531724.MPBlj40xOW@ralph.baldwin.cx>
List-Id: Threading on FreeBSD

On Mon, Sep 22, 2014 at 03:53:13PM -0400, John Baldwin wrote:
> On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
> > It has been reported that POSIX semaphores are slow, in contexts such
> > as Python. Note that POSIX semaphores are the only synchronization
> > objects that support use by different processes in shared memory; this
> > does not work for mutexes and condition variables because they are
> > pointers to the actual data structure.

> > In fact, sem_post() unconditionally performs an umtx system call.

> *sigh*  I was worried that that might be the case.

> > To avoid both lost wakeups and possible writes to a destroyed
> > semaphore, an uncontested sem_post() must check the _has_waiters flag
> > atomically with incrementing _count.

> > The proper way to do this would be to take one bit from _count and use
> > it for the _has_waiters flag; the definition of SEM_VALUE_MAX permits
> > this. However, this would require a new set of umtx semaphore
> > operations and will break the ABI of process-shared semaphores (things
> > may break if an old and a new libc access the same semaphore over
> > shared memory).

> > This diff only affects 32-bit aligned but 64-bit misaligned semaphores
> > on 64-bit systems, and changes _count and _has_waiters atomically
> > using a 64-bit atomic operation. It probably needs a may_alias
> > attribute for correctness, but does not have a wrapper for that.

> It wasn't clear on first reading, but you are using aliasing to get
> around the need for new umtx calls by using a 64-bit atomic op to
> adjust two ints at the same time, yes?  Note that since a failing
> semaphore op calls into the kernel for the "hard" case, you might in
> fact be able to change the ABI without breaking process-shared
> semaphores.  That is, suppose you left 'has_waiters' as always true
> and reused the high bit of count for has_waiters.
> Would old binaries always trap into the kernel?  (Not sure they will,
> especially in the case where an old binary creates the semaphore; a new
> binary would have to force has_waiters to true in every sem op, but
> even that might not be enough.)

I think that everything will break when binaries linked against the old
and the new libc use the same semaphore. If the new contested bit is
set, the old sem_getvalue() will return garbage, the old sem_trywait()
will fail even if the real count is greater than 0, the old sem_wait()
and sem_timedwait() may spin if the real count is greater than 0, and
the old sem_post() will fail with [EOVERFLOW].

That the "hard" path always issues a system call does not help much,
since the system calls do not write to _count (this is a throughput
optimization, allowing a fast-path thread through while a slow-path
thread is entering or leaving the kernel).
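
For concreteness, the uncontested fast path under such a packed-count
scheme might look roughly like the sketch below, in the style of the
fast path in sem_new.c above. This is only a sketch: SEM_WAITERS_BIT and
the slow-path helper are invented names, and the contested path would
need new umtx operations that understand the packed word.

#define SEM_WAITERS_BIT 0x80000000U     /* invented, not the current ABI */

static int sem_post_packed_slow(sem_t *sem);    /* invented slow path, not shown */

static int
sem_post_packed(sem_t *sem)
{
        uint32_t v;

        for (;;) {
                v = sem->_kern._count;
                if ((v & SEM_WAITERS_BIT) != 0)
                        /* Waiters present: a new umtx op must do the wakeup. */
                        return (sem_post_packed_slow(sem));
                if ((v & ~SEM_WAITERS_BIT) >= SEM_VALUE_MAX)
                        return (EOVERFLOW);
                if (atomic_cmpset_rel_32(&sem->_kern._count, v, v + 1))
                        return (0);
        }
}

The flag and the count then live in a single 32-bit word, so checking
one and incrementing the other is naturally atomic, but only code that
knows about the bit can interpret the word, which is exactly the
incompatibility described above.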

> > Some x86 CPUs may cope with misaligned atomic ops without destroying
> > performance (especially if they do not cross a cache line), so the
> > alignment restriction could be relaxed to make the patch more
> > practical. Many CPUs in the i386 architecture have a 64-bit atomic op
> > (cmpxchg8b) which could be used here.

> > This appears to restore performance of 10-stable uncontested
> > semaphores with the strange alignment to 9-stable levels (a tight loop
> > with sem_wait and sem_post). I have not tested this in any real
> > workload.

> > Index: lib/libc/gen/sem_new.c
> > ===================================================================
> > --- lib/libc/gen/sem_new.c      (revision 269952)
> > +++ lib/libc/gen/sem_new.c      (working copy)
> > @@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
> >         if (sem_check_validity(sem) != 0)
> >                 return (-1);
> >
> > +#ifdef __LP64__
> > +       if (((uintptr_t)&sem->_kern._count & 7) == 0) {
> > +               uint64_t oldval, newval;
> > +
> > +               while (!sem->_kern._has_waiters) {
> > +                       count = sem->_kern._count;
> > +                       if (count + 1 > SEM_VALUE_MAX)
> > +                               return (EOVERFLOW);
> > +                       /*
> > +                        * Expect _count == count and _has_waiters == 0.
> > +                        */
> > +#if BYTE_ORDER == LITTLE_ENDIAN
> > +                       oldval = (uint64_t)count << 32;
> > +                       newval = (uint64_t)(count + 1) << 32;
> > +#elif BYTE_ORDER == BIG_ENDIAN
> > +                       oldval = (uint64_t)count;
> > +                       newval = (uint64_t)(count + 1);
> > +#else
> > +#error Unknown byte order
> > +#endif
> > +                       if (atomic_cmpset_rel_64((uint64_t *)&sem->_kern._count,
> > +                           oldval, newval))
> > +                               return (0);
> > +               }
> > +       }
> > +#endif

> I think this looks ok, but it's a shame we can't fix it more cleanly.

I don't like it at all that you have to either force an exotic alignment
or use unaligned atomic ops.

I saw another interesting sem_* implementation in musl libc. It also
uses a separate waiters word (but an int instead of a boolean), and is
based on the futex calls (which we already have as the wait_uint/wake
umtx calls). It looks rather complicated, but the musl libc people are
usually good at getting this kind of stuff right. I suppose this will
also break subtly when interoperating with an old libc, and adding new
umtx ops would simplify the algorithm.
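
For reference, the general shape of such a scheme might be something
like the sketch below. This is heavily simplified and is not musl's
actual algorithm; the struct, the field names and the helpers are made
up here, and the waiter accounting, cancellation and timeout handling
are glossed over.

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/umtx.h>
#include <machine/atomic.h>

struct xsem {                           /* illustrative layout only */
        uint32_t        _count;
        uint32_t        _nwaiters;
};

static int
xsem_post(struct xsem *s)
{
        uint32_t c;

        do {
                c = s->_count;
                if (c >= SEM_VALUE_MAX)
                        return (EOVERFLOW);
        } while (!atomic_cmpset_rel_32(&s->_count, c, c + 1));
        /* Enter the kernel only if someone may be asleep on _count. */
        if (s->_nwaiters > 0)
                _umtx_op(&s->_count, UMTX_OP_WAKE, 1, NULL, NULL);
        return (0);
}

static int
xsem_wait(struct xsem *s)
{
        uint32_t c;

        for (;;) {
                c = s->_count;
                if (c > 0) {
                        if (atomic_cmpset_acq_32(&s->_count, c, c - 1))
                                return (0);
                        continue;
                }
                atomic_add_32(&s->_nwaiters, 1);
                /* The kernel sleeps only if _count is still 0 here. */
                _umtx_op(&s->_count, UMTX_OP_WAIT_UINT, 0, NULL, NULL);
                atomic_subtract_32(&s->_nwaiters, 1);
        }
}

The attraction is that an uncontested post or wait never enters the
kernel, and the kernel's recheck of _count in UMTX_OP_WAIT_UINT closes
the window between the userland check and the sleep; the cost is the
extra waiters word, which an old libc knows nothing about.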

Consideration: just declare mixing sufficiently different libc versions
on the same process-shared semaphore unsupported, and change SEM_MAGIC
to enforce that? (This does not prevent running old binaries, as long as
they're dynamically linked to libc and you use a new libc.so.)

Consideration 2: use the new implementation only for process-private
semaphores and keep using the old, slow one for process-shared
semaphores?

-- 
Jilles Tjoelker