From owner-freebsd-threads@FreeBSD.ORG  Sun Sep 21 21:37:45 2014
Return-Path: <owner-freebsd-threads@FreeBSD.ORG>
Delivered-To: freebsd-threads@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id AFD73914;
 Sun, 21 Sep 2014 21:37:45 +0000 (UTC)
Received: from mx1.stack.nl (relay02.stack.nl [IPv6:2001:610:1108:5010::104])
 (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits))
 (Client CN "mailhost.stack.nl",
 Issuer "CA Cert Signing Authority" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 755B2C3E;
 Sun, 21 Sep 2014 21:37:45 +0000 (UTC)
Received: from snail.stack.nl (snail.stack.nl [IPv6:2001:610:1108:5010::131])
 by mx1.stack.nl (Postfix) with ESMTP id 7A079358C54;
 Sun, 21 Sep 2014 23:37:42 +0200 (CEST)
Received: by snail.stack.nl (Postfix, from userid 1677)
 id 4287F28494; Sun, 21 Sep 2014 23:37:42 +0200 (CEST)
Date: Sun, 21 Sep 2014 23:37:42 +0200
From: Jilles Tjoelker <jilles@stack.nl>
To: freebsd-threads@freebsd.org
Subject: sem_post() performance
Message-ID: <20140921213742.GA46868@stack.nl>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: adrian@freebsd.org
X-BeenThere: freebsd-threads@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Threading on FreeBSD <freebsd-threads.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-threads>, 
 <mailto:freebsd-threads-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-threads/>
List-Post: <mailto:freebsd-threads@freebsd.org>
List-Help: <mailto:freebsd-threads-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-threads>,
 <mailto:freebsd-threads-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 21 Sep 2014 21:37:45 -0000

It has been reported that POSIX semaphores are slow in contexts such as
Python. Note that on FreeBSD, POSIX semaphores are the only
synchronization objects that can be used by different processes in
shared memory; this does not work for mutexes and condition variables
because pthread_mutex_t and pthread_cond_t are pointers to the actual
data structure.

In fact, sem_post() unconditionally performs a umtx system call, even
when no thread is waiting on the semaphore.

To avoid both lost wakeups and possible writes to a destroyed semaphore,
an uncontested sem_post() must check the _has_waiters flag atomically
with incrementing _count.
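
As a rough illustration of that requirement (not part of the patch), the
sketch below models the flag and the count as the two halves of one
64-bit word and updates them with a single compare-and-swap, using C11
atomics. The names model_post_fast, MODEL_COUNT_SHIFT and
MODEL_VALUE_MAX, and the choice of which half holds which field, are
assumptions made up for this example; they are not the libc layout.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model: low 32 bits = has_waiters, high 32 bits = count. */
#define MODEL_COUNT_SHIFT	32
#define MODEL_VALUE_MAX		0x7fffffffu	/* assumed limit for the model */

/*
 * Returns 1 if the count was incremented while no waiter was registered,
 * 0 if a waiter is registered (the caller must take the slow path that
 * wakes it), and -1 if the count would overflow.
 */
static int
model_post_fast(_Atomic uint64_t *word)
{
	uint64_t old = atomic_load_explicit(word, memory_order_relaxed);

	for (;;) {
		uint32_t waiters = (uint32_t)old;
		uint32_t count = (uint32_t)(old >> MODEL_COUNT_SHIFT);

		if (waiters != 0)
			return (0);
		if (count >= MODEL_VALUE_MAX)
			return (-1);
		/*
		 * The compare covers both halves, so the increment only
		 * succeeds if has_waiters is still 0 at that instant.
		 */
		if (atomic_compare_exchange_weak_explicit(word, &old,
		    (uint64_t)(count + 1) << MODEL_COUNT_SHIFT,
		    memory_order_release, memory_order_relaxed))
			return (1);
		/* The failed exchange reloaded old; retry with it. */
	}
}
```

If a waiter sets the flag between the load and the exchange, the
exchange fails and the loop sees the flag on the next pass, so the
wakeup is not lost and nothing is written after the fast path returns.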

The proper way to do this would be to take one bit from _count and use
it for the _has_waiters flag; the definition of SEM_VALUE_MAX permits
this. However, this would require a new set of umtx semaphore operations
and would break the ABI of process-shared semaphores (things may break
if an old and a new libc access the same semaphore over shared memory).
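
To make the one-bit idea concrete, here is a minimal userland sketch
with C11 atomics. It assumes the count fits in 31 bits so that the top
bit is free for the flag; SKETCH_HAS_WAITERS, SKETCH_MAX_COUNT and
sketch_post are hypothetical names, and the kernel side (the new umtx
operations mentioned above) is not shown.

```c
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_HAS_WAITERS	0x80000000u	/* hypothetical flag bit */
#define SKETCH_MAX_COUNT	0x7fffffffu	/* assumed maximum count */

/*
 * Increment the count held in the low 31 bits.  On success, *need_wake
 * says whether the waiters bit was set in the same word at the moment
 * of the increment.  Returns 0 on success, -1 if the count is at its
 * maximum.
 */
static int
sketch_post(_Atomic uint32_t *word, int *need_wake)
{
	uint32_t old = atomic_load_explicit(word, memory_order_relaxed);

	do {
		if ((old & SKETCH_MAX_COUNT) == SKETCH_MAX_COUNT)
			return (-1);
		/* old + 1 cannot carry into the flag bit here. */
	} while (!atomic_compare_exchange_weak_explicit(word, &old, old + 1,
	    memory_order_release, memory_order_relaxed));

	*need_wake = (old & SKETCH_HAS_WAITERS) != 0;
	return (0);
}
```

Only when *need_wake is nonzero would the caller have to enter the
kernel; an ordinary 32-bit atomic is enough, so no alignment trick is
needed.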

This diff only affects 32-bit aligned but 64-bit misaligned semaphores
on 64-bit systems, and changes _count and _has_waiters atomically using
a 64-bit atomic operation. It probably needs a may_alias attribute for
correctness, but <sys/cdefs.h> does not have a wrapper for that.
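
The alignment condition and the may_alias concern can be illustrated
with a stand-alone sketch. The struct definitions below are an assumed
mirror of the layout (a 32-bit magic followed by has_waiters, count and
flags), written for this example rather than copied from the headers,
and all sketch_* names are hypothetical. The typedef uses the GCC/Clang
__may_alias__ attribute directly, since there is no cdefs wrapper.

```c
#include <stdint.h>

/* Assumed layout, for illustration only. */
struct sketch_usem {
	uint32_t has_waiters;
	uint32_t count;
	uint32_t flags;
};
struct sketch_sem {
	uint32_t magic;
	struct sketch_usem kern;
};

/* Places a sketch_sem at an address that is 4 mod 8. */
struct sketch_holder {
	_Alignas(8) uint32_t pad;
	struct sketch_sem sem;
};

/* 64-bit type that is allowed to alias the two 32-bit fields. */
typedef uint64_t __attribute__((__may_alias__)) sketch_u64_alias;

/* True if has_waiters and count share one aligned 64-bit word. */
static int
sketch_pair_is_aligned(const struct sketch_sem *sem)
{
	return (((uintptr_t)&sem->kern.has_waiters & 7) == 0);
}

/* Plain (non-atomic) read of both fields as a single 64-bit value. */
static uint64_t
sketch_load_pair(const struct sketch_sem *sem)
{
	return (*(const sketch_u64_alias *)
	    (const void *)&sem->kern.has_waiters);
}
```

With this layout a semaphore that is itself 64-bit aligned has the pair
at offset 4, which is why only the oddly aligned case qualifies.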

Some x86 CPUs may cope with misaligned atomic ops without destroying
performance (especially if they do not cross a cache line), so the
alignment restriction could be relaxed to make the patch more practical.

Many CPUs in the i386 architecture have a 64-bit atomic op (cmpxchg8b)
which could be used here.

This appears to restore performance of 10-stable uncontested semaphores
with the strange alignment to 9-stable levels (measured with a tight
loop of sem_wait and sem_post). I have not tested it under any real
workload.
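
For anyone who wants to repeat this kind of measurement, a tight loop
along the following lines is enough. This is a sketch and not the
program used for the numbers above: sketch_bench is a hypothetical
name, the semaphore is an ordinary stack variable (so it does not have
the odd alignment the patch targets), and it is process-private.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <semaphore.h>
#include <time.h>

/*
 * Post and wait on an uncontested semaphore iters times and report the
 * average cost of one post/wait pair in nanoseconds.
 * Returns 0 on success, -1 on error.
 */
static int
sketch_bench(unsigned long iters, double *ns_per_iter)
{
	struct timespec t0, t1;
	sem_t sem;
	unsigned long i;

	if (iters == 0 || sem_init(&sem, 0, 0) != 0)
		return (-1);
	if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) {
		sem_destroy(&sem);
		return (-1);
	}
	for (i = 0; i < iters; i++) {
		/* Post first, so the wait never has to block. */
		if (sem_post(&sem) != 0 || sem_wait(&sem) != 0) {
			sem_destroy(&sem);
			return (-1);
		}
	}
	if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) {
		sem_destroy(&sem);
		return (-1);
	}
	sem_destroy(&sem);

	*ns_per_iter = ((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	    (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters;
	return (0);
}
```

Calling sketch_bench(10000000, &ns) on builds with and without the
patch gives a per-iteration figure to compare; to exercise the patched
path the semaphore would have to be placed at a 4 mod 8 address.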

Index: lib/libc/gen/sem_new.c
===================================================================
--- lib/libc/gen/sem_new.c	(revision 269952)
+++ lib/libc/gen/sem_new.c	(working copy)
@@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
 	if (sem_check_validity(sem) != 0)
 		return (-1);
 
+#ifdef __LP64__
+	if (((uintptr_t)&sem->_kern._has_waiters & 7) == 0) {
+		uint64_t oldval, newval;
+
+		while (!sem->_kern._has_waiters) {
+			count = sem->_kern._count;
+			if (count + 1 > SEM_VALUE_MAX)
+				return (EOVERFLOW);
+			/*
+			 * Expect _count == count and _has_waiters == 0.
+			 */
+#if BYTE_ORDER == LITTLE_ENDIAN
+			oldval = (uint64_t)count << 32;
+			newval = (uint64_t)(count + 1) << 32;
+#elif BYTE_ORDER == BIG_ENDIAN
+			oldval = (uint64_t)count;
+			newval = (uint64_t)(count + 1);
+#else
+#error Unknown byte order
+#endif
+			if (atomic_cmpset_rel_64((uint64_t *)
+			    &sem->_kern._has_waiters, oldval, newval))
+				return (0);
+		}
+	}
+#endif
 	do {
 		count = sem->_kern._count;
 		if (count + 1 > SEM_VALUE_MAX)

-- 
Jilles Tjoelker