Date: Tue, 23 Sep 2014 23:20:00 +0200
From: Jilles Tjoelker
To: John Baldwin
Cc: adrian@freebsd.org, freebsd-threads@freebsd.org
Subject: Re: sem_post() performance
Message-ID: <20140923212000.GA78110@stack.nl>
References: <20140921213742.GA46868@stack.nl> <1531724.MPBlj40xOW@ralph.baldwin.cx>
In-Reply-To: <1531724.MPBlj40xOW@ralph.baldwin.cx>
List-Id: Threading on FreeBSD

On Mon, Sep 22, 2014 at 03:53:13PM -0400, John Baldwin wrote:
> On Sunday, September 21, 2014 11:37:42 PM Jilles Tjoelker wrote:
> > It has been reported that POSIX semaphores are slow, in contexts such
> > as Python. Note that POSIX semaphores are the only synchronization
> > objects that support use by different processes in shared memory; this
> > does not work for mutexes and condition variables because they are
> > pointers to the actual data structure.

> > In fact, sem_post() unconditionally performs an umtx system call.

> *sigh*  I was worried that that might be the case.

> > To avoid both lost wakeups and possible writes to a destroyed
> > semaphore, an uncontested sem_post() must check the _has_waiters flag
> > atomically with incrementing _count.

> > The proper way to do this would be to take one bit from _count and use
> > it for the _has_waiters flag; the definition of SEM_VALUE_MAX permits
> > this. However, this would require a new set of umtx semaphore
> > operations and will break the ABI of process-shared semaphores (things
> > may break if an old and a new libc access the same semaphore over
> > shared memory).

> > This diff only affects 32-bit aligned but 64-bit misaligned semaphores
> > on 64-bit systems, and changes _count and _has_waiters atomically
> > using a 64-bit atomic operation. It probably needs a may_alias
> > attribute for correctness, but does not have a wrapper for that.

> It wasn't clear on first reading, but you are using aliasing to get
> around the need for new umtx calls by using a 64-bit atomic op to
> adjust two ints at the same time, yes?  Note that since a failing
> semaphore op calls into the kernel for the "hard" case, you might in
> fact be able to change the ABI without breaking process-shared
> semaphores.  That is, suppose you left 'has_waiters' as always true
> and reused the high bit of count for has_waiters.
> Would old binaries always trap into the kernel?  (Not sure they will,
> especially in the case where an old binary creates the semaphore; a new
> binary would have to force has_waiters to true in every sem op, but
> even that might not be enough.)

I think that everything will break when binaries linked against the old
and the new libc use the same semaphore. If the new contested bit is
set, the old sem_getvalue() will return garbage, the old sem_trywait()
will fail even if the real count is greater than 0, the old sem_wait()
and sem_timedwait() may spin if the real count is greater than 0, and
the old sem_post() will fail with [EOVERFLOW].

That the "hard" path always issues a system call does not help much,
since the system calls do not write to _count (this is a throughput
optimization, allowing a fast-path thread through while a slow-path
thread is entering or leaving the kernel).
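
For concreteness, the uncontested fast path under such a packed-count
scheme might look roughly like the sketch below, in the style of the
fast path in sem_new.c above. This is only a sketch: SEM_WAITERS_BIT and
the slow-path helper are invented names, and the contested path would
need new umtx operations that understand the packed word.

#define SEM_WAITERS_BIT 0x80000000U     /* invented, not the current ABI */

static int sem_post_packed_slow(sem_t *sem);    /* invented slow path, not shown */

static int
sem_post_packed(sem_t *sem)
{
        uint32_t v;

        for (;;) {
                v = sem->_kern._count;
                if ((v & SEM_WAITERS_BIT) != 0)
                        /* Waiters present: a new umtx op must do the wakeup. */
                        return (sem_post_packed_slow(sem));
                if ((v & ~SEM_WAITERS_BIT) >= SEM_VALUE_MAX)
                        return (EOVERFLOW);
                if (atomic_cmpset_rel_32(&sem->_kern._count, v, v + 1))
                        return (0);
        }
}

The flag and the count then live in a single 32-bit word, so checking
one and incrementing the other is naturally atomic, but only code that
knows about the bit can interpret the word, which is exactly the
incompatibility described above.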

> > Some x86 CPUs may cope with misaligned atomic ops without destroying
> > performance (especially if they do not cross a cache line), so the
> > alignment restriction could be relaxed to make the patch more
> > practical. Many CPUs in the i386 architecture have a 64-bit atomic op
> > (cmpxchg8b) which could be used here.

> > This appears to restore performance of 10-stable uncontested
> > semaphores with the strange alignment to 9-stable levels (a tight loop
> > with sem_wait and sem_post). I have not tested this in any real
> > workload.

> > Index: lib/libc/gen/sem_new.c
> > ===================================================================
> > --- lib/libc/gen/sem_new.c      (revision 269952)
> > +++ lib/libc/gen/sem_new.c      (working copy)
> > @@ -437,6 +437,32 @@ _sem_post(sem_t *sem)
> >         if (sem_check_validity(sem) != 0)
> >                 return (-1);
> >
> > +#ifdef __LP64__
> > +       if (((uintptr_t)&sem->_kern._count & 7) == 0) {
> > +               uint64_t oldval, newval;
> > +
> > +               while (!sem->_kern._has_waiters) {
> > +                       count = sem->_kern._count;
> > +                       if (count + 1 > SEM_VALUE_MAX)
> > +                               return (EOVERFLOW);
> > +                       /*
> > +                        * Expect _count == count and _has_waiters == 0.
> > +                        */
> > +#if BYTE_ORDER == LITTLE_ENDIAN
> > +                       oldval = (uint64_t)count << 32;
> > +                       newval = (uint64_t)(count + 1) << 32;
> > +#elif BYTE_ORDER == BIG_ENDIAN
> > +                       oldval = (uint64_t)count;
> > +                       newval = (uint64_t)(count + 1);
> > +#else
> > +#error Unknown byte order
> > +#endif
> > +                       if (atomic_cmpset_rel_64((uint64_t *)&sem->_kern._count,
> > +                           oldval, newval))
> > +                               return (0);
> > +               }
> > +       }
> > +#endif

> I think this looks ok, but it's a shame we can't fix it more cleanly.

I don't like it at all that you have to either force an exotic alignment
or use unaligned atomic ops.

I saw another interesting sem_* implementation in musl libc. It also
uses a separate waiters word (but an int instead of a boolean), and is
based on the futex calls (which we already have as the wait_uint/wake
umtx calls). It looks rather complicated, but the musl libc people are
usually good at getting this kind of stuff right. I suppose this will
also break subtly when interoperating with an old libc, and adding new
umtx ops would simplify the algorithm.
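
For reference, the general shape of such a scheme might be something
like the sketch below. This is heavily simplified and is not musl's
actual algorithm; the struct, the field names and the helpers are made
up here, and the waiter accounting, cancellation and timeout handling
are glossed over.

#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/umtx.h>
#include <machine/atomic.h>

struct xsem {                           /* illustrative layout only */
        uint32_t        _count;
        uint32_t        _nwaiters;
};

static int
xsem_post(struct xsem *s)
{
        uint32_t c;

        do {
                c = s->_count;
                if (c >= SEM_VALUE_MAX)
                        return (EOVERFLOW);
        } while (!atomic_cmpset_rel_32(&s->_count, c, c + 1));
        /* Enter the kernel only if someone may be asleep on _count. */
        if (s->_nwaiters > 0)
                _umtx_op(&s->_count, UMTX_OP_WAKE, 1, NULL, NULL);
        return (0);
}

static int
xsem_wait(struct xsem *s)
{
        uint32_t c;

        for (;;) {
                c = s->_count;
                if (c > 0) {
                        if (atomic_cmpset_acq_32(&s->_count, c, c - 1))
                                return (0);
                        continue;
                }
                atomic_add_32(&s->_nwaiters, 1);
                /* The kernel sleeps only if _count is still 0 here. */
                _umtx_op(&s->_count, UMTX_OP_WAIT_UINT, 0, NULL, NULL);
                atomic_subtract_32(&s->_nwaiters, 1);
        }
}

The attraction is that an uncontested post or wait never enters the
kernel, and the kernel's recheck of _count in UMTX_OP_WAIT_UINT closes
the window between the userland check and the sleep; the cost is the
extra waiters word, which an old libc knows nothing about.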

Consideration: just declare mixing sufficiently different libc versions
on the same process-shared semaphore unsupported, and change SEM_MAGIC
to enforce that? (This does not prevent running old binaries, as long as
they're dynamically linked to libc and you use a new libc.so.)

Consideration 2: use the new implementation only for process-private
semaphores and keep using the old, slow one for process-shared
semaphores?

-- 
Jilles Tjoelker