Date: Fri, 27 Mar 2015 15:26:17 -0400
From: Eric van Gyzen <vangyzen@FreeBSD.org>
To: current@FreeBSD.org
Subject: SSE in libthr
Message-ID: <5515AED9.8040408@FreeBSD.org>

In a nutshell:

Clang emits SSE instructions on amd64 in the common path of pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd like to disable SSE in libthr.

In more detail:

In libthr/thread/thr_mutex.c, we find the following:

    #define MUTEX_INIT_LINK(m)              do {            \
            (m)->m_qe.tqe_prev = NULL;                      \
            (m)->m_qe.tqe_next = NULL;                      \
    } while (0)

In 9.1, clang 3.1 emits two ordinary mov instructions:

    movq   $0x0,0x8(%rax)
    movq   $0x0,(%rax)

Since 10.0 and clang 3.3, clang emits these SSE instructions:

    xorps  %xmm0,%xmm0
    movups %xmm0,(%rax)

Although these look harmless enough, using the FPU can reduce performance by incurring extra overhead due to context-switching the FPU state.

As I mentioned, this code is used in the common path of pthread_mutex_unlock. I have a simple test program that creates four threads, all contending for a single mutex, and measures the total number of lock acquisitions over several seconds. When libthr is built with SSE, as is current, I get around 53 million locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace shows around 790,000 calls to fpudna versus 10 calls. There could be other factors involved, but I presume that the FPU context switches account for most of the change in performance.

Even when I add some SSE usage in the application--incidentally, these same instructions--building libthr without SSE improves performance from 53.5 million to 55.8 million (4.3%).

In the real-world application where I first noticed this, performance improves by 3-5%.

I would appreciate your thoughts and feedback. The proposed patch is below.

Eric

Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===================================================================
--- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
@@ -1,3 +1,8 @@
 #$FreeBSD$
 
 SRCS+=	_umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse
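
For anyone who wants to try a comparable measurement, a benchmark of the kind described above (several threads contending for one mutex, counting total acquisitions over a fixed interval) could be sketched roughly as follows. This is not the original test program, which was not posted. The names run_benchmark, bench_worker, bench_mutex, bench_stop and BENCH_MAX_THREADS are made up for this illustration, and the stop-flag design is an assumption about how such a test might be structured.

```c
/* Hypothetical sketch of a mutex-contention benchmark; not the original test. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define BENCH_MAX_THREADS 64    /* arbitrary cap chosen for this sketch */

static pthread_mutex_t bench_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool bench_stop;

/*
 * Lock and unlock the shared mutex until told to stop, counting how many
 * times this thread acquired it.  The count is thread-local and is published
 * through *arg just before the thread exits.
 */
static void *
bench_worker(void *arg)
{
    uint64_t *count = arg;
    uint64_t n = 0;

    while (!atomic_load_explicit(&bench_stop, memory_order_relaxed)) {
        pthread_mutex_lock(&bench_mutex);
        n++;
        pthread_mutex_unlock(&bench_mutex);
    }
    *count = n;
    return (NULL);
}

/*
 * Run nthreads workers for roughly `seconds` seconds and return the total
 * number of lock acquisitions.  Returns 0 if nthreads is out of range.
 */
uint64_t
run_benchmark(int nthreads, unsigned int seconds)
{
    pthread_t threads[BENCH_MAX_THREADS];
    uint64_t counts[BENCH_MAX_THREADS];
    uint64_t total = 0;
    int i, started = 0;

    if (nthreads < 1 || nthreads > BENCH_MAX_THREADS)
        return (0);

    atomic_store(&bench_stop, false);
    for (i = 0; i < nthreads; i++) {
        counts[i] = 0;
        if (pthread_create(&threads[i], NULL, bench_worker,
            &counts[i]) != 0)
            break;
        started++;
    }

    sleep(seconds);
    atomic_store(&bench_stop, true);

    /* pthread_join orders each worker's final store before our read. */
    for (i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        total += counts[i];
    }
    return (total);
}
```

Calling run_benchmark(4, 5) from main and printing the result would match the four-thread, five-second setup described above. Absolute numbers will vary with hardware and scheduler, so only a comparison between two libthr builds on the same machine is meaningful.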