Date: Sat, 28 Mar 2015 21:54:08 +0800
From: Julian Elischer <julian@freebsd.org>
To: freebsd-current@freebsd.org
Subject: Re: SSE in libthr
Message-ID: <5516B280.6060002@freebsd.org>
In-Reply-To: <20150327214452.GR2379@kib.kiev.ua>
References: <5515AED9.8040408@FreeBSD.org> <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com> <20150327214452.GR2379@kib.kiev.ua>

On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd
>>> like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>> #define MUTEX_INIT_LINK(m) do { \
>>>         (m)->m_qe.tqe_prev = NULL; \
>>>         (m)->m_qe.tqe_next = NULL; \
>>> } while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>>         movq $0x0,0x8(%rax)
>>>         movq $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>>         xorps %xmm0,%xmm0
>>>         movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce performance by
>>> incurring extra overhead due to context-switching the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of pthread_mutex_unlock. I
>>> have a simple test program that creates four threads, all contending for a
>>> single mutex, and measures the total number of lock acquisitions over several
>>> seconds. When libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace
>>> shows around 790,000 calls to fpudna versus 10 calls. There could be other
>>> factors involved, but I presume that the FPU context switches account for most
>>> of the change in performance.
>>>
>>> Even when I add some SSE usage in the application--incidentally, these same
>>> instructions--building libthr without SSE improves performance from 53.5 million
>>> to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this, performance improves
>>> by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback. The proposed patch is below.
>>>
>>> Eric
>>>
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
>>> @@ -1,3 +1,8 @@
>>>  #$FreeBSD$
>>>
>>>  SRCS+= _umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if possible. How about:
>>
>> CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure, should compiler-rt be included into the set ?

the point is that clang will do this anywhere it can, because it isn't
taking into account the side effects, just the speed of the commands
themselves.

> _______________________________________________
> freebsd-current@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-current
> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
>
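
For anyone who wants to try the measurement themselves, here is a minimal
sketch of the kind of test Eric describes: several threads contending for a
single mutex, with the total number of lock acquisitions counted over a fixed
interval. This is not Eric's program, which was not posted. The names
run_contention, worker, lock, stop and total, and the cap of 64 threads, are
made up for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical harness; illustrative only, not the test from the thread. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;	/* set by the controller to end the run */
static unsigned long total;	/* acquisition count, protected by lock */

static void *
worker(void *arg)
{
	(void)arg;
	/* Each iteration takes and releases the one shared mutex. */
	while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
		pthread_mutex_lock(&lock);
		total++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/*
 * Run nthreads contending threads for roughly millis milliseconds and
 * return the total number of lock acquisitions (0 on bad arguments).
 */
unsigned long
run_contention(int nthreads, long millis)
{
	pthread_t tids[64];
	struct timespec ts = { millis / 1000, (millis % 1000) * 1000000L };
	int i, started = 0;

	if (nthreads < 1 || nthreads > 64 || millis < 0)
		return (0);
	total = 0;
	atomic_store(&stop, false);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tids[i], NULL, worker, NULL) != 0)
			break;
		started++;
	}
	nanosleep(&ts, NULL);		/* let the workers fight over the lock */
	atomic_store(&stop, true);
	for (i = 0; i < started; i++)
		pthread_join(tids[i], NULL);
	return (total);			/* safe to read: all workers have exited */
}
```

Calling run_contention(4, 5000) from a small main() and printing the result
would approximate the four-thread, five-second run quoted above. Build with
-pthread, and compare the count with libthr built with and without -mno-sse.
The absolute numbers will depend on the machine, so only the relative
difference is meaningful.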