Date: Fri, 27 Mar 2015 13:49:03 -0700
From: Rui Paulo <rpaulo@me.com>
To: Eric van Gyzen <vangyzen@FreeBSD.org>
Cc: current@FreeBSD.org
Subject: Re: SSE in libthr
Message-ID: <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com>
In-Reply-To: <5515AED9.8040408@FreeBSD.org>
References: <5515AED9.8040408@FreeBSD.org>

On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>
> In a nutshell:
>
> Clang emits SSE instructions on amd64 in the common path of
> pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
> like to disable SSE in libthr.
>
> In more detail:
>
> In libthr/thread/thr_mutex.c, we find the following:
>
>     #define MUTEX_INIT_LINK(m)      do {            \
>             (m)->m_qe.tqe_prev = NULL;              \
>             (m)->m_qe.tqe_next = NULL;              \
>     } while (0)
>
> In 9.1, clang 3.1 emits two ordinary mov instructions:
>
>     movq   $0x0,0x8(%rax)
>     movq   $0x0,(%rax)
>
> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>
>     xorps  %xmm0,%xmm0
>     movups %xmm0,(%rax)
>
> Although these look harmless enough, using the FPU can reduce performance by
> incurring extra overhead due to context-switching the FPU state.
>
> As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
> have a simple test program that creates four threads, all contending for a
> single mutex, and measures the total number of lock acquisitions over several
> seconds.  When libthr is built with SSE, as is current, I get around 53 million
> locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
> shows around 790,000 calls to fpudna versus 10 calls.  There could be other
> factors involved, but I presume that the FPU context switches account for most
> of the change in performance.
>
> Even when I add some SSE usage in the application--incidentally, these same
> instructions--building libthr without SSE improves performance from 53.5 million
> to 55.8 million (4.3%).
>
> In the real-world application where I first noticed this, performance improves
> by 3-5%.
>
> I would appreciate your thoughts and feedback.  The proposed patch is below.
>
> Eric
>
>
>
> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
> ===================================================================
> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
> @@ -1,3 +1,8 @@
>  #$FreeBSD$
>
>  SRCS+=	_umtx_op_err.S
> +
> +# Using SSE incurs extra overhead per context switch,
> +# which measurably impacts performance when the application
> +# does not otherwise use FP/SSE.
> +CFLAGS+=-mno-sse

Good catch!

Regarding your patch, I think we should disable even more, if possible.  How about:

CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3

--
Rui Paulo
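
For anyone who wants to reproduce the measurement, the test program described in the quoted message (four threads contending for a single mutex, counting lock acquisitions over a fixed interval) might look roughly like the sketch below. This is not Eric's actual program, which was not posted in this message; the names `worker`, `count_lock_acquisitions`, and `NTHREADS` are made up for illustration, and the stop-flag design is an assumption.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

#define NTHREADS 4	/* four contending threads, as in the description */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;		/* set once the interval has elapsed */
static unsigned long acquisitions;	/* protected by lock */

/* Hypothetical worker: lock, count, unlock until told to stop. */
static void *
worker(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&lock);
		acquisitions++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/*
 * Run NTHREADS workers against one mutex for `seconds` seconds and
 * return the total number of lock acquisitions they completed.
 */
unsigned long
count_lock_acquisitions(unsigned int seconds)
{
	pthread_t tid[NTHREADS];
	int i, rc;

	/* No workers are running yet, so plain stores are safe here. */
	atomic_store(&stop, false);
	acquisitions = 0;

	for (i = 0; i < NTHREADS; i++) {
		rc = pthread_create(&tid[i], NULL, worker, NULL);
		assert(rc == 0);
		(void)rc;
	}

	sleep(seconds);
	atomic_store(&stop, true);

	/* Joining every worker makes the final count safe to read. */
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	return (acquisitions);
}
```

Built with something like `cc -O2 -pthread`, a `main` that prints `count_lock_acquisitions(5)` could be run once against a stock libthr and once against a libthr built with the proposed CFLAGS. Absolute numbers will vary with hardware and scheduler behavior, so only the relative difference is meaningful.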