From: Julian Elischer
Date: Sat, 28 Mar 2015 21:54:08 +0800
To: freebsd-current@freebsd.org
Subject: Re: SSE in libthr
Message-ID: <5516B280.6060002@freebsd.org>
References: <5515AED9.8040408@FreeBSD.org> <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com> <20150327214452.GR2379@kib.kiev.ua>
In-Reply-To: <20150327214452.GR2379@kib.kiev.ua>

On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26,
Eric van Gyzen wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
>>> like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>>     #define MUTEX_INIT_LINK(m)      do {            \
>>>             (m)->m_qe.tqe_prev = NULL;              \
>>>             (m)->m_qe.tqe_next = NULL;              \
>>>     } while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>>     movq   $0x0,0x8(%rax)
>>>     movq   $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>>     xorps  %xmm0,%xmm0
>>>     movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce performance by
>>> incurring extra overhead due to context-switching the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
>>> have a simple test program that creates four threads, all contending for a
>>> single mutex, and measures the total number of lock acquisitions over several
>>> seconds.  When libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
>>> shows around 790,000 calls to fpudna versus 10 calls.  There could be other
>>> factors involved, but I presume that the FPU context switches account for most
>>> of the change in performance.
>>>
>>> Even when I add some SSE usage in the application--incidentally, these same
>>> instructions--building libthr without SSE improves performance from 53.5 million
>>> to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this, performance improves
>>> by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback.  The proposed patch is below.
>>>
>>> Eric
>>>
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc  (revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc  (working copy)
>>> @@ -1,3 +1,8 @@
>>>  #$FreeBSD$
>>>
>>>  SRCS+= _umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if possible.  How about:
>>
>> CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure, should compiler-rt be included into the set?

The point is that clang will do this anywhere it can, because it only
considers the speed of the instructions themselves, not side effects
such as the extra FPU context-switch overhead.
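
For anyone who wants to look at the code generation in isolation, here is
a minimal standalone sketch of the pattern being discussed.  It is not the
libthr source: struct demo_mutex, struct demo_link and demo_unlink() are
hypothetical stand-ins, and only the macro body follows the snippet quoted
from thr_mutex.c.  The idea it illustrates is that the two link pointers
sit next to each other, so on an LP64 target such as amd64 they cover 16
contiguous bytes, which is what allows a compiler to clear both with a
single 16-byte SSE store instead of two 8-byte moves.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* NULL, offsetof */

struct demo_mutex;

/*
 * Hypothetical stand-in for the queue linkage embedded in a mutex:
 * two adjacent pointers, "next" first and "prev" second.
 */
struct demo_link {
    struct demo_mutex  *tqe_next;
    struct demo_mutex **tqe_prev;
};

/* Hypothetical, heavily simplified mutex; not the real libthr layout. */
struct demo_mutex {
    int              m_owner;   /* illustrative field only */
    struct demo_link m_qe;
};

/* Same shape as the macro quoted from thr_mutex.c. */
#define MUTEX_INIT_LINK(m) do {         \
    (m)->m_qe.tqe_prev = NULL;          \
    (m)->m_qe.tqe_next = NULL;          \
} while (0)

/*
 * The two stores target back-to-back pointer slots.  This adjacency is
 * what lets an optimizer merge them into one wider store.
 */
static_assert(offsetof(struct demo_link, tqe_prev) ==
    offsetof(struct demo_link, tqe_next) + sizeof(void *),
    "link pointers are expected to be adjacent");

/* Clear the queue linkage of a mutex, leaving other fields untouched. */
void
demo_unlink(struct demo_mutex *m)
{
    MUTEX_INIT_LINK(m);
}
```

Compiling this file to assembly with optimization enabled, once with and
once without -mno-sse, and comparing the body of demo_unlink() should show
whether a given compiler merges the two stores.  Whether the merged SSE
store actually appears depends on the compiler version and optimization
level, so treat the result as an observation about one toolchain rather
than a guarantee.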