From owner-freebsd-current@FreeBSD.ORG  Sat Mar 28 13:54:21 2015
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id 019AE99F
 for <freebsd-current@freebsd.org>; Sat, 28 Mar 2015 13:54:20 +0000 (UTC)
Received: from vps1.elischer.org (vps1.elischer.org [204.109.63.16])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "vps1.elischer.org",
 Issuer "CA Cert Signing Authority" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id D10CC6C6
 for <freebsd-current@freebsd.org>; Sat, 28 Mar 2015 13:54:20 +0000 (UTC)
Received: from Julian-MBP3.local
 (ppp121-45-255-201.lns20.per4.internode.on.net [121.45.255.201])
 (authenticated bits=0)
 by vps1.elischer.org (8.14.9/8.14.9) with ESMTP id t2SDsE57002354
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO)
 for <freebsd-current@freebsd.org>; Sat, 28 Mar 2015 06:54:19 -0700 (PDT)
 (envelope-from julian@freebsd.org)
Message-ID: <5516B280.6060002@freebsd.org>
Date: Sat, 28 Mar 2015 21:54:08 +0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10;
 rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: freebsd-current@freebsd.org
Subject: Re: SSE in libthr
References: <5515AED9.8040408@FreeBSD.org>
 <3A96AAEC-9C1C-444E-9A73-3CD2AED33116@me.com>
 <20150327214452.GR2379@kib.kiev.ua>
In-Reply-To: <20150327214452.GR2379@kib.kiev.ua>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
 <freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-current>, 
 <mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current/>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
 <mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 28 Mar 2015 13:54:21 -0000

On 3/28/15 5:44 AM, Konstantin Belousov wrote:
> On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
>> On Mar 27, 2015, at 12:26, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>>> In a nutshell:
>>>
>>> Clang emits SSE instructions on amd64 in the common path of
>>> pthread_mutex_unlock.  This reduces performance by a non-trivial amount.  I'd
>>> like to disable SSE in libthr.
>>>
>>> In more detail:
>>>
>>> In libthr/thread/thr_mutex.c, we find the following:
>>>
>>> 	#define MUTEX_INIT_LINK(m)              do {            \
>>> 	        (m)->m_qe.tqe_prev = NULL;                      \
>>> 	        (m)->m_qe.tqe_next = NULL;                      \
>>> 	} while (0)
>>>
>>> In 9.1, clang 3.1 emits two ordinary mov instructions:
>>>
>>> 	movq   $0x0,0x8(%rax)
>>> 	movq   $0x0,(%rax)
>>>
>>> Since 10.0 and clang 3.3, clang emits these SSE instructions:
>>>
>>> 	xorps  %xmm0,%xmm0
>>> 	movups %xmm0,(%rax)
>>>
>>> Although these look harmless enough, using the FPU can reduce performance by
>>> incurring extra overhead due to context-switching the FPU state.
>>>
>>> As I mentioned, this code is used in the common path of pthread_mutex_unlock.  I
>>> have a simple test program that creates four threads, all contending for a
>>> single mutex, and measures the total number of lock acquisitions over several
>>> seconds.  When libthr is built with SSE, as is current, I get around 53 million
>>> locks in 5 seconds.  Without SSE, I get around 60 million (13% more).  DTrace
>>> shows around 790,000 calls to fpudna versus 10 calls.  There could be other
>>> factors involved, but I presume that the FPU context switches account for most
>>> of the change in performance.
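
For anyone who wants to try to reproduce this, here is a minimal sketch of
the kind of test described above. It is not Eric's actual program: the names
run_contention, worker_main, struct worker and MAX_THREADS are made up for
illustration, and the thread count and duration are simply parameters.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

#define MAX_THREADS 64		/* arbitrary cap for this sketch */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;

/* Hypothetical per-thread record: thread id plus its acquisition count. */
struct worker {
	pthread_t tid;
	uint64_t  count;
};

/* Lock and unlock the shared mutex in a tight loop until told to stop. */
static void *
worker_main(void *arg)
{
	struct worker *w = arg;
	uint64_t n = 0;

	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&lock);
		n++;
		pthread_mutex_unlock(&lock);
	}
	w->count = n;
	return (NULL);
}

/*
 * Run nthreads workers contending for one mutex for the given number of
 * seconds and return the total number of lock acquisitions.
 * Returns 0 if nthreads is out of range.
 */
uint64_t
run_contention(int nthreads, unsigned int seconds)
{
	struct worker w[MAX_THREADS];
	uint64_t total = 0;
	int i, started = 0;

	if (nthreads < 1 || nthreads > MAX_THREADS)
		return (0);
	atomic_store(&stop, false);
	for (i = 0; i < nthreads; i++) {
		w[i].count = 0;
		if (pthread_create(&w[i].tid, NULL, worker_main, &w[i]) != 0)
			break;
		started++;
	}
	sleep(seconds);
	atomic_store(&stop, true);
	/* Only join the threads that were actually created. */
	for (i = 0; i < started; i++) {
		pthread_join(w[i].tid, NULL);
		total += w[i].count;
	}
	return (total);
}
```

Calling run_contention(4, 5) from main() and printing the result would
correspond to the four-thread, five-second run quoted above. Build it with
-pthread, and run it against libthr built with and without -mno-sse to
compare the totals.
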
>>>
>>> Even when I add some SSE usage in the application--incidentally, these same
>>> instructions--building libthr without SSE improves performance from 53.5 million
>>> to 55.8 million (4.3%).
>>>
>>> In the real-world application where I first noticed this, performance improves
>>> by 3-5%.
>>>
>>> I would appreciate your thoughts and feedback.  The proposed patch is below.
>>>
>>> Eric
>>>
>>>
>>>
>>> Index: base/head/lib/libthr/arch/amd64/Makefile.inc
>>> ===================================================================
>>> --- base/head/lib/libthr/arch/amd64/Makefile.inc	(revision 280703)
>>> +++ base/head/lib/libthr/arch/amd64/Makefile.inc	(working copy)
>>> @@ -1,3 +1,8 @@
>>> #$FreeBSD$
>>>
>>> SRCS+=	_umtx_op_err.S
>>> +
>>> +# Using SSE incurs extra overhead per context switch,
>>> +# which measurably impacts performance when the application
>>> +# does not otherwise use FP/SSE.
>>> +CFLAGS+=-mno-sse
>> Good catch!
>>
>> Regarding your patch, I think we should disable even more, if possible.  How about:
>>
>> CFLAGS+=        -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
> I think so.
>
> Also, this should be done for libc as well, both on i386 and amd64.
> I am not sure whether compiler-rt should be included in the set.
The point is that clang will do this anywhere it can, because it only
considers the speed of the instructions themselves and does not take
side effects, such as the extra FPU context-switch overhead, into account.
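
To check what a given compiler does with this pattern, a small stand-alone
file is enough. The sketch below repeats the two stores from MUTEX_INIT_LINK
on a made-up struct; struct link and init_link are hypothetical names, not
anything from libthr, and the instructions actually chosen will depend on the
compiler version and flags.

```c
#include <stddef.h>

/* Hypothetical stand-in for the tail-queue link embedded in a mutex. */
struct link {
	struct link  *tqe_next;
	struct link **tqe_prev;
};

/* Same two stores as MUTEX_INIT_LINK: two adjacent pointers set to NULL. */
void
init_link(struct link *l)
{
	l->tqe_prev = NULL;
	l->tqe_next = NULL;
}
```

Compiling this with "cc -O2 -S" and again with "cc -O2 -mno-sse -S", then
comparing the assembly for init_link, shows whether the two stores are merged
into a single SSE store or kept as ordinary integer moves.
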
