Date:      Mon, 12 Nov 2012 22:04:58 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Dimitry Andric <dim@FreeBSD.org>, Nathan Whitehorn <nwhitehorn@FreeBSD.org>
Subject:   Re: svn commit: r242835 - head/contrib/llvm/lib/Target/X86
Message-ID:  <20121112213445.W1247@besplex.bde.org>
In-Reply-To: <20121112014417.O1675@besplex.bde.org>
References:  <201211091856.qA9IuRxX035169@svn.freebsd.org> <509F2AA6.9050509@freebsd.org> <20121111214908.P938@besplex.bde.org> <509FB35F.1010801@FreeBSD.org> <20121112014417.O1675@besplex.bde.org>

On Mon, 12 Nov 2012, Bruce Evans wrote:

> On Sun, 11 Nov 2012, Dimitry Andric wrote:

>> It works just fine now with clang.  For the first example, I get:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-32, %esp
>> 
>> as prolog, and for the second:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-16, %esp
>
> Good.
>
> The andl executes very fast.  Perhaps not as fast as subl on %esp,
> because subl is normal so more likely to be optimized (they nominally
> have the same speeds, but %esp is magic).  Unfortunately, it seems to
> be impossible to both align the stack and reserve some space on it in
> 1 instruction -- the andl might not reserve any.

I lost kib's reply to this.  He said something agreeing that %esp is
magic on Intel CPUs starting with the PentiumPro.

The following quick test shows no problems on Xeon X5650 (freefall) or
Athlon64:

@ asm("					\n\
@ .globl main				\n\
@ main:					\n\
@ 	movl	$266681734,%eax		\n\
@ 	# movl	$201017002,%eax		\n\
@ 1:					\n\
@ 	call	foo1			\n\
@ 	decl	%eax			\n\
@ 	jne	1b			\n\
@ 	ret				\n\
@ 					\n\
@ foo1:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo2			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo2:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo3			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo3:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo4			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo4:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo5			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo5:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo6			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo6:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo7			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo7:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo8			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo8:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	# call	foo9			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ ");

Build this on an i386 system so that it runs in 32-bit mode.

This takes 56-57 cycles/iteration on Athlon64 and 50-51 cycles/iteration
on the X5650.  Changing the andls to subls of 16 doesn't change this.
Removing all the andls and subls doesn't change this on Athlon64, but
on the X5650 it is 4-5 cycles faster.  This shows that the gcc
pessimization is largest on the X5650 :-).  Adding "pushl %eax; popl
%eax" before the calls to foo[2-8] adds 35-36 cycles/iteration on
Athlon64 but only 6-7 on the X5650.  I know some Athlons don't optimize
pushl/popl well (maybe when they are close together or near a stack
pointer change, as here).  Apparently Athlon64 is one such.

Bruce


