From owner-svn-src-all@FreeBSD.ORG  Mon Nov 12 11:05:05 2012
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 0587872D;
 Mon, 12 Nov 2012 11:05:05 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail03.syd.optusnet.com.au (mail03.syd.optusnet.com.au
 [211.29.132.184])
 by mx1.freebsd.org (Postfix) with ESMTP id 1A6538FC08;
 Mon, 12 Nov 2012 11:05:03 +0000 (UTC)
Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au
 (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26])
 by mail03.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qACB4w6S009451
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Mon, 12 Nov 2012 22:04:59 +1100
Date: Mon, 12 Nov 2012 22:04:58 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r242835 - head/contrib/llvm/lib/Target/X86
In-Reply-To: <20121112014417.O1675@besplex.bde.org>
Message-ID: <20121112213445.W1247@besplex.bde.org>
References: <201211091856.qA9IuRxX035169@svn.freebsd.org>
 <509F2AA6.9050509@freebsd.org>
 <20121111214908.P938@besplex.bde.org> <509FB35F.1010801@FreeBSD.org>
 <20121112014417.O1675@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-Cloudmark-Score: 0
X-Optus-Cloudmark-Analysis: v=2.0 cv=I9g936cg c=1 sm=1 a=B4V2Pwk6IZ0A:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=Bex9lGB9SJoA:10
 a=HI2N3_CjzdjmQ36RSUMA:9 a=CjuIK1q_8ugA:10 a=bxQHXO5Py4tHmhUgaywp5w==:117
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
 src-committers@FreeBSD.org, Dimitry Andric <dim@FreeBSD.org>,
 Nathan Whitehorn <nwhitehorn@FreeBSD.org>
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 12 Nov 2012 11:05:05 -0000

On Mon, 12 Nov 2012, Bruce Evans wrote:

> On Sun, 11 Nov 2012, Dimitry Andric wrote:

>> It works just fine now with clang.  For the first example, I get:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-32, %esp
>> 
>> as prolog, and for the second:
>> 
>>        pushl   %ebp
>>        movl    %esp, %ebp
>>        andl    $-16, %esp
>
> Good.
>
> The andl executes very fast.  Perhaps not as fast as subl on %esp,
> because subl is normal so more likely to be optimized (they nominally
> have the same speeds, but %esp is magic).  Unfortunately, it seems to
> be impossible to both align the stack and reserve some space on it in
> 1 instruction -- the andl might not reserve any.
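
To spell out the arithmetic behind "the andl might not reserve any",
here is a small C sketch.  The helper names are made up for
illustration and are not from any real code; it only models what
"andl $-align,%esp" does to the value in %esp.

```c
#include <stdint.h>

/*
 * Hypothetical helper: model "andl $-align,%esp" on a 32-bit stack
 * pointer.  align must be a power of 2, so -align == ~(align - 1).
 */
uint32_t
align_down(uint32_t esp, uint32_t align)
{
	return (esp & ~(align - 1));
}

/*
 * Bytes that the andl happens to reserve below the old %esp.
 * This is 0 when %esp was already aligned, and at most align - 1.
 */
uint32_t
bytes_reserved(uint32_t esp, uint32_t align)
{
	return (esp - align_down(esp, align));
}
```

With %esp ending in 0xec, the andl $-16 moves it down by 12 bytes; with
%esp ending in 0xe0, it moves it by 0.  So a function that needs locals
still needs a separate subl after the andl.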

I lost kib's reply to this.  He agreed that %esp is magic on Intel CPUs
starting with the PentiumPro.

The following quick test shows no problems on Xeon 5650 (freefall) or
Athlon64:

@ asm("					\n\
@ .globl main				\n\
@ main:					\n\
@ 	movl	$266681734,%eax		\n\
@ 	# movl	$201017002,%eax		\n\
@ 1:					\n\
@ 	call	foo1			\n\
@ 	decl	%eax			\n\
@ 	jne	1b			\n\
@ 	ret				\n\
@ 					\n\
@ foo1:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo2			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo2:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo3			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo3:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo4			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo4:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo5			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo5:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo6			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo6:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo7			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo7:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	call	foo8			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ 					\n\
@ foo8:					\n\
@ 	pushl	%ebp			\n\
@ 	movl	%esp,%ebp		\n\
@ 	andl	$-16,%esp		\n\
@ 	# call	foo9			\n\
@ 	movl	%ebp,%esp		\n\
@ 	popl	%ebp			\n\
@ 	ret				\n\
@ ");

Build this on an i386 system so that it runs in 32-bit mode.

This takes 56-57 cycles/iteration on Athlon64 and 50-51 cycles/iteration
on the X5650.  Changing the andls to subls of 16 doesn't change this.
Removing all the andls and subls doesn't change this on Athlon64, but
on the X5650 it is 4-5 cycles faster.  This shows that the gcc
pessimization is largest on the X5650 :-).  Adding "pushl %eax; popl %eax"
before the calls to foo[2-8] adds 35-36 cycles/iteration on Athlon64 but
only 6-7 on the X5650.  I know some Athlons don't optimize pushl/popl well
(maybe when they are close together or near a stack pointer change, as
here).  Apparently Athlon64 is one such.
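
For anyone repeating this: cycles/iteration is just the run time scaled
by the clock frequency and divided by the loop count.  A minimal sketch
of the conversion follows.  The function name is hypothetical, and the
clock frequency is something the caller has to supply; it is not
measured here.

```c
#include <stdint.h>

/*
 * Convert a measured run time to cycles per loop iteration.
 * seconds:    wall-clock (or user) time for the whole run
 * hz:         CPU clock frequency in Hz (assumed known by the caller)
 * iterations: loop count, i.e., the initial value loaded into %eax
 */
double
cycles_per_iteration(double seconds, double hz, uint64_t iterations)
{
	return (seconds * hz / (double)iterations);
}
```

If the loop count is 1/10 of the clock frequency in Hz, then the run
time in seconds times 10 reads directly as cycles/iteration.  For
example, 266681734 iterations taking 5.0 seconds at 2666817340 Hz is
50 cycles/iteration.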

Bruce