Date: Mon, 12 Nov 2012 22:04:58 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
    src-committers@FreeBSD.org, Dimitry Andric <dim@FreeBSD.org>,
    Nathan Whitehorn <nwhitehorn@FreeBSD.org>
Subject: Re: svn commit: r242835 - head/contrib/llvm/lib/Target/X86
Message-ID: <20121112213445.W1247@besplex.bde.org>
In-Reply-To: <20121112014417.O1675@besplex.bde.org>
References: <201211091856.qA9IuRxX035169@svn.freebsd.org> <509F2AA6.9050509@freebsd.org> <20121111214908.P938@besplex.bde.org> <509FB35F.1010801@FreeBSD.org> <20121112014417.O1675@besplex.bde.org>
On Mon, 12 Nov 2012, Bruce Evans wrote:

> On Sun, 11 Nov 2012, Dimitry Andric wrote:
>> It works just fine now with clang.  For the first example, I get:
>>
>>   pushl   %ebp
>>   movl    %esp, %ebp
>>   andl    $-32, %esp
>>
>> as prolog, and for the second:
>>
>>   pushl   %ebp
>>   movl    %esp, %ebp
>>   andl    $-16, %esp
>
> Good.
>
> The andl executes very fast.  Perhaps not as fast as subl on %esp,
> because subl is normal so more likely to be optimized (they nominally
> have the same speeds, but %esp is magic).  Unfortunately, it seems to
> be impossible to both align the stack and reserve some space on it in
> 1 instruction -- the andl might not reserve any.

I lost kib's reply to this.  He said something agreeing about %esp
being magic on Intel CPUs starting with PentiumPro.

The following quick test shows no problems on Xeon X5650 (freefall) or
Athlon64:

@ asm(" \n\
@ .globl main \n\
@ main: \n\
@ 	movl $266681734,%eax \n\
@ #	movl $201017002,%eax \n\
@ 1: \n\
@ 	call foo1 \n\
@ 	decl %eax \n\
@ 	jne 1b \n\
@ 	ret \n\
@ \n\
@ foo1: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo2 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo2: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo3 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo3: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo4 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo4: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo5 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo5: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo6 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo6: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo7 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo7: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ 	call foo8 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ \n\
@ foo8: \n\
@ 	pushl %ebp \n\
@ 	movl %esp,%ebp \n\
@ 	andl $-16,%esp \n\
@ #	call foo9 \n\
@ 	movl %ebp,%esp \n\
@ 	popl %ebp \n\
@ 	ret \n\
@ ");

Build this on an i386 system so that it is built in 32-bit mode.

This takes 56-57 cycles/iteration on Athlon64 and 50-51 cycles/iteration
on X5650.  Changing the andls to subls of 16 doesn't change this.
Removing all the andls and subls doesn't change this on Athlon64, but
on X5650 it is 4-5 cycles faster.  This shows that the gcc pessimization
is largest on X5650 :-).

Adding "pushl %eax; popl %eax" before the calls to foo[2-8] adds 35-36
cycles/iteration on Athlon64 but only 6-7 on X5650.  I know some Athlons
don't optimize pushl/popl well (maybe when they are close together or
near a stack pointer change as here).  Apparently Athlon64 is one such.

Bruce