Date: Fri, 17 Mar 2006 08:30:23 +1100 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Peter Wemm <peter@wemm.org>
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: amd64 slower than i386 on identical AMD 64 system? / How is hyperthreading handled on amd64?
Message-ID: <20060317064348.C48622@delplex.bde.org>
In-Reply-To: <200603160917.30225.peter@wemm.org>
References: <20060313221836.5491916A420@hub.freebsd.org> <200603151356.27972.peter@wemm.org> <200603160747.00051.joao@matik.com.br> <200603160917.30225.peter@wemm.org>

On Thu, 16 Mar 2006, Peter Wemm wrote:

> There are a number of weaknesses in the amd64 port too. In particular,
> the math library does not yet use the generally superior SSE2
> instructions. This is a real setback because the ABI uses SSE2
> floating point parameter passing. The effect is that some random libm
> function is given a SSE2 register, which we convert to an x87 fp stack
> register, do the x87 operation, then convert the x87 stack register
> back to a SSE2 register then return the SSE2 result. This is
> especially unfortunate when the native SSE2 instruction that would
> operate on the SSE2 registers directly is faster. But, I don't know
> SSE2 nor x87 fpu assembler code very well, so I've done "just enough"
> to get things to work.

Actually, the math library just uses SSE2 (except for long doubles, when
SSE2 can't be used), and anyway SSE2 is only slightly faster than the FPU
for code with scalar interfaces like the math library.

The "just uses" part is due to gcc. It just uses SSE2 instructions by
default on amd64.

SSE2 is only slightly faster because most scalar floating point
operations have the same execution latency and throughput as for the FPU.
SSE2's advantage on scalar code comes mainly from having more directly
accessible registers (16 xmm registers on amd64, instead of 8 FPU
registers, of which sometimes only the 1 at the top of the stack is
directly accessible). This advantage is often small because the extra
moves to access registers can be done in parallel with other operations.
Note that this parallelism often occurs automatically due to (out of
order instruction) scheduling in the CPU. Execution latency is very large
(e.g., 4 cycles for each of add and mul) compared with execution
throughput (e.g., 1 cycle for an add and a mul) so there are usually
plenty of spare pipeline slots for executing the moves in parallel.
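The latency/throughput gap described above can be made concrete with a small C sketch. It is an illustration only, not code from libm, and the function names are hypothetical. In the first loop every add waits for the previous one, so with a 4-cycle add latency roughly three of every four issue slots go unused; that idle time is where extra register moves can hide. The second loop fills those slots with independent adds instead.

```c
#include <assert.h>
#include <stddef.h>

/* One accumulator: each add depends on the result of the previous add,
 * so the loop is limited by add latency, not add throughput. */
double sum_chained(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators: the four adds in the loop body do not
 * depend on each other, so an out-of-order CPU can overlap them. */
double sum_interleaved(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

The two functions add the elements in a different order, so for general inputs they may differ in the last bits; they agree exactly when every partial sum is exactly representable, as with small integers. Whether the second version is actually faster depends on the CPU and compiler and would need to be measured.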

My benchmarks in libm indicate that 64-bitness + SSE2 end up being a tiny
improvement for single precision and a significant improvement for double
and long double precision (even for long double where SSE2 cannot be
used!), but this is only for versions that don't use the FPU for
transcendental functions, and I think it is mainly from foot shooting in
the 32-bit versions. The improvement in double precision is needed to be
competitive with the hardware transcendental functions, and the foot
shooting is from heavy use of the GET/SET macros -- these macros force
things to memory and thus tend to cause pipeline stalls.

Bruce
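
For readers unfamiliar with the GET/SET macros mentioned above, the following is a minimal sketch of the general idea, assuming IEEE 754 binary64 doubles. It is not the actual libm definition; `SKETCH_GET_HIGH_WORD` and `sketch_exponent` are hypothetical names. The macro reinterprets the bits of a double as integers. The x87 FPU has no instruction that moves a value straight from an FPU register to a general-purpose register, so on a 32-bit x87 target this kind of access has to store the double to memory and load part of it back as an integer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a GET-style macro: copy the bits of the
 * double d into an integer and keep the upper 32 bits (sign, exponent,
 * and the top 20 mantissa bits). Assumes IEEE 754 binary64. */
#define SKETCH_GET_HIGH_WORD(i, d)                      \
    do {                                                \
        double sk_d = (d);                              \
        uint64_t sk_bits;                               \
        memcpy(&sk_bits, &sk_d, sizeof sk_bits);        \
        (i) = (uint32_t)(sk_bits >> 32);                \
    } while (0)

/* Typical use: classify or scale an argument by looking at its exponent
 * field. Returns the unbiased exponent of a finite, normal double. */
int sketch_exponent(double d)
{
    uint32_t hi;
    SKETCH_GET_HIGH_WORD(hi, d);
    int biased = (int)((hi >> 20) & 0x7ff);
    assert(biased != 0 && biased != 0x7ff); /* not zero/subnormal/inf/NaN */
    return biased - 1023;
}
```

For example, `sketch_exponent(8.0)` yields 3. On amd64 a compiler can often implement the same access as a direct move from an xmm register to an integer register, which may be part of why the 64-bit versions suffer less, though that is an inference rather than something the benchmarks above establish.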
