Date: Wed, 29 Feb 2012 17:03:12 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: "Thomas D. Dean" <tomdean@speakeasy.org>
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: Gcc46 and 128 Bit Floating Point
Message-ID: <20120229161408.G2514@besplex.bde.org>
In-Reply-To: <4F4DA398.6070703@speakeasy.org>
References: <4F3EA37F.9010207@speakeasy.org> <CAGE5yCpvF0-b1iKAVGbya=fUNaYbGyrpj1PHSQxw4BvycNMLDg@mail.gmail.com> <4F3EC0B4.6050107@speakeasy.org> <4F4DA398.6070703@speakeasy.org>

On Tue, 28 Feb 2012, Thomas D. Dean wrote:

> On 02/17/12 13:03, Thomas D. Dean wrote:
> I have been reading the Core-i7 developers manual and looking at libm. I have
> been trying to shoe horn some calculations between the sizes of fpu
> instructions and libgmp.
>
> I think there is little support for 128-bit floating point in the Core-i7
> 3930K CPU.

That is true.  libm doesn't try to support it at all, except on sparc64,
though most of it would work (as for sparc64) with correct headers.
gcc46's libraries might work, but I would expect problems outside of
libm, starting with printf.

But why would you want it?  It is essentially unusable on sparc64, since
it is several thousand times slower than 80-bit floating point on i386.
At equal CPU clock speeds, it is only about 1000 times slower.  Most of
the factors of 10 are due to the fundamental slowness of multi-word
arithmetic in software and to the soft-float implementations not being
very good.  (I only tested with the old NetBSD/4.4BSD-derived one.  This
has been replaced by the Hauser one, which has a good chance of being
worse due to its greater generality and correctness, although the old
one has a lot of slop that could be improved.)  A modern x86 is much
faster than an old sparc64, giving about another factor of 10.  64-bit
operations are only about 10 times slower (or more like 3 times slower
at equal CPU clock speeds) on an old sparc64 than on a not-so-modern
core2 x86.

The gnu libraries might be better.  So you could hope for only a factor
of 100 slowdown on scalar code.  But modern x86's can also do vector
code, and thus be up to 8 times faster for 32-bit floating point with
AVX.  Really good multi-word libraries might be able to exploit some
vector operations, but I think multi-word operations are too serial in
nature to get much parallelism with them.

> The code which uses __float128 implements functions in software and use the
> 80-bit fpu instructions to assist.
>
> I believe there is some speed improvement with the 128-bit registers. But, I
> can find no floating point instructions that operate on 128-bit floating
> point, like there is for 80-bit.

AVX and below have none for 128 bits.  They only have 32-bit and 64-bit
ones done in parallel (4 32-bit ones or 2 64-bit ones with SSE, or twice
that with AVX).  Emulating 128-bit ones in software then takes 10-1000
times as long as the hardware 64-bit or 80-bit ones.  (80-bit ones on
x86 generally have identical speeds to 64-bit and 32-bit ones, but are
not so parallelizable.)

> The bottom line seems to be little gain in floating point operations with the
> core-i7 CPU.

Expect a loss in speed of up to 1000 times for 128 bits.

Modern x86 wins mainly by better parallelism and scheduling.  Other
things haven't changed much since Athlon-XP in 2001:
- the clock speed got stuck at 2-4GHz
- instructions issued per cycle got stuck at about 3 (2 FP adds or muls,
  plus a useful integer operation and/or load/store).  Maybe slightly
  more with i7.  But parallelism has increased by up to a factor of 4 --
  these instructions can now be 4 64-bit ones in a vector every cycle
  instead of 2 64-bit ones in a vector every 2 cycles
- latency for add/mul decreased from 4 cycles to 3 or maybe 2.

> #include <quadmath.h>
> #include <stdio.h>
> int main() {
>   char buf[128];
>   __float128 x = sqrtq(2.0Q);
>   quadmath_snprintf(buf, sizeof buf, "%.45Qf",x);
>   printf("sin(%s) = ",buf);
>   quadmath_snprintf(buf, sizeof buf, "%.45Qf",sinq(x));
>   printf("%s\n",buf);
>   return 0;
> }
>
> gcc46 math.c -o math /usr/local/lib/gcc46/libquadmath.a /usr/lib/libm.a

I don't know the gcc library.  The above has a chance of working, but
it's painful to write when you can't use ordinary printf() directly.

> Looking at the output of objdump -d math shows software implementation of
> sqrtq() and sinq().  gcc46 does use the fsqrt instruction but not fsin.

It doesn't use fsqrt according to Steve Kargl.
Neither fsqrt nor fsin would work and neither should ever be used, since
they are old, slow 80-bit i387 instructions which are apparently
emulated in slow microcode on all modern x86.  Software can beat them by
a little for speed up to double precision and by a lot for accuracy in
all precisions.  Software has a harder time being fast for 80 and 128
bits, even if the basic operations are fast.  But 80-bit hardware
versions of them are no help for the 128-bit software versions.

Bruce
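
To illustrate the point above about multi-word operations being serial,
here is a minimal sketch of "double-double" addition, where a value is
held as an unevaluated sum of two 64-bit doubles.  This is not what
libquadmath or the soft-float code does (those emulate the IEEE 128-bit
format with integer arithmetic), and the names dd, two_sum and dd_add
are made up for the example.  It assumes IEEE double arithmetic without
extra intermediate precision (the default with SSE2 on amd64) and no
-ffast-math.

```c
/* Hypothetical type: the value is the unevaluated sum hi + lo. */
typedef struct { double hi, lo; } dd;

/*
 * Knuth's TwoSum: hi = fl(a + b), lo = the rounding error of that
 * addition, so a + b == hi + lo exactly (barring overflow).
 * Every line needs the result of the line before it.
 */
static dd two_sum(double a, double b)
{
    dd r;
    double bb;

    r.hi = a + b;
    bb = r.hi - a;                        /* the part of b that made it into hi */
    r.lo = (a - (r.hi - bb)) + (b - bb);  /* what was rounded away */
    return r;
}

/*
 * Simple ("sloppy") double-double addition: add the high words
 * exactly, fold in the low words, then renormalize.  Not a
 * full-accuracy algorithm for all inputs.
 */
static dd dd_add(dd x, dd y)
{
    dd s = two_sum(x.hi, y.hi);
    double lo = s.lo + (x.lo + y.lo);
    dd r;

    r.hi = s.hi + lo;            /* renormalize, assuming |s.hi| >= |lo| */
    r.lo = lo - (r.hi - s.hi);
    return r;
}
```

One dd_add() costs 11 hardware additions and subtractions, and most of
them wait on the result of the one before, so the extra issue width and
vector width of a modern x86 help little with a single such operation.
Adding {1.0, 0.0} and {0x1p-80, 0.0} gives hi = 1.0 and lo = 0x1p-80,
where a plain double addition would drop the small term.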
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120229161408.G2514>