Date: Tue, 14 Aug 2012 20:09:39 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Stephen Montgomery-Smith <stephen@missouri.edu> Cc: freebsd-numerics@freebsd.org Subject: Re: Complex arg-trig functions Message-ID: <20120814173931.V934@besplex.bde.org> In-Reply-To: <50297468.20902@missouri.edu> References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu> <20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu> <20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu> <20120805175106.X3574@besplex.bde.org> <501EC015.3000808@missouri.edu> <20120805191954.GA50379@troutmask.apl.washington.edu> <20120807205725.GA10572@server.rulingia.com> <20120809025220.N4114@besplex.bde.org> <5027F07E.9060409@missouri.edu> <20120814003614.H3692@besplex.bde.org> <50295887.2010608@missouri.edu> <20120814055931.Q4897@besplex.bde.org> <50297468.20902@missouri.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 13 Aug 2012, Stephen Montgomery-Smith wrote: > On 08/13/2012 04:14 PM, Bruce Evans wrote: > ... >> 2. Division tends to be slow, and is slow on x86. On Phenom, division >> has a latency of 20 cycles and a throughput of 1 per 15 cycles in >> scalar double precision (17 in vector[2] double precision). Maybe >> the compiler can optimize division by a power of 2 to a multiplication >> though. > > Does the time taken to perform certain floating point operations depend upon > what the numbers are? I know that when I do long multiplication by hand that > (1) it is much faster than long division, and (2) certain numbers (like 1000) > are very quick to multiply and divide. On x86, mostly not. Denormals can be slow in general, but I've never seen then causing significant slowness on x86. I've only seen underflow (and overflow?) causing signficant slowness. The easy underflowing cases for exp*() run about 4 times slower than the normal case although they only execute 1 underflowing instruction instead of 50-100 normal instructions after the classification. This slowness is very MD: - on identical Intel core2 hardware (ref*.freebsd.org) - i386 underflow is normally slow in all precisions. But when SSE[2] is used, it has no penalty. SSE[2] is the default on core2 for clang in float and double precision, but for gcc it takes unusual CFLAGS to get it. CFLAGS without -march=<some CPU that supports SSE[2]> don't give it even for clang IIRC, since the default is more like -mpentiumpro. - amd64 underflow normally or always has no penalty in float and double precisions, since use of SSE[2] is forced, except possibly with even more unusual CFLAGS. - long doubles always use the i387 and always have a large penalty for underflow. - on AthlonXP and original Athlon64 hardware: - there is no penalty for underflow for any combination of amd64/i386/ i387/SSE[2]/float/double/long double. > Similarly with the computer, is it possible that division by 4 is much faster > than division by (say) 4.4424242389, and that division by 4 is just as fast > as multiplication by 0.25? (And multiplication by 0.25 is faster than > multiplication than 0.235341212412?) Pipelining and large hardware probably inhibits this optimization on modern CPUs. Some x86 has faster integer multiplication (?) for smaller numbers of bits and/or the position of the bits. I've never seen this for FP on x86. Except, there are special FP instructions for smaller numbers of bits in SSE: reciprocal with 12 (?) bits and reciprocal square root with 12 (?) bits. These have about the same latency as addition and multiplication (4 cycles). Probably lower throughput (not 1 per cycle). This many bits can be done by table lookup. Perhaps the full operations start with this fully in hardware, then use Newton's method in microcode, then magic to round. IIRC. ia64 doesn't even have full division in hardware, but it has reciprocals. >> This was tracked to a known annoying difference between SSE and i387 >> on NaNs. x+y clears the sign buit on i387 but not on SSE. SSE is >> correct. > > Was it your code that was wrong, or mine? And if it is mine, what is the > fix? Neither. It is just that different hardware gives different results, and the differences unfortunately show up on the same machine when SSE and i387 are mixed. Even larger differences also show up on sparc64, since long doubles are normally not in hardware; they are normally emulated by library calls and the library is not careful about NaNs; they can also be emulated by trapping of hardware long double instructions and then running different library functions from the trap handler, and these functions (or maybe the hardware part) are more careful about NaNs. >> % rccos: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er = >> 1415.000, #>=1:0.5 = 3881364:4046644 > >> i387 hardware trig functions are very inaccurate. > > But also, what about the problem of when the input is close to one of the > non-trivial roots of sin, cos, etc? As a mathematician, I wouldn't be > shocked if sin(M_PI) was 1e-15 or such like. M_PI is a rational approximation to pi, so it would be a serious error of sin() on it were zero. sin() on it should be the infinite-precision sin() on it, correctly rounded to MANT_DIG bits except possibly for an error of less than 1 ulp. But the i387 only has a 66 (68?) bit internal approximation to pi. Subtracting this gives a large cancelation error: - float precision: 66-24 = 42. _At least_ 24 bits cancel. Some of the remaining 42 bits may cancel, but we would have to be very unlucky for pi to be so close to a representable rational that many of them cancel. In practice, most don't, and we are left with about 42 correct and rounding these to 24 bits is perfect provided we are also not unlucky with the trailing bits being near a half-way case. - double precision: 66-53 = 11. 11 bits possibly correct and 42 probably wrong. - long double precision: 66-64 = 2. 2 bits possibly correct and 62 probably wrong. sin(pi) is an easy case. More intesting is sin((double)(DBL_MAX/pi) * pi), where the multiplication is done in infinite precision and rounded to the closest representable rational. Multiplication by DBL_MAX/pi expands the number of certainly-canceling bits by a huge factor. The required number is approximately DBL_MAX_EXP = 1024. Then we have to know how many more bits may be lost to cancelation due to a multiple of pi being very close to a representable rational. Mathematicians showed using the continued fraction expansion of pi that the closest is about 2**-75 for precisions of interest (I forget if this is for double or ld128 long double precision). So about 1024+75 bits are needed in the approximation to pi. fdlibm msun/e_rem_pio2.c provided the necessary number (actually 1584) in 1994. Most of the extra 500 seem to be unnnecessary (perhaps the 2**-75 number was not well known in 1994). das had to expand this to support ld128. The table is now in msun/k_rem_pio2.c and provides 16560 bits. Unfortunately, the comment before it still says 1585 (actually "396 Hex digits"); this is correct but confusing since the table is extended under an ifdef. Approximately LDBL_MAX_EXP + 75+ = 16384 + 75 bits required for ld128, so there is now less to spare. Later the comment gives the precise number of bits needed for the reduction as (e0-3) + jk*24, where e0 <= 16360 and jk = 6; this is <= 16501, so there are 59 bits to spare. 59 isn't enough, so I don't see why we aren't depending on numerical accidents. I already avoid sign errors for NaNs in many places including hypot*() and atan2*(). This is probably needed for clog() to pass my tests: % Index: e_atan2.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_atan2.c,v % retrieving revision 1.14 % diff -u -1 -r1.14 e_atan2.c % --- e_atan2.c 2 Aug 2008 19:17:00 -0000 1.14 % +++ e_atan2.c 12 Jul 2012 18:31:28 -0000 % @@ -72,3 +72,3 @@ % ((iy|((ly|-ly)>>31))>0x7ff00000)) /* x or y is NaN */ % - return x+y; % + return (x+0.0L)/(y+0); /* quieten sNaNs before mixing */ % if((hx-0x3ff00000|lx)==0) return atan(y); /* x=1.0 */ Adding 0 quietens signaling NaNs before mixing. Otherwise, the result can depend on the quiet bit. (The hardware rule for mixing NaNs is sometimes to compare them in bits and select the largest one. The sign and quiet bits should not be primary in this comparision, but they are top bits and are primary on some hardware. SSE differs from i387 here IIRC). This was already done in some places. The additional hack of adding 0.0L instead of 0 makes the caclulation done in the same (i387) hardware on x86 for all precisions, so that the result doesn't depend on the precision. On i386 with ![clang && -msse2], this makes no difference to the object code, since the i387 is always used anyway, but it takes extra code on amd64, as does adding 0. This is only in an exceptional path so it shouldn't cost any speed. Also make the indentation less unusual. % Index: e_atan2f.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_atan2f.c,v % retrieving revision 1.12 % diff -u -1 -r1.12 e_atan2f.c % --- e_atan2f.c 3 Aug 2008 17:39:54 -0000 1.12 % +++ e_atan2f.c 12 Jul 2012 18:31:50 -0000 % @@ -43,3 +43,3 @@ % (iy>0x7f800000)) /* x or y is NaN */ % - return x+y; % + return (x+0.0L)/(y+0); /* quieten sNaNs before mixing */ % if(hx==0x3f800000) return atanf(y); /* x=1.0 */ Use the same +0.0L and comment and formatting in all precisions. % Index: e_atan2l.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_atan2l.c,v % retrieving revision 1.3 % diff -u -1 -r1.3 e_atan2l.c % --- e_atan2l.c 2 Aug 2008 19:17:00 -0000 1.3 % +++ e_atan2l.c 12 Jul 2012 18:32:55 -0000 % @@ -64,3 +64,3 @@ % ((uy.bits.manh&~LDBL_NBIT)|uy.bits.manl)!=0)) /* y is NaN */ % - return x+y; % + return (x+0.0L)/(y+0); /* quieten sNaNs before mixing */ % if (expsignx==BIAS && ((ux.bits.manh&~LDBL_NBIT)|ux.bits.manl)==0) Formatting was already improved. % Index: e_hypot.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_hypot.c,v % retrieving revision 1.14 % diff -u -1 -r1.14 e_hypot.c % --- e_hypot.c 15 Oct 2011 07:00:28 -0000 1.14 % +++ e_hypot.c 3 Jan 2012 18:00:26 -0000 % @@ -72,3 +72,3 @@ % /* Use original arg order iff result is NaN; quieten sNaNs. */ % - w = fabs(x+0.0)-fabs(y+0.0); % + w = fabsl(x+0.0L)-fabs(y+0); % GET_LOW_WORD(low,a); hypot* already added 0.0. Also change the spelling of 0.0 to 0 wherever possible, so that 0 is added in the same precision wherever the i387 hack is not wanted. % Index: e_hypotf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_hypotf.c,v % retrieving revision 1.14 % diff -u -1 -r1.14 e_hypotf.c % --- e_hypotf.c 15 Oct 2011 07:00:28 -0000 1.14 % +++ e_hypotf.c 3 Jan 2012 18:00:42 -0000 % @@ -39,3 +39,3 @@ % /* Use original arg order iff result is NaN; quieten sNaNs. */ % - w = fabsf(x+0.0F)-fabsf(y+0.0F); % + w = fabsl(x+0.0L)-fabsf(y+0); % if(ha == 0x7f800000) w = a; % Index: e_hypotl.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_hypotl.c,v % retrieving revision 1.3 % diff -u -1 -r1.3 e_hypotl.c % --- e_hypotl.c 16 Oct 2011 05:36:39 -0000 1.3 % +++ e_hypotl.c 4 Jan 2012 05:26:52 -0000 % @@ -66,3 +69,3 @@ % /* Use original arg order iff result is NaN; quieten sNaNs. */ % - w = fabsl(x+0.0)-fabsl(y+0.0); % + w = fabsl(x+0.0L)-fabsl(y+0); % GET_LDBL_MAN(manh,manl,a); % Index: e_pow.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_pow.c,v % retrieving revision 1.14 % diff -u -1 -r1.14 e_pow.c % --- e_pow.c 21 Oct 2011 06:26:07 -0000 1.14 % +++ e_pow.c 3 Jan 2012 18:02:10 -0000 % @@ -117,3 +117,3 @@ % iy > 0x7ff00000 || ((iy==0x7ff00000)&&(ly!=0))) % - return (x+0.0)+(y+0.0); % + return (x+0.0L)+(y+0); % Propagate the +0.0L hack... % Index: e_powf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_powf.c,v % retrieving revision 1.16 % diff -u -1 -r1.16 e_powf.c % --- e_powf.c 21 Oct 2011 06:26:07 -0000 1.16 % +++ e_powf.c 3 Jan 2012 18:02:21 -0000 % @@ -75,3 +75,3 @@ % iy > 0x7f800000) % - return (x+0.0F)+(y+0.0F); % + return (x+0.0L)+(y+0); % % Index: e_remainder.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_remainder.c,v % retrieving revision 1.12 % diff -u -1 -r1.12 e_remainder.c % --- e_remainder.c 30 Mar 2008 20:47:42 -0000 1.12 % +++ e_remainder.c 3 Jan 2012 17:43:21 -0000 % @@ -47,7 +47,7 @@ % /* purge off exception values */ % - if((hp|lp)==0) return (x*p)/(x*p); /* p = 0 */ % - if((hx>=0x7ff00000)|| /* x not finite */ % + if(((hp|lp)==0)|| /* p = 0 */ % + (hx>=0x7ff00000)|| /* x not finite */ % ((hp>=0x7ff00000)&& /* p is NaN */ % (((hp-0x7ff00000)|lp)!=0))) % - return ((long double)x*p)/((long double)x*p); % + return ((x+0.0L)*(p+0))/((x+0.0L)*(p+0)); % This already used a different way of extending to long double. Not so good, since it is MD whether extensions quieten NaNs before mixing. Also, rearrange code so as not to have a special case for p = 0. This might be an optimization for p != 0, but I only wanted the NaN handling to be uniform. % Index: e_remainderf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/e_remainderf.c,v % retrieving revision 1.8 % diff -u -1 -r1.8 e_remainderf.c % --- e_remainderf.c 12 Feb 2008 17:11:36 -0000 1.8 % +++ e_remainderf.c 3 Jan 2012 17:44:06 -0000 % @@ -38,6 +38,6 @@ % /* purge off exception values */ % - if(hp==0) return (x*p)/(x*p); /* p = 0 */ % - if((hx>=0x7f800000)|| /* x not finite */ % + if((hp==0)|| /* p = 0 */ % + (hx>=0x7f800000)|| /* x not finite */ % ((hp>0x7f800000))) /* p is NaN */ % - return ((long double)x*p)/((long double)x*p); % + return ((x+0.0L)*(p+0))/((x+0.0L)*(p+0)); % remainderl() uses remquol(), so e_remainderl.c isn't changed in these patches. % Index: s_csinh.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_csinh.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_csinh.c % --- s_csinh.c 21 Oct 2011 06:29:32 -0000 1.2 % +++ s_csinh.c 3 Jan 2012 06:20:18 -0000 % @@ -27,2 +27,9 @@ % /* % + * XXX TODO: % + * Change x +-I y to x + I (+-y) or vice versa? We currently use the % + * former former for args and the latter for results. % + * s/the invalid floating-point exception/FE_INVALID/g % + */ % + % +/* % * Hyperbolic sine of a complex argument z = x + i y. % @@ -65,3 +72,3 @@ % return (cpack(sinh(x), y)); % - if (ix < 0x40360000) /* small x: normal case */ % + if (ix < 0x40360000) /* |x| < 22: normal case */ % return (cpack(sinh(x) * cos(y), cosh(x) * sin(y))); % @@ -85,12 +92,12 @@ % /* % - * sinh(+-0 +- I Inf) = sign(d(+-0, dNaN))0 + I dNaN. % - * The sign of 0 in the result is unspecified. Choice = normally % - * the same as dNaN. Raise the invalid floating-point exception. % - * % - * sinh(+-0 +- I NaN) = sign(d(+-0, NaN))0 + I d(NaN). % - * The sign of 0 in the result is unspecified. Choice = normally % - * the same as d(NaN). % + * sinh(+-0 +- I Inf) = +-0 + I dNaN. % + * The sign of 0 in the result is unspecified. Choice = same sign % + * as the argument. Raise the invalid floating-point exception. % + * % + * sinh(+-0 +- I NaN) = +-0 + I d(NaN). % + * The sign of 0 in the result is unspecified. Choice = same sign % + * as the argument. % */ % - if ((ix | lx) == 0 && iy >= 0x7ff00000) % - return (cpack(copysign(0, x * (y - y)), y - y)); % + if ((ix | lx) == 0) /* && iy >= 0x7ff00000 */ % + return (cpack(x, y - y)); % % @@ -101,7 +108,4 @@ % */ % - if ((iy | ly) == 0 && ix >= 0x7ff00000) { % - if (((hx & 0xfffff) | lx) == 0) % - return (cpack(x, y)); % - return (cpack(x, copysign(0, y))); % - } % + if ((iy | ly) == 0) /* && ix >= 0x7ff00000 */ % + return (cpack(x + x, y)); % % @@ -115,4 +119,4 @@ % */ % - if (ix < 0x7ff00000 && iy >= 0x7ff00000) % - return (cpack(y - y, x * (y - y))); % + if (ix < 0x7ff00000) /* && iy >= 0x7ff00000 */ % + return (cpack(y - y, y - y)); % % @@ -120,8 +124,8 @@ % * sinh(+-Inf + I NaN) = +-Inf + I d(NaN). % - * The sign of Inf in the result is unspecified. Choice = normally % - * the same as d(NaN). % + * The sign of Inf in the result is unspecified. Choice = same sign % + * as the argument. % * % * sinh(+-Inf +- I Inf) = +Inf + I dNaN. % - * The sign of Inf in the result is unspecified. Choice = always +. % - * Raise the invalid floating-point exception. % + * The sign of Inf in the result is unspecified. Choice = same sign % + * as the argument. Raise the invalid floating-point exception. % * % @@ -129,5 +133,5 @@ % */ % - if (ix >= 0x7ff00000 && ((hx & 0xfffff) | lx) == 0) { % + if (ix == 0x7ff00000 && lx == 0) { % if (iy >= 0x7ff00000) % - return (cpack(x * x, x * (y - y))); % + return (cpack(x, y - y)); % return (cpack(x * cos(y), INFINITY * sin(y))); % @@ -146,3 +150,3 @@ % */ % - return (cpack((x * x) * (y - y), (x + x) * (y - y))); % + return (cpack((x + x) * (y - y), (x * x) * (y - y))); % } % @@ -153,3 +157,3 @@ % % - /* csin(z) = -I * csinh(I * z) */ % + /* csin(z) = -I * csinh(I * z). */ % z = csinh(cpack(-cimag(z), creal(z))); Beginning of cleanups and NaN fixes for committed complex functions. % Index: s_csinhf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_csinhf.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_csinhf.c % --- s_csinhf.c 21 Oct 2011 06:29:32 -0000 1.2 % +++ s_csinhf.c 3 Jan 2012 06:20:30 -0000 % @@ -27,3 +27,3 @@ % /* % - * Hyperbolic sine of a complex argument z. See s_csinh.c for details. % + * Float version of csinh(). See s_csinh.c for details. % */ % @@ -58,3 +58,3 @@ % return (cpackf(sinhf(x), y)); % - if (ix < 0x41100000) /* small x: normal case */ % + if (ix < 0x41100000) /* |x| < 9: normal case */ % return (cpackf(sinhf(x) * cosf(y), coshf(x) * sinf(y))); % @@ -64,3 +64,3 @@ % /* x < 88.7: expf(|x|) won't overflow */ % - h = expf(fabsf(x)) * 0.5f; % + h = expf(fabsf(x)) * 0.5F; % return (cpackf(copysignf(h, x) * cosf(y), h * sinf(y))); % @@ -77,17 +77,14 @@ % % - if (ix == 0 && iy >= 0x7f800000) % - return (cpackf(copysignf(0, x * (y - y)), y - y)); % + if (ix == 0) /* && iy >= 0x7f800000 */ % + return (cpackf(x, y - y)); % % - if (iy == 0 && ix >= 0x7f800000) { % - if ((hx & 0x7fffff) == 0) % - return (cpackf(x, y)); % - return (cpackf(x, copysignf(0, y))); % - } % + if (iy == 0) /* && ix >= 0x7f800000 */ % + return (cpackf(x + x , y)); % % - if (ix < 0x7f800000 && iy >= 0x7f800000) % - return (cpackf(y - y, x * (y - y))); % + if (ix < 0x7f800000) /* && iy >= 0x7f800000 */ % + return (cpackf(y - y, y - y)); % % - if (ix >= 0x7f800000 && (hx & 0x7fffff) == 0) { % + if (ix == 0x7f800000) { % if (iy >= 0x7f800000) % - return (cpackf(x * x, x * (y - y))); % + return (cpackf(x, y - y)); % return (cpackf(x * cosf(y), INFINITY * sinf(y))); % @@ -95,3 +92,3 @@ % % - return (cpackf((x * x) * (y - y), (x + x) * (y - y))); % + return (cpackf((x + x) * (y - y), (x * x) * (y - y))); % } % Index: s_csqrt.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_csqrt.c,v % retrieving revision 1.4 % diff -u -1 -r1.4 s_csqrt.c % --- s_csqrt.c 8 Aug 2008 00:15:16 -0000 1.4 % +++ s_csqrt.c 25 Oct 2011 14:51:10 -0000 % @@ -65,3 +65,3 @@ % t = (b - b) / (b - b); /* raise invalid if b is not a NaN */ % - return (cpack(a, t)); /* return NaN + NaN i */ % + return (cpack(a + a, t)); /* return NaN + NaN i */ % } csqrt*() didn't even quieten NaNs. Many more of the committed complex functions would have failed my tests without fixes like this. The inverse hyerbolic ones somehow have fewer problems with NaNs, perhaps by using atan2() more. % Index: s_csqrtf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_csqrtf.c,v % retrieving revision 1.3 % diff -u -1 -r1.3 s_csqrtf.c % --- s_csqrtf.c 8 Aug 2008 00:15:16 -0000 1.3 % +++ s_csqrtf.c 25 Oct 2011 14:51:21 -0000 % @@ -56,3 +56,3 @@ % t = (b - b) / (b - b); /* raise invalid if b is not a NaN */ % - return (cpackf(a, t)); /* return NaN + NaN i */ % + return (cpackf(a + a, t)); /* return NaN + NaN i */ % } % Index: s_csqrtl.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_csqrtl.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_csqrtl.c % --- s_csqrtl.c 8 Aug 2008 00:15:16 -0000 1.2 % +++ s_csqrtl.c 25 Oct 2011 14:51:29 -0000 % @@ -65,3 +65,3 @@ % t = (b - b) / (b - b); /* raise invalid if b is not a NaN */ % - return (cpackl(a, t)); /* return NaN + NaN i */ % + return (cpackl(a + a, t)); /* return NaN + NaN i */ % } % Index: s_ctanh.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_ctanh.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_ctanh.c % --- s_ctanh.c 21 Oct 2011 06:30:16 -0000 1.2 % +++ s_ctanh.c 25 Oct 2011 14:30:18 -0000 % @@ -87,2 +87,13 @@ % /* % + * XXX this is missing the dNaN/d(NaN) notation, which tells us the % + * following: % + * dNaN is a default NaN unrelated to any NaN args % + * d(NaN) is a unary conversion (usually quieting) of the arg `NaN' % + * % + * XXX everything is missing: % + * d(NaN1, NaN2) and d(NaN, y) % + * which should be used for binary conversions. % + * % + * XXX this misspells I as i. % + * % * ctanh(NaN + i 0) = NaN + i 0 % @@ -104,3 +115,3 @@ % if ((ix & 0xfffff) | lx) /* x is NaN */ % - return (cpack(x, (y == 0 ? y : x * y))); % + return (cpack(x + x, y == 0 ? y : x * y)); % SET_HIGH_WORD(x, hx - 0x40000000); /* x = copysign(1, x) */ ctanh*() needs more work than the others. It also has the largest errors. Some above 6 ulps. % Index: s_ctanhf.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_ctanhf.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_ctanhf.c % --- s_ctanhf.c 21 Oct 2011 06:30:16 -0000 1.2 % +++ s_ctanhf.c 25 Oct 2011 14:30:57 -0000 % @@ -53,3 +53,3 @@ % if (ix & 0x7fffff) % - return (cpackf(x, (y == 0 ? y : x * y))); % + return (cpackf(x + x, y == 0 ? y : x * y)); % SET_FLOAT_WORD(x, hx - 0x40000000); % Index: s_fabsl.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_fabsl.c,v % retrieving revision 1.4 % diff -u -1 -r1.4 s_fabsl.c % --- s_fabsl.c 25 Apr 2012 18:07:35 -0000 1.4 % +++ s_fabsl.c 12 Jul 2012 13:02:28 -0000 % @@ -40,4 +40,4 @@ % u.e = x; % - u.bits.sign = 0; % - return (u.e); % + u.xbits.expsign &= 0x7fff; % + return (u.e + 0); % fabsl() probably shouldn't quieten NaNs, since the hardware normally doesn't. All fabs*() functions are usually inline or in asm, so the fdlibm version is rarely used except for testing it. I didn't hack on fabsf() or fabs() similarly because NaN errors in them didn't shoow up in testing. Also de-pessimize the sign handling a little. compilers tends to produce pessimal code for direct bit-field accesses, with u.bits.sign the worst case. and u.xbits.expsign not too bad. % Index: s_remquo.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_remquo.c,v % retrieving revision 1.3 % diff -u -1 -r1.3 s_remquo.c % --- s_remquo.c 7 Apr 2012 03:59:12 -0000 1.3 % +++ s_remquo.c 12 Jul 2012 13:06:25 -0000 % @@ -46,3 +46,3 @@ % ((hy|((ly|-ly)>>31))>0x7ff00000)) /* or y is NaN */ % - return (x*y)/(x*y); % + return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0)); % if(hx<=hy) { % Index: s_remquof.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_remquof.c,v % retrieving revision 1.2 % diff -u -1 -r1.2 s_remquof.c % --- s_remquof.c 7 Apr 2012 03:59:12 -0000 1.2 % +++ s_remquof.c 12 Jul 2012 13:06:32 -0000 % @@ -43,3 +43,3 @@ % if(hy==0||hx>=0x7f800000||hy>0x7f800000) /* y=0,NaN;or x not finite */ % - return (x*y)/(x*y); % + return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0)); % if(hx<hy) { % Index: s_remquol.c % =================================================================== % RCS file: /home/ncvs/src/lib/msun/src/s_remquol.c,v % retrieving revision 1.3 % diff -u -1 -r1.3 s_remquol.c % --- s_remquol.c 7 Apr 2012 03:59:12 -0000 1.3 % +++ s_remquol.c 12 Jul 2012 13:06:40 -0000 % @@ -81,3 +81,2 @@ % uy.bits.sign = 0; /* |y| */ % - x = ux.e; % Don't clobber x yet, so that the original x can be used in early returns. % @@ -88,3 +87,3 @@ % ((uy.bits.manh&~LDBL_NBIT)|uy.bits.manl)!=0)) /* or y is NaN */ % - return (x*y)/(x*y); % + return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0)); % if(ux.bits.exp<=uy.bits.exp) { The usual hack. % @@ -128,3 +127,2 @@ % q = 0; % - % while(n--) { % @@ -156,5 +154,4 @@ % } % - ux.bits.sign = 0; Remove dead code. % - x = ux.e; % fixup: % + x = ux.e; /* |x| */ % y = fabsl(y); Clobber x later when the main path is joined. It is bad style to re-use a variable. It micro-optimizes for 30 year old compilers. It makes the code harder to writem read and debug. Modern compilers will automatically reuse register and stack resources for variables if this is possible and doesn't interfere too much with debugging. Not reusing variables seems to give slightly better object code more often than slightly worse object code, by simplifying the lifetime analysis needed for this optimization. I only try re-using variables near the start of a function (maybe x = creal(z); use(x); x = fabs(x)), to try to stop the compiler making so many copies of x. This somtimes works. (The typical pessimization avoided by this is when x passed on the stack. gcc likes to copy it to another place on the stack, using pessimal methods, and then never use the original copy. This is good for debugging, but otherwise not very good.). % @@ -169,3 +166,2 @@ % } % - % ux.e = x; % @@ -173,3 +169,2 @@ % x = ux.e; % - % q &= 0x7fffffff; remquo() and remquof() didn't need the rearrangement of the code to preserve x. I blame the direct bit-field accesses in remquol(). Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120814173931.V934>