Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 14 Aug 2012 20:09:39 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Stephen Montgomery-Smith <stephen@missouri.edu>
Cc:        freebsd-numerics@freebsd.org
Subject:   Re: Complex arg-trig functions
Message-ID:  <20120814173931.V934@besplex.bde.org>
In-Reply-To: <50297468.20902@missouri.edu>
References:  <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu> <20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu> <20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu> <20120805175106.X3574@besplex.bde.org> <501EC015.3000808@missouri.edu> <20120805191954.GA50379@troutmask.apl.washington.edu> <20120807205725.GA10572@server.rulingia.com> <20120809025220.N4114@besplex.bde.org> <5027F07E.9060409@missouri.edu> <20120814003614.H3692@besplex.bde.org> <50295887.2010608@missouri.edu> <20120814055931.Q4897@besplex.bde.org> <50297468.20902@missouri.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 13 Aug 2012, Stephen Montgomery-Smith wrote:

> On 08/13/2012 04:14 PM, Bruce Evans wrote:
> ...
>> 2. Division tends to be slow, and is slow on x86.  On Phenom, division
>>     has a latency of 20 cycles and a throughput of 1 per 15 cycles in
>>     scalar double precision (17 in vector[2] double precision).  Maybe
>>     the compiler can optimize division by a power of 2 to a multiplication
>>     though.
>
> Does the time taken to perform certain floating point operations depend upon 
> what the numbers are?  I know that when I do long multiplication by hand that 
> (1) it is much faster than long division, and (2) certain numbers (like 1000) 
> are very quick to multiply and divide.

On x86, mostly not.  Denormals can be slow in general, but I've never seen
then causing significant slowness on x86.  I've only seen underflow (and
overflow?) causing signficant slowness.  The easy underflowing cases for
exp*() run about 4 times slower than the normal case although they only
execute 1 underflowing instruction instead of 50-100 normal instructions
after the classification.  This slowness is very MD:
- on identical Intel core2 hardware (ref*.freebsd.org)
   - i386 underflow is normally slow in all precisions.  But when SSE[2]
     is used, it has no penalty.  SSE[2] is the default on core2 for
     clang in float and double precision, but for gcc it takes unusual
     CFLAGS to get it.  CFLAGS without -march=<some CPU that supports
     SSE[2]> don't give it even for clang IIRC, since the default is
     more like -mpentiumpro.
   - amd64 underflow normally or always has no penalty in float and double
     precisions, since use of SSE[2] is forced, except possibly with even
     more unusual CFLAGS.
   - long doubles always use the i387 and always have a large penalty for
     underflow.
- on AthlonXP and original Athlon64 hardware:
   - there is no penalty for underflow for any combination of amd64/i386/
     i387/SSE[2]/float/double/long double.

> Similarly with the computer, is it possible that division by 4 is much faster 
> than division by (say) 4.4424242389, and that division by 4 is just as fast 
> as multiplication by 0.25?  (And multiplication by 0.25 is faster than 
> multiplication than 0.235341212412?)

Pipelining and large hardware probably inhibits this optimization on
modern CPUs.  Some x86 has faster integer multiplication (?) for smaller
numbers of bits and/or the position of the bits.  I've never seen this
for FP on x86.  Except, there are special FP instructions for smaller
numbers of bits in SSE: reciprocal with 12 (?) bits and reciprocal
square root with 12 (?) bits.  These have about the same latency as
addition and multiplication (4 cycles).  Probably lower throughput
(not 1 per cycle).  This many bits can be done by table lookup.  Perhaps
the full operations start with this fully in hardware, then use Newton's
method in microcode, then magic to round.  IIRC. ia64 doesn't even
have full division in hardware, but it has reciprocals.

>> This was tracked to a known annoying difference between SSE and i387
>> on NaNs.  x+y clears the sign buit on i387 but not on SSE.  SSE is
>> correct.
>
> Was it your code that was wrong, or mine?  And if it is mine, what is the 
> fix?

Neither.  It is just that different hardware gives different results,
and the differences unfortunately show up on the same machine when SSE
and i387 are mixed.  Even larger differences also show up on sparc64,
since long doubles are normally not in hardware; they are normally
emulated by library calls and the library is not careful about NaNs;
they can also be emulated by trapping of hardware long double instructions
and then running different library functions from the trap handler, and
these functions (or maybe the hardware part) are more careful about NaNs.

>> % rccos: max_er = 0xb7f10439f3111686 24688206287.5958, avg_er =
>> 1415.000, #>=1:0.5 = 3881364:4046644
>
>> i387 hardware trig functions are very inaccurate.
>
> But also, what about the problem of when the input is close to one of the 
> non-trivial roots of sin, cos, etc?  As a mathematician, I wouldn't be 
> shocked if sin(M_PI) was 1e-15 or such like.

M_PI is a rational approximation to pi, so it would be a serious error of
sin() on it were zero.  sin() on it should be the infinite-precision sin()
on it, correctly rounded to MANT_DIG bits except possibly for an error of
less than 1 ulp.  But the i387 only has a 66 (68?) bit internal
approximation to pi.  Subtracting this gives a large cancelation error:
- float precision: 66-24 = 42.  _At least_ 24 bits cancel.  Some of the
   remaining 42 bits may cancel, but we would have to be very unlucky
   for pi to be so close to a representable rational that many of them
   cancel.  In practice, most don't, and we are left with about 42
   correct and rounding these to 24 bits is perfect provided we are
   also not unlucky with the trailing bits being near a half-way case.
- double precision: 66-53 = 11.  11 bits possibly correct and 42
   probably wrong.
- long double precision: 66-64 = 2.  2 bits possibly correct and 62
   probably wrong.

sin(pi) is an easy case.  More intesting is
sin((double)(DBL_MAX/pi) * pi), where the multiplication is done in
infinite precision and rounded to the closest representable rational.
Multiplication by DBL_MAX/pi expands the number of certainly-canceling
bits by a huge factor.  The required number is approximately
DBL_MAX_EXP = 1024.  Then we have to know how many more bits may be
lost to cancelation due to a multiple of pi being very close to a
representable rational.  Mathematicians showed using the continued
fraction expansion of pi that the closest is about 2**-75 for
precisions of interest (I forget if this is for double or ld128 long
double precision).  So about 1024+75 bits are needed in the
approximation to pi.  fdlibm msun/e_rem_pio2.c provided the
necessary number (actually 1584) in 1994.  Most of the extra 500
seem to be unnnecessary (perhaps the 2**-75 number was not well
known in 1994).  das had to expand this to support ld128.  The
table is now in msun/k_rem_pio2.c and provides 16560 bits.
Unfortunately, the comment before it still says 1585 (actually
"396 Hex digits"); this is correct but confusing since the table
is extended under an ifdef.  Approximately LDBL_MAX_EXP + 75+ =
16384 + 75 bits required for ld128, so there is now less to
spare.  Later the comment gives the precise number of bits needed
for the reduction as (e0-3) + jk*24, where e0 <= 16360 and jk = 6;
this is <= 16501, so there are 59 bits to spare.  59 isn't enough,
so I don't see why we aren't depending on numerical accidents.

I already avoid sign errors for NaNs in many places including
hypot*() and atan2*().  This is probably needed for clog() to
pass my tests:

% Index: e_atan2.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_atan2.c,v
% retrieving revision 1.14
% diff -u -1 -r1.14 e_atan2.c
% --- e_atan2.c	2 Aug 2008 19:17:00 -0000	1.14
% +++ e_atan2.c	12 Jul 2012 18:31:28 -0000
% @@ -72,3 +72,3 @@
%  	   ((iy|((ly|-ly)>>31))>0x7ff00000))	/* x or y is NaN */
% -	   return x+y;
% +	    return (x+0.0L)/(y+0);	/* quieten sNaNs before mixing */
%  	if((hx-0x3ff00000|lx)==0) return atan(y);   /* x=1.0 */

Adding 0 quietens signaling NaNs before mixing.  Otherwise, the result
can depend on the quiet bit.  (The hardware rule for mixing NaNs is
sometimes to compare them in bits and select the largest one.  The
sign and quiet bits should not be primary in this comparision, but
they are top bits and are primary on some hardware.  SSE differs from
i387 here IIRC).

This was already done in some places.  The additional hack of adding
0.0L instead of 0 makes the caclulation done in the same (i387) hardware
on x86 for all precisions, so that the result doesn't depend on the
precision.  On i386 with ![clang && -msse2], this makes no difference
to the object code, since the i387 is always used anyway, but it takes
extra code on amd64, as does adding 0.  This is only in an exceptional
path so it shouldn't cost any speed.

Also make the indentation less unusual.

% Index: e_atan2f.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_atan2f.c,v
% retrieving revision 1.12
% diff -u -1 -r1.12 e_atan2f.c
% --- e_atan2f.c	3 Aug 2008 17:39:54 -0000	1.12
% +++ e_atan2f.c	12 Jul 2012 18:31:50 -0000
% @@ -43,3 +43,3 @@
%  	   (iy>0x7f800000))	/* x or y is NaN */
% -	   return x+y;
% +	    return (x+0.0L)/(y+0);	/* quieten sNaNs before mixing */
%  	if(hx==0x3f800000) return atanf(y);   /* x=1.0 */

Use the same +0.0L and comment and formatting in all precisions.

% Index: e_atan2l.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_atan2l.c,v
% retrieving revision 1.3
% diff -u -1 -r1.3 e_atan2l.c
% --- e_atan2l.c	2 Aug 2008 19:17:00 -0000	1.3
% +++ e_atan2l.c	12 Jul 2012 18:32:55 -0000
% @@ -64,3 +64,3 @@
%  	     ((uy.bits.manh&~LDBL_NBIT)|uy.bits.manl)!=0))	/* y is NaN */
% -	    return x+y;
% +	    return (x+0.0L)/(y+0);	/* quieten sNaNs before mixing */
%  	if (expsignx==BIAS && ((ux.bits.manh&~LDBL_NBIT)|ux.bits.manl)==0)

Formatting was already improved.

% Index: e_hypot.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_hypot.c,v
% retrieving revision 1.14
% diff -u -1 -r1.14 e_hypot.c
% --- e_hypot.c	15 Oct 2011 07:00:28 -0000	1.14
% +++ e_hypot.c	3 Jan 2012 18:00:26 -0000
% @@ -72,3 +72,3 @@
%  	       /* Use original arg order iff result is NaN; quieten sNaNs. */
% -	       w = fabs(x+0.0)-fabs(y+0.0);
% +	       w = fabsl(x+0.0L)-fabs(y+0);
%  	       GET_LOW_WORD(low,a);

hypot* already added 0.0.

Also change the spelling of 0.0 to 0 wherever possible, so that 0 is
added in the same precision wherever the i387 hack is not wanted.

% Index: e_hypotf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_hypotf.c,v
% retrieving revision 1.14
% diff -u -1 -r1.14 e_hypotf.c
% --- e_hypotf.c	15 Oct 2011 07:00:28 -0000	1.14
% +++ e_hypotf.c	3 Jan 2012 18:00:42 -0000
% @@ -39,3 +39,3 @@
%  	       /* Use original arg order iff result is NaN; quieten sNaNs. */
% -	       w = fabsf(x+0.0F)-fabsf(y+0.0F);
% +	       w = fabsl(x+0.0L)-fabsf(y+0);
%  	       if(ha == 0x7f800000) w = a;
% Index: e_hypotl.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_hypotl.c,v
% retrieving revision 1.3
% diff -u -1 -r1.3 e_hypotl.c
% --- e_hypotl.c	16 Oct 2011 05:36:39 -0000	1.3
% +++ e_hypotl.c	4 Jan 2012 05:26:52 -0000
% @@ -66,3 +69,3 @@
%  	       /* Use original arg order iff result is NaN; quieten sNaNs. */
% -	       w = fabsl(x+0.0)-fabsl(y+0.0);
% +	       w = fabsl(x+0.0L)-fabsl(y+0);
%  	       GET_LDBL_MAN(manh,manl,a);
% Index: e_pow.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_pow.c,v
% retrieving revision 1.14
% diff -u -1 -r1.14 e_pow.c
% --- e_pow.c	21 Oct 2011 06:26:07 -0000	1.14
% +++ e_pow.c	3 Jan 2012 18:02:10 -0000
% @@ -117,3 +117,3 @@
%  	   iy > 0x7ff00000 || ((iy==0x7ff00000)&&(ly!=0))) 
% -		return (x+0.0)+(y+0.0);
% +		return (x+0.0L)+(y+0);
%

Propagate the +0.0L hack...

% Index: e_powf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_powf.c,v
% retrieving revision 1.16
% diff -u -1 -r1.16 e_powf.c
% --- e_powf.c	21 Oct 2011 06:26:07 -0000	1.16
% +++ e_powf.c	3 Jan 2012 18:02:21 -0000
% @@ -75,3 +75,3 @@
%  	   iy > 0x7f800000)
% -		return (x+0.0F)+(y+0.0F);
% +		return (x+0.0L)+(y+0);
% 
% Index: e_remainder.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_remainder.c,v
% retrieving revision 1.12
% diff -u -1 -r1.12 e_remainder.c
% --- e_remainder.c	30 Mar 2008 20:47:42 -0000	1.12
% +++ e_remainder.c	3 Jan 2012 17:43:21 -0000
% @@ -47,7 +47,7 @@
%      /* purge off exception values */
% -	if((hp|lp)==0) return (x*p)/(x*p); 	/* p = 0 */
% -	if((hx>=0x7ff00000)||			/* x not finite */
% +	if(((hp|lp)==0)||		 	/* p = 0 */
% +	  (hx>=0x7ff00000)||			/* x not finite */
%  	  ((hp>=0x7ff00000)&&			/* p is NaN */
%  	  (((hp-0x7ff00000)|lp)!=0)))
% -	    return ((long double)x*p)/((long double)x*p);
% +	    return ((x+0.0L)*(p+0))/((x+0.0L)*(p+0));
%

This already used a different way of extending to long double.  Not
so good, since it is MD whether extensions quieten NaNs before mixing.

Also, rearrange code so as not to have a special case for p = 0.
This might be an optimization for p != 0, but I only wanted the NaN
handling to be uniform.

% Index: e_remainderf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/e_remainderf.c,v
% retrieving revision 1.8
% diff -u -1 -r1.8 e_remainderf.c
% --- e_remainderf.c	12 Feb 2008 17:11:36 -0000	1.8
% +++ e_remainderf.c	3 Jan 2012 17:44:06 -0000
% @@ -38,6 +38,6 @@
%      /* purge off exception values */
% -	if(hp==0) return (x*p)/(x*p);	 	/* p = 0 */
% -	if((hx>=0x7f800000)||			/* x not finite */
% +	if((hp==0)||			 	/* p = 0 */
% +	  (hx>=0x7f800000)||			/* x not finite */
%  	  ((hp>0x7f800000)))			/* p is NaN */
% -	    return ((long double)x*p)/((long double)x*p);
% +	    return ((x+0.0L)*(p+0))/((x+0.0L)*(p+0));
%

remainderl() uses remquol(), so e_remainderl.c isn't changed in these
patches.

% Index: s_csinh.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_csinh.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_csinh.c
% --- s_csinh.c	21 Oct 2011 06:29:32 -0000	1.2
% +++ s_csinh.c	3 Jan 2012 06:20:18 -0000
% @@ -27,2 +27,9 @@
%  /*
% + * XXX TODO:
% + * Change x +-I y to x + I (+-y) or vice versa?  We currently use the
% + * former former for args and the latter for results.
% + * s/the invalid floating-point exception/FE_INVALID/g
% + */
% +
% +/*
%   * Hyperbolic sine of a complex argument z = x + i y.
% @@ -65,3 +72,3 @@
%  			return (cpack(sinh(x), y));
% -		if (ix < 0x40360000)	/* small x: normal case */
% +		if (ix < 0x40360000)	/* |x| < 22: normal case */
%  			return (cpack(sinh(x) * cos(y), cosh(x) * sin(y)));
% @@ -85,12 +92,12 @@
%  	/*
% -	 * sinh(+-0 +- I Inf) = sign(d(+-0, dNaN))0 + I dNaN.
% -	 * The sign of 0 in the result is unspecified.  Choice = normally
% -	 * the same as dNaN.  Raise the invalid floating-point exception.
% -	 *
% -	 * sinh(+-0 +- I NaN) = sign(d(+-0, NaN))0 + I d(NaN).
% -	 * The sign of 0 in the result is unspecified.  Choice = normally
% -	 * the same as d(NaN).
% +	 * sinh(+-0 +- I Inf) = +-0 + I dNaN.
% +	 * The sign of 0 in the result is unspecified.  Choice = same sign
% +	 * as the argument.  Raise the invalid floating-point exception.
% +	 *
% +	 * sinh(+-0 +- I NaN) = +-0 + I d(NaN).
% +	 * The sign of 0 in the result is unspecified.  Choice = same sign
% +	 * as the argument.
%  	 */
% -	if ((ix | lx) == 0 && iy >= 0x7ff00000)
% -		return (cpack(copysign(0, x * (y - y)), y - y));
% +	if ((ix | lx) == 0)		/* && iy >= 0x7ff00000 */
% +		return (cpack(x, y - y));
% 
% @@ -101,7 +108,4 @@
%  	 */
% -	if ((iy | ly) == 0 && ix >= 0x7ff00000) {
% -		if (((hx & 0xfffff) | lx) == 0)
% -			return (cpack(x, y));
% -		return (cpack(x, copysign(0, y)));
% -	}
% +	if ((iy | ly) == 0)		/* && ix >= 0x7ff00000 */
% +		return (cpack(x + x, y));
% 
% @@ -115,4 +119,4 @@
%  	 */
% -	if (ix < 0x7ff00000 && iy >= 0x7ff00000)
% -		return (cpack(y - y, x * (y - y)));
% +	if (ix < 0x7ff00000)		/* && iy >= 0x7ff00000 */
% +		return (cpack(y - y, y - y));
% 
% @@ -120,8 +124,8 @@
%  	 * sinh(+-Inf + I NaN)  = +-Inf + I d(NaN).
% -	 * The sign of Inf in the result is unspecified.  Choice = normally
% -	 * the same as d(NaN).
% +	 * The sign of Inf in the result is unspecified.  Choice = same sign
% +	 * as the argument.
%  	 *
%  	 * sinh(+-Inf +- I Inf) = +Inf + I dNaN.
% -	 * The sign of Inf in the result is unspecified.  Choice = always +.
% -	 * Raise the invalid floating-point exception.
% +	 * The sign of Inf in the result is unspecified.  Choice = same sign
% +	 * as the argument.  Raise the invalid floating-point exception.
%  	 *
% @@ -129,5 +133,5 @@
%  	 */
% -	if (ix >= 0x7ff00000 && ((hx & 0xfffff) | lx) == 0) {
% +	if (ix == 0x7ff00000 && lx == 0) {
%  		if (iy >= 0x7ff00000)
% -			return (cpack(x * x, x * (y - y)));
% +			return (cpack(x, y - y));
%  		return (cpack(x * cos(y), INFINITY * sin(y)));
% @@ -146,3 +150,3 @@
%  	 */
% -	return (cpack((x * x) * (y - y), (x + x) * (y - y)));
% +	return (cpack((x + x) * (y - y), (x * x) * (y - y)));
%  }
% @@ -153,3 +157,3 @@
% 
% -	/* csin(z) = -I * csinh(I * z) */
% +	/* csin(z) = -I * csinh(I * z). */
%  	z = csinh(cpack(-cimag(z), creal(z)));

Beginning of cleanups and NaN fixes for committed complex functions.

% Index: s_csinhf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_csinhf.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_csinhf.c
% --- s_csinhf.c	21 Oct 2011 06:29:32 -0000	1.2
% +++ s_csinhf.c	3 Jan 2012 06:20:30 -0000
% @@ -27,3 +27,3 @@
%  /*
% - * Hyperbolic sine of a complex argument z.  See s_csinh.c for details.
% + * Float version of csinh().  See s_csinh.c for details.
%   */
% @@ -58,3 +58,3 @@
%  			return (cpackf(sinhf(x), y));
% -		if (ix < 0x41100000)	/* small x: normal case */
% +		if (ix < 0x41100000)	/* |x| < 9: normal case */
%  			return (cpackf(sinhf(x) * cosf(y), coshf(x) * sinf(y)));
% @@ -64,3 +64,3 @@
%  			/* x < 88.7: expf(|x|) won't overflow */
% -			h = expf(fabsf(x)) * 0.5f;
% +			h = expf(fabsf(x)) * 0.5F;
%  			return (cpackf(copysignf(h, x) * cosf(y), h * sinf(y)));
% @@ -77,17 +77,14 @@
% 
% -	if (ix == 0 && iy >= 0x7f800000)
% -		return (cpackf(copysignf(0, x * (y - y)), y - y));
% +	if (ix == 0)			/* && iy >= 0x7f800000 */
% +		return (cpackf(x, y - y));
% 
% -	if (iy == 0 && ix >= 0x7f800000) {
% -		if ((hx & 0x7fffff) == 0)
% -			return (cpackf(x, y));
% -		return (cpackf(x, copysignf(0, y)));
% -	}
% +	if (iy == 0)			/* && ix >= 0x7f800000 */
% +		return (cpackf(x + x , y));
% 
% -	if (ix < 0x7f800000 && iy >= 0x7f800000)
% -		return (cpackf(y - y, x * (y - y)));
% +	if (ix < 0x7f800000)		/* && iy >= 0x7f800000 */
% +		return (cpackf(y - y, y - y));
% 
% -	if (ix >= 0x7f800000 && (hx & 0x7fffff) == 0) {
% +	if (ix == 0x7f800000) {
%  		if (iy >= 0x7f800000)
% -			return (cpackf(x * x, x * (y - y)));
% +			return (cpackf(x, y - y));
%  		return (cpackf(x * cosf(y), INFINITY * sinf(y)));
% @@ -95,3 +92,3 @@
% 
% -	return (cpackf((x * x) * (y - y), (x + x) * (y - y)));
% +	return (cpackf((x + x) * (y - y), (x * x) * (y - y)));
%  }
% Index: s_csqrt.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_csqrt.c,v
% retrieving revision 1.4
% diff -u -1 -r1.4 s_csqrt.c
% --- s_csqrt.c	8 Aug 2008 00:15:16 -0000	1.4
% +++ s_csqrt.c	25 Oct 2011 14:51:10 -0000
% @@ -65,3 +65,3 @@
%  		t = (b - b) / (b - b);	/* raise invalid if b is not a NaN */
% -		return (cpack(a, t));	/* return NaN + NaN i */
% +		return (cpack(a + a, t));	/* return NaN + NaN i */
%  	}

csqrt*() didn't even quieten NaNs.  Many more of the committed complex
functions would have failed my tests without fixes like this.  The
inverse hyerbolic ones somehow have fewer problems with NaNs, perhaps
by using atan2() more.

% Index: s_csqrtf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_csqrtf.c,v
% retrieving revision 1.3
% diff -u -1 -r1.3 s_csqrtf.c
% --- s_csqrtf.c	8 Aug 2008 00:15:16 -0000	1.3
% +++ s_csqrtf.c	25 Oct 2011 14:51:21 -0000
% @@ -56,3 +56,3 @@
%  		t = (b - b) / (b - b);	/* raise invalid if b is not a NaN */
% -		return (cpackf(a, t));	/* return NaN + NaN i */
% +		return (cpackf(a + a, t));	/* return NaN + NaN i */
%  	}
% Index: s_csqrtl.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_csqrtl.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_csqrtl.c
% --- s_csqrtl.c	8 Aug 2008 00:15:16 -0000	1.2
% +++ s_csqrtl.c	25 Oct 2011 14:51:29 -0000
% @@ -65,3 +65,3 @@
%  		t = (b - b) / (b - b);	/* raise invalid if b is not a NaN */
% -		return (cpackl(a, t));	/* return NaN + NaN i */
% +		return (cpackl(a + a, t));	/* return NaN + NaN i */
%  	}
% Index: s_ctanh.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_ctanh.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_ctanh.c
% --- s_ctanh.c	21 Oct 2011 06:30:16 -0000	1.2
% +++ s_ctanh.c	25 Oct 2011 14:30:18 -0000
% @@ -87,2 +87,13 @@
%  	/*
% +	 * XXX this is missing the dNaN/d(NaN) notation, which tells us the
% +	 * following:
% +	 *   dNaN is a default NaN unrelated to any NaN args
% +	 *   d(NaN) is a unary conversion (usually quieting) of the arg `NaN'
% +	 *
% +	 * XXX everything is missing:
% +	 *   d(NaN1, NaN2) and d(NaN, y)
% +	 * which should be used for binary conversions.
% +	 *
% +	 * XXX this misspells I as i.
% +	 *
%  	 * ctanh(NaN + i 0) = NaN + i 0
% @@ -104,3 +115,3 @@
%  		if ((ix & 0xfffff) | lx)	/* x is NaN */
% -			return (cpack(x, (y == 0 ? y : x * y)));
% +			return (cpack(x + x, y == 0 ? y : x * y));
%  		SET_HIGH_WORD(x, hx - 0x40000000);	/* x = copysign(1, x) */

ctanh*() needs more work than the others.  It also has the largest errors.
Some above 6 ulps.

% Index: s_ctanhf.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_ctanhf.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_ctanhf.c
% --- s_ctanhf.c	21 Oct 2011 06:30:16 -0000	1.2
% +++ s_ctanhf.c	25 Oct 2011 14:30:57 -0000
% @@ -53,3 +53,3 @@
%  		if (ix & 0x7fffff)
% -			return (cpackf(x, (y == 0 ? y : x * y)));
% +			return (cpackf(x + x, y == 0 ? y : x * y));
%  		SET_FLOAT_WORD(x, hx - 0x40000000);
% Index: s_fabsl.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_fabsl.c,v
% retrieving revision 1.4
% diff -u -1 -r1.4 s_fabsl.c
% --- s_fabsl.c	25 Apr 2012 18:07:35 -0000	1.4
% +++ s_fabsl.c	12 Jul 2012 13:02:28 -0000
% @@ -40,4 +40,4 @@
%  	u.e = x;
% -	u.bits.sign = 0;
% -	return (u.e);
% +	u.xbits.expsign &= 0x7fff;
% +	return (u.e + 0);
%

fabsl() probably shouldn't quieten NaNs, since the hardware normally
doesn't.  All fabs*() functions are usually inline or in asm, so the
fdlibm version is rarely used except for testing it.  I didn't hack
on fabsf() or fabs() similarly because NaN errors in them didn't
shoow up in testing.

Also de-pessimize the sign handling a little.  compilers tends to
produce pessimal code for direct bit-field accesses, with u.bits.sign
the worst case.  and u.xbits.expsign not too bad.

% Index: s_remquo.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_remquo.c,v
% retrieving revision 1.3
% diff -u -1 -r1.3 s_remquo.c
% --- s_remquo.c	7 Apr 2012 03:59:12 -0000	1.3
% +++ s_remquo.c	12 Jul 2012 13:06:25 -0000
% @@ -46,3 +46,3 @@
%  	  ((hy|((ly|-ly)>>31))>0x7ff00000))	/* or y is NaN */
% -	    return (x*y)/(x*y);
% +	    return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0));
%  	if(hx<=hy) {
% Index: s_remquof.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_remquof.c,v
% retrieving revision 1.2
% diff -u -1 -r1.2 s_remquof.c
% --- s_remquof.c	7 Apr 2012 03:59:12 -0000	1.2
% +++ s_remquof.c	12 Jul 2012 13:06:32 -0000
% @@ -43,3 +43,3 @@
%  	if(hy==0||hx>=0x7f800000||hy>0x7f800000) /* y=0,NaN;or x not finite */
% -	    return (x*y)/(x*y);
% +	    return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0));
%  	if(hx<hy) {
% Index: s_remquol.c
% ===================================================================
% RCS file: /home/ncvs/src/lib/msun/src/s_remquol.c,v
% retrieving revision 1.3
% diff -u -1 -r1.3 s_remquol.c
% --- s_remquol.c	7 Apr 2012 03:59:12 -0000	1.3
% +++ s_remquol.c	12 Jul 2012 13:06:40 -0000
% @@ -81,3 +81,2 @@
%  	uy.bits.sign = 0;	/* |y| */
% -	x = ux.e;
%

Don't clobber x yet, so that the original x can be used in early returns.

% @@ -88,3 +87,3 @@
%  	    ((uy.bits.manh&~LDBL_NBIT)|uy.bits.manl)!=0)) /* or y is NaN */
% -	    return (x*y)/(x*y);
% +	    return ((x+0.0L)*(y+0))/((x+0.0L)*(y+0));
%  	if(ux.bits.exp<=uy.bits.exp) {

The usual hack.

% @@ -128,3 +127,2 @@
%  	q = 0;
% -
%  	while(n--) {
% @@ -156,5 +154,4 @@
%  	}
% -	ux.bits.sign = 0;

Remove dead code.

% -	x = ux.e;
%  fixup:
% +	x = ux.e;		/* |x| */
%  	y = fabsl(y);

Clobber x later when the main path is joined.

It is bad style to re-use a variable.  It micro-optimizes for 30 year
old compilers.  It makes the code harder to writem read and debug.
Modern compilers will automatically reuse register and stack resources
for variables if this is possible and doesn't interfere too much with
debugging.  Not reusing variables seems to give slightly better object
code more often than slightly worse object code, by simplifying the
lifetime analysis needed for this optimization.  I only try re-using
variables near the start of a function (maybe x = creal(z); use(x);
x = fabs(x)), to try to stop the compiler making so many copies of x.
This somtimes works.  (The typical pessimization avoided by this is
when x passed on the stack.  gcc likes to copy it to another place
on the stack, using pessimal methods, and then never use the original
copy.  This is good for debugging, but otherwise not very good.).

% @@ -169,3 +166,2 @@
%  	}
% -
%  	ux.e = x;
% @@ -173,3 +169,2 @@
%  	x = ux.e;
% -
%  	q &= 0x7fffffff;

remquo() and remquof() didn't need the rearrangement of the code to
preserve x.  I blame the direct bit-field accesses in remquol().

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120814173931.V934>