Date: Thu, 6 Sep 2012 21:42:26 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Stephen Montgomery-Smith <stephen@missouri.edu> Cc: freebsd-numerics@freebsd.org Subject: Re: Complex arg-trig functions Message-ID: <20120906200420.H1056@besplex.bde.org> In-Reply-To: <503ABAC4.3060503@missouri.edu> References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu> <20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu> <20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu> <20120805175106.X3574@besplex.bde.org> <501EC015.3000808@missouri.edu> <20120805191954.GA50379@troutmask.apl.washington.edu> <20120807205725.GA10572@server.rulingia.com> <20120809025220.N4114@besplex.bde.org> <5027F07E.9060409@missouri.edu> <20120814003614.H3692@besplex.bde.org> <50295F5C.6010800@missouri.edu> <20120814072946.S5260@besplex.bde.org> <50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu> <20120814201105.T934@besplex.bde.org> <502A780B.2010106@missouri.edu> <20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu> <503265E8.3060101@missouri.edu> <5036EFDB.3010304@missouri.edu> <503A9A1A.1050004@missouri.edu> <503ABAC4.3060503@missouri.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
I'm back from a holiday. On Sun, 26 Aug 2012, Stephen Montgomery-Smith wrote: > On 08/26/2012 04:50 PM, Stephen Montgomery-Smith wrote: >> On 08/23/2012 10:07 PM, Stephen Montgomery-Smith wrote: >>> I just found out that the boost libraries implement the complex asin >>> function. I think their implementation is more faithful to the paper by >>> Hull et al than my implementation is. It does seem to have a BSD style >>> license. The only problem with it is that it is written in C++. In extremely ugly C++. >> http://www.boost.org/doc/libs/1_51_0/boost/math/complex/asin.hpp > >> Their asin seems to be about 10-15% faster than mine. Their error is >> slightly higher (4.5 ULP instead of 4ULP). > > But the speed improvement are because they use their own log1p function. > When I replace it with the standard log1p, it goes 5% slower than mine. It seems unlikely that C++ is faster :-). In float and long double precision, it might benefit from using builtins for isnan(), etc., but your double version uses private versions of these (except when it calls csqrt() etc., and the callee uses the slow non-builtins). I tested your latest version and found regressions for NaNs. (Also, the float version is less accurate now that it doesn't call double functions, but that is a progression.) 1. casin*() abuses x and y for |x| and |y| and thus loses the sign of NaNs. The quick fix is to reload the original values. The correct fix is to use separate variables for the absolute values, as is done for cacosh*() and catanh*(). 2. NaNs are now filtered out early, but in casin*() and cacos*() this is not before handling large values for the other arg. Perhaps the resulting NaN is correct except for (1). I didn't try applying the fix for (1) in the large-arg case, and just moved moved the NaN checks earlier. Later I noticed that this has other advantages. 3. You were missing my hack to give uniform behaviour in catanh(). The other functions seem to need it more than before. I applied it to all the functions. % diff -u2 catrig.c~ catrig.c % --- catrig.c~ 2012-08-28 02:22:51.000000000 +0000 % +++ catrig.c 2012-09-06 09:09:36.086949000 +0000 % @@ -30,9 +30,13 @@ % #include "math_private.h" % % +/* XXX should use normal fdlibm access macros GET_HIGH_WORD(), etc. */ % #define ISNAN(x) ((x)!=(x)) % #define SIGNBIT(bx) (((bx).parts.msw & 0x80000000) != 0) This is not quite optimal, but better than the non-builtin isnan() etc. that are used in float and long double precision. The above isnan() check is fairly efficient and correct, unlike the extern versions (which are broken and especially slow for long double precision), but normal fdlibm code doesn't do it this way (it checks everything as bits). The above SIGNBIT() check would be written similarly using bits, but with an access macro to read the bits. gcc tends to generate slow code for the direct bitfield access, especially for the sign bit, especially in long doubles, and this is not easy to change without a macro to hide the details. % +/* % + * XXX now that we filter out NaNs as the first step, ISINF() can be % + * simplified to checking that the (biased) exponent is 0x7ff. % + */ % #define ISINF(bx) (((((bx).parts.msw & 0x7fffffff) - 0x7ff00000) | \ % (bx).parts.lsw) == 0) When you do do the bit checks explicitly, it is more obvious that the mantissa bits have already been checked by ISNAN(), so they don't need to be checked again. Optimized fdlibm code would also combine the exponent checking for ISNAN() and ISINF(). % -#define ISFINITE(bx) (((bx).parts.msw & 0x7ff00000) != 0x7ff00000) Previous changes made this unused. ISINF(bx) could be spelled as !ISFINITE(bx) to implement the above optimization, but this is unclear since it only works in contexts where !ISNAN(bx). % % #define MAX_4TH_ROOT 1e75 /* approx pow(DBL_MAX,0.25) */ % @@ -272,13 +276,4 @@ % ay = fabs(y); % % - if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100) % - if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */ % - if (SIGNBIT(bx) == 0) % - w = clog_for_large_values(z) + M_LN2; % - else % - w = clog_for_large_values(-z) + M_LN2; % - return (cpack(copysign(creal(w), x), copysign(cimag(w), y))); % - } % - % if (ISNAN(x) || ISNAN(y)) { % /* casinh(NaN + I*0) = NaN + I*0 */ Nothing changed here. The diff shows the wrong block of code being moved. % @@ -289,7 +284,16 @@ % * the arguments is not NaN, so we opt not to raise it. % */ % - return (cpack(x+y, x+y)); % + return (cpack(x+0.0L+(y+0), x+0.0L+(y+0))); % } I moved this to the start and added uniform NaN mixing, for all functions in all precisions. The NaN mixing is now localized to a single line for each function. It would be easy to hide its details in a macro, without any efficiency problems if you change the macro back to cpack(x+y, x+y). % % + if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100) % + if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */ % + if (SIGNBIT(bx) == 0) % + w = clog_for_large_values(z) + M_LN2; % + else % + w = clog_for_large_values(-z) + M_LN2; % + return (cpack(copysign(creal(w), x), copysign(cimag(w), y))); % + } % + Not really changed. Reducing the ISINF() tests to just checking the exponent depends on all cases with NaNs being filtered out earlier. Similarly for the other functions, in all precisions. % if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100) % if ((int)ax==0 && (int)ay==0) /* raise inexact */ % @@ -345,15 +349,11 @@ % y = fabs(y); % % - if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100) % - if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */ % - w = clog_for_large_values(z); % - rx = fabs(cimag(w)); % - ry = creal(w) + M_LN2; % - if (sy == 0) % - ry = -ry; % - return (cpack(rx, ry)); % - } % - Not changed. Diff messed up the move again. % if (ISNAN(x) || ISNAN(y)) { % + /* % + * XXX unclobber x and y. We should not have abused x and % + * y to hold |x| and |y|. % + */ % + x = creal(z); % + y = cimag(z); Quick fix. If you actually want to toggle the sign[s] of the returned x and y to match the symmetry properties of this and other functions even for NaNs, then more work is needed. copysign() would have to be used, probably in a different way for the real and imaginary parts. C99 doesn't require this much care even for +-0 in combination with a NaN (the sign of the zero is unspecified, presumably because it could reasonably depend on the sign of a NaN arg and nothing specifies what the sign of a NaN does). % /* cacos(0 + I*NaN) = PI/2 + I*NaN */ % if (x == 0) return (cpack(M_PI_2, y+y)); % @@ -363,11 +363,26 @@ % * the arguments is not NaN, so we opt not to raise it. % */ % - return (cpack(x+y, x+y)); % + return (cpack(x+0.0L+(y+0), x+0.0L+(y+0))); Usual uniformization. % } % % + if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100) % + if (ISINF(bx) || ISINF(by) || (int)(1+tiny)==1) { /* raise inexact */ % + w = clog_for_large_values(z); % + rx = fabs(cimag(w)); % + ry = creal(w) + M_LN2; % + if (sy == 0) % + ry = -ry; % + return (cpack(rx, ry)); % + } % + Usual diff mess. % if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100) % if ((int)x==0 && (int)y==0) { /* raise inexact */ % + /* XXX we should use unclobbered x and y here too. */ % if (sy == 0) % return (cpack(M_PI_2 - creal(z), copysign(cimag(z), -1))); % + /* % + * XXX and here, except we should use the clobbered % + * y and not spell it fabs(cimag(z)). % + */ % return (cpack(M_PI_2 - creal(z), fabs(cimag(z)))); creal(z) must be reloaded, but fabs(cimag(z)) is already in y, but this y should be spelled ay. % } % @@ -547,5 +562,5 @@ % * the arguments is not NaN, so we opt not to raise it. % */ % - return (cpack(x+y, x+y)); % + return (cpack(x+0.0L+(y+0), x+0.0L+(y+0))); catanh() didn't need recovering x and y or moving the check for NaNs earlier to be correct. Checking for NaNs earlier might still be an optimization. % } % % diff -c2 catrigf.c~ catrigf.c % --- catrigf.c~ 2012-08-22 02:29:06.000000000 +0000 % +++ catrigf.c 2012-09-06 08:57:14.029522000 +0000 % @@ -167,4 +167,9 @@ % ay = fabsf(y); % % + if (isnan(x) || isnan(y)) { % + if (y == 0) return (cpackf(x+x, y)); % + return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0))); % + } % + % if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100) % if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */ % @@ -176,9 +181,4 @@ % } % % - if (isnan(x) || isnan(y)) { % - if (y == 0) return (cpackf(x+x, y)); % - return (cpackf(x+y, x+y)); % - } % - % if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100) % if ((int)ax==0 && (int)ay==0) /* raise inexact */ % @@ -215,4 +215,11 @@ % y = fabsf(y); % % + if (isnan(x) || isnan(y)) { % + x = crealf(z); % + y = cimagf(z); % + if (x == 0) return (cpackf(M_PI_2, y+y)); % + return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0))); % + } % + % if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100) % if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */ % @@ -225,9 +232,4 @@ % } % % - if (isnan(x) || isnan(y)) { % - if (x == 0) return (cpackf(M_PI_2, y+y)); % - return (cpackf(x+y, x+y)); % - } % - % if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100) % if ((int)x==0 && (int)y==0) { /* raise inexact */ % @@ -336,5 +338,5 @@ % if (x == 0) % return (cpackf(x, y+y)); % - return (cpackf(x+y, x+y)); % + return (cpackf(x+0.0L+(y+0), x+0.0L+(y+0))); % } % % diff -c2 catrigl.c~ catrigl.c % --- catrigl.c~ 2012-08-22 02:29:48.000000000 +0000 % +++ catrigl.c 2012-09-05 16:50:10.684867000 +0000 % @@ -139,4 +139,9 @@ % ay = fabsl(y); % % + if (isnan(x) || isnan(y)) { % + if (y == 0) return (cpackl(x+x, y)); % + return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0))); % + } % + % if (ax > RECIP_SQRT_EPSILON_100 || ay > RECIP_SQRT_EPSILON_100) % if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */ % @@ -148,9 +153,4 @@ % } % % - if (isnan(x) || isnan(y)) { % - if (y == 0) return (cpackl(x+x, y)); % - return (cpackl(x+y, x+y)); % - } % - % if (ax < SQRT_EPSILON_100 && ay < SQRT_EPSILON_100) % if ((int)ax==0 && (int)ay==0) /* raise inexact */ % @@ -187,4 +187,11 @@ % y = fabsl(y); % % + if (isnan(x) || isnan(y)) { % + x = creall(z); % + y = cimagl(z); % + if (x == 0) return (cpackl(L_PI_2, y+y)); % + return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0))); % + } % + % if (x > RECIP_SQRT_EPSILON_100 || y > RECIP_SQRT_EPSILON_100) % if (isinf(x) || isinf(y) || (int)(1+tiny)==1) { /* raise inexact */ % @@ -197,9 +204,4 @@ % } % % - if (isnan(x) || isnan(y)) { % - if (x == 0) return (cpackl(L_PI_2, y+y)); % - return (cpackl(x+y, x+y)); % - } % - % if (x < SQRT_EPSILON_100 && y < SQRT_EPSILON_100) % if ((int)x==0 && (int)y==0) { /* raise inexact */ % @@ -308,5 +310,5 @@ % if (x == 0) % return (cpackl(x, y+y)); % - return (cpackl(x+y, x+y)); % + return (cpackl(x+0.0L+(y+0), x+0.0L+(y+0))); % } % Same changes in all precision. Diff only messed up the move in double precision. More details on the pessimizations for isnan() etc.: - gcc-4.2 (but not gcc-3.3) has builtins for isnan() and signbit(), but apparently not for isinf() or isfinite() (it has builtin even for the last 2, but these just make libcalls - for isnan(x) on at least i386, gcc-4.2 generates the same as for (x) != (x). Efficiency of this is mediocre, except possibly for long doubles (you can often beat it using bit tests, but for long doubles there are lots of bits and special cases. The extern version is broken for long doubles since it doesn't understand the special cases) - for signbit(x) on at least i386, gcc-4.2 generates very good code in float and double precisions, but not so good code in lond double precision. The extern versions are pessimized using a bit-field, in an unusual way: signbit() only returns nonzero if the sign bit is set, but bit-field semantics require a value of +-1, and several operations are needed to convert to this. The builtins are missing these extra operations. But gcc-4.2 pessimizes long doubles even more for the builtin than for bit-fields: @ f: @ pushl %ebp @ movl %esp, %ebp @ movl 16(%ebp), %eax This loads the high 16-bit word of the long double, and also 16-bits of garbage. The garbage is usually not even written to, and may cause a cache miss or write buffer penalty. Versions with bit-fields tend to cause related penalties but not this one. @ andl $32768, %eax @ leave @ ret - only the builtin for isnan() is ever used under FreeBSD, because macros in math.h translate to other names which gcc doesn't understand, except isnan() alone is translated to itself so gcc can still see its name. E.g., from math.h: @ #define isnan(x) \ @ ((sizeof (x) == sizeof (float)) ? __isnanf(x) \ @ : (sizeof (x) == sizeof (double)) ? isnan(x) \ @ : __isnanl(x)) The double isnan() has a null translation for historical reasons, and this accident allows the builtin to work. @ ... @ #ifdef __MATH_BUILTIN_RELOPS @ #define isgreater(x, y) __builtin_isgreater((x), (y)) @ #define isgreaterequal(x, y) __builtin_isgreaterequal((x), (y)) @ #define isless(x, y) __builtin_isless((x), (y)) @ #define islessequal(x, y) __builtin_islessequal((x), (y)) @ #define islessgreater(x, y) __builtin_islessgreater((x), (y)) @ #define isunordered(x, y) __builtin_isunordered((x), (y)) @ #else @ #define isgreater(x, y) (!isunordered((x), (y)) && (x) > (y)) @ #define isgreaterequal(x, y) (!isunordered((x), (y)) && (x) >= (y)) @ #define isless(x, y) (!isunordered((x), (y)) && (x) < (y)) @ #define islessequal(x, y) (!isunordered((x), (y)) && (x) <= (y)) @ #define islessgreater(x, y) (!isunordered((x), (y)) && \ @ ((x) > (y) || (y) > (x))) @ #define isunordered(x, y) (isnan(x) || isnan(y)) @ #endif /* __MATH_BUILTIN_RELOPS */ Builtins are used for these macros, but not for any other similar macros. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120906200420.H1056>