Date: Sun, 12 Aug 2012 23:13:18 -0000 From: Bruce Evans <brde@optusnet.com.au> To: Stephen Montgomery-Smith <stephen@missouri.edu> Cc: Diane Bruce <db@db.net>, Steve Kargl <sgk@troutmask.apl.washington.edu>, John Baldwin <jhb@freebsd.org>, David Chisnall <theraven@freebsd.org>, Bruce Evans <bde@freebsd.org>, Bruce Evans <brde@optusnet.com.au>, David Schultz <das@freebsd.org>, Peter Jeremy <peter@rulingia.com>, Warner Losh <imp@bsdimp.com> Subject: Re: Use of C99 extra long double math functions after r236148 Message-ID: <20120719164458.G1927@besplex.bde.org> Resent-Message-ID: <20120812231311.GA20453@server.rulingia.com> In-Reply-To: <5007826D.7060806@missouri.edu> References: <20120714120432.GA70706@server.rulingia.com> <20120717084457.U3890@besplex.bde.org> <5004A5C7.1040405@missouri.edu> <5004DEA9.1050001@missouri.edu> <20120717200931.U6624@besplex.bde.org> <5006D13D.2080702@missouri.edu> <20120718205625.GA409@troutmask.apl.washington.edu> <500725F2.7060603@missouri.edu> <20120719025345.GA1376@troutmask.apl.washington.edu> <50077987.1080307@missouri.edu> <20120719032706.GA1558@troutmask.apl.washington.edu> <5007826D.7060806@missouri.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, 18 Jul 2012, Stephen Montgomery-Smith wrote: > On 07/18/2012 10:27 PM, Steve Kargl wrote: >> On Wed, Jul 18, 2012 at 10:05:43PM -0500, Stephen Montgomery-Smith wrote: >>> On 07/18/2012 09:53 PM, Steve Kargl wrote: >>>> >>>> The inexact flag will get raised by the fpu, but you need to >>>> cause the condition. For your 'sqrt(y*y-1) = y' example, >>>> you would do something like 'sqrt(y*y-1) = abs(y) - tiny' where >>>> tiny is much less than abs(y). Search msun/src for inexact >>>> (ie., grep -i inexact msun/src/*.c) Don't worry much about the inexact flag. The overflow and underflow flags are much more important. sqrt(y*y - 1) probably already sets the inexact flag. An exception might when y is 1. Then the sqrt() is exact, but you might still want to set the inexact flag. Typical code where you want to set the inexact flag but it doesn't happen automatically is for sin(x) ~= x when x is tiny. Then the correctly rounded result is x, but fdlibm is careful not to just return x. It sets the inexact flag using an efficient trick. A more topical example: > if (x==0) { > if (fabs(y)<=1) return I*asin(y); > else return signum(y)* ( > log(fabs(y)+sqrt(y*y-1)) > + I*PI/2); Here PI/2 is inexact, but if you just return that, then the inexact flag might not be set. When y != 1, the sqrt() srts the inexact flag. Even when y = 1, PI/2 may or may not set the inexact flag, depending on whether the division is done at compile time or at runtime, and if it is done at runtime then on whether PI's lowest bit is 1. Optimization normally results in PI/2 beinc done at compile time. fdlibm already handles almost exactly this code in e_asin.c. fdlibm should be copied for this and many other things: % GET_HIGH_WORD(hx,x); % ix = hx&0x7fffffff; % if(ix>= 0x3ff00000) { /* |x|>= 1 */ % u_int32_t lx; % GET_LOW_WORD(lx,x); % if(((ix-0x3ff00000)|lx)==0) % /* asin(1)=+-pi/2 with inexact */ % return x*pio2_hi+x*pio2_lo; Hmm, it has several subtleties for efficiency: - the comment is incomplete and should start with asin(+-1) - x is +-1. We multiply by this to get the sign right - we add up our hi and low terms for pi/2 to set the inexact flag - we multiply each of the hi and lo terms before adding them up to avoid optimizations that would result in not setting the inexact flag. The following wouldn't work: (1) x * pio2. x is precisely +-1, so this never sets the inexact flag (2) x * (pio2_hi + pio2_lo). The addition might be and usually is done at compile time, so the result is usually the same as (1). We could avoid this by making one of the pio2 terms volatile. This should give better object code here but worse elsehwere. We can avoid it being worse elswhere by using a separate term for use here. But the fdlibm implementors stopped optimizing before this point. This is only 1 very special case and not worth optimizing more than it already is. fdlibm uses at least 3 other methods for setting the inexact flag. It tries to choose the optimal method using variables with known values like x = +-1 above. Often x is known to be tiny or 0 and it can use (1 + x) to set the inexact flag iff x is != 0. Another, stranger, method is to test if ((int)x == 0). This sets the inexact flag if x is tiny and |x| < 1. This looks worse than the previous method, but it often works better because with (1 + x) it can be hard to actually use the result so that it doesn't get optimized away (the integer conversion method depends on the compiler not be smart enough to know that the result of the test is always `true'), so that it cannot be optimized away. Hardware improvements (better branch prediction) resulted in the integer conversion method working even better than when fdlibm preferred it in 1992. % return (x-x)/(x-x); /* asin(|x|>1) is NaN */ % } else if (ix<0x3fe00000) { /* |x|<0.5 */ >>> Couldn't you do this instead? >>> >>> #include <fenv.h> >>> >>> feraiseexcept(FE_INEXACT) >> >> I haven't checked, but I suspect you're looking at a speed >> issue. It's faster to let the hardware raise the flag. >> It seems that libm only uses the above in the fuse-multiple-add >> code: Expect feraiseexcept() to be 100-200 times as slow as fdlibm methods. >> laptop:kargl[206] grep feraise src/*c >> src/s_fma.c: feraiseexcept(FE_INEXACT); >> src/s_fma.c: feraiseexcept(FE_UNDERFLOW); >> src/s_fmal.c: feraiseexcept(FE_INEXACT); >> src/s_fmal.c: feraiseexcept(FE_UNDERFLOW); >> src/s_lround.c: feraiseexcept(FE_INVALID); FreeBSD (now non-fdlibm parts) generally uses fenv stuff iff the standard explictly requires particular rounding or exceptions. This normally makes the functions that use it unusable. The fma() function is unusable even when it is in hardware. I get silly results like the following trying to optimize things using fma() on ia64: - fma instruction in hardware takes about 1 cycle. The compiler generates it if you write x*y + z - portable code that wants a correctly rounded result can't use x*y + z. My multi-precision code for one case takes 10-20 cycles (it depends on x, y and z being in a restrict range or constant) - fma(x, y, z) is portable, but takes about 50 cycles (49 for function call overhead and 1 to do the work). This is when fma is in hardware, and should be fixable by inlining fma. - if fma() is emulated, then it probably takes hundreds or thousands of cycles. Hundreds for each feraiseexcept(), though it probably doesn't always call that. My use doesn't care about inexact, and the others can't happen. > Still, I think I will use the feraiseexcept function in clog, because speed > isn't an issue when nans are involved. And it does make the code less > obscure. It's OK for development. Not so OK for testing. My tests cover NaNs too, and try not to have special knowledge of exceptional cases, so if the NaN case takes many of times longer than the usual case it will slow down the tests significantly. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120719164458.G1927>