FreeBSD Mail Archives

Date:      Sun, 12 Aug 2012 23:13:18 -0000
From:      Bruce Evans <brde@optusnet.com.au>
To:        Stephen Montgomery-Smith <stephen@missouri.edu>
Cc:        Diane Bruce <db@db.net>, Steve Kargl <sgk@troutmask.apl.washington.edu>, John Baldwin <jhb@freebsd.org>, David Chisnall <theraven@freebsd.org>, Bruce Evans <bde@freebsd.org>, Bruce Evans <brde@optusnet.com.au>, David Schultz <das@freebsd.org>, Peter Jeremy <peter@rulingia.com>, Warner Losh <imp@bsdimp.com>
Subject:   Re: Use of C99 extra long double math functions after r236148
Message-ID:  <20120719164458.G1927@besplex.bde.org>
Resent-Message-ID: <20120812231311.GA20453@server.rulingia.com>
In-Reply-To: <5007826D.7060806@missouri.edu>
References:  <20120714120432.GA70706@server.rulingia.com> <20120717084457.U3890@besplex.bde.org> <5004A5C7.1040405@missouri.edu> <5004DEA9.1050001@missouri.edu> <20120717200931.U6624@besplex.bde.org> <5006D13D.2080702@missouri.edu> <20120718205625.GA409@troutmask.apl.washington.edu> <500725F2.7060603@missouri.edu> <20120719025345.GA1376@troutmask.apl.washington.edu> <50077987.1080307@missouri.edu> <20120719032706.GA1558@troutmask.apl.washington.edu> <5007826D.7060806@missouri.edu>

On Wed, 18 Jul 2012, Stephen Montgomery-Smith wrote:

> On 07/18/2012 10:27 PM, Steve Kargl wrote:
>> On Wed, Jul 18, 2012 at 10:05:43PM -0500, Stephen Montgomery-Smith wrote:
>>> On 07/18/2012 09:53 PM, Steve Kargl wrote:
>>>> 
>>>> The inexact flag will get raised by the fpu, but you need to
>>>> cause the condition.  For your 'sqrt(y*y-1) = y' example,
>>>> you would do something like 'sqrt(y*y-1) = abs(y) - tiny' where
>>>> tiny is much less than abs(y).   Search msun/src for inexact
>>>> (ie., grep -i inexact msun/src/*.c)

Don't worry much about the inexact flag.  The overflow and underflow
flags are much more important.

sqrt(y*y - 1) probably already sets the inexact flag.  An exception
might when y is 1.  Then the sqrt() is exact, but you might still
want to set the inexact flag.  Typical code where you want to set
the inexact flag but it doesn't happen automatically is for sin(x)
~= x when x is tiny.  Then the correctly rounded result is x, but
fdlibm is careful not to just return x.  It sets the inexact flag
using an efficient trick.

A more topical example:

>  if (x==0) {
>    if (fabs(y)<=1) return I*asin(y);
>    else return signum(y)* (
>      log(fabs(y)+sqrt(y*y-1))
>      + I*PI/2);

Here PI/2 is inexact, but if you just return that, then the inexact flag
might not be set.  When y != 1, the sqrt() srts the inexact flag.  Even
when y = 1, PI/2 may or may not set the inexact flag, depending on whether
the division is done at compile time or at runtime, and if it is done
at runtime then on whether PI's lowest bit is 1.  Optimization normally
results in PI/2 beinc done at compile time.

fdlibm already handles almost exactly this code in e_asin.c.  fdlibm
should be copied for this and many other things:

% 	GET_HIGH_WORD(hx,x);
% 	ix = hx&0x7fffffff;
% 	if(ix>= 0x3ff00000) {		/* |x|>= 1 */
% 	    u_int32_t lx;
% 	    GET_LOW_WORD(lx,x);
% 	    if(((ix-0x3ff00000)|lx)==0)
% 		    /* asin(1)=+-pi/2 with inexact */
% 		return x*pio2_hi+x*pio2_lo;

Hmm, it has several subtleties for efficiency:
- the comment is incomplete and should start with asin(+-1)
- x is +-1.  We multiply by this to get the sign right
- we add up our hi and low terms for pi/2 to set the inexact flag
- we multiply each of the hi and lo terms before adding them up to
   avoid optimizations that would result in not setting the inexact
   flag.  The following wouldn't work:
   (1) x * pio2.  x is precisely +-1, so this never sets the inexact flag
   (2) x * (pio2_hi + pio2_lo).  The addition might be and usually is
       done at compile time, so the result is usually the same as (1).
       We could avoid this by making one of the pio2 terms volatile.
       This should give better object code here but worse elsehwere.
       We can avoid it being worse elswhere by using a separate term
       for use here.  But the fdlibm implementors stopped optimizing
       before this point.  This is only 1 very special case and not
       worth optimizing more than it already is.

fdlibm uses at least 3 other methods for setting the inexact flag.  It
tries to choose the optimal method using variables with known values
like x = +-1 above.  Often x is known to be tiny or 0 and it can use
(1 + x) to set the inexact flag iff x is != 0.  Another, stranger,
method is to test if ((int)x == 0).  This sets the inexact flag if x
is tiny and |x| < 1.  This looks worse than the previous method, but
it often works better because with (1 + x) it can be hard to actually
use the result so that it doesn't get optimized away (the integer
conversion method depends on the compiler not be smart enough to
know that the result of the test is always `true'), so that it cannot
be optimized away.  Hardware improvements (better branch prediction)
resulted in the integer conversion method working even better than
when fdlibm preferred it in 1992.

% 	    return (x-x)/(x-x);		/* asin(|x|>1) is NaN */ 
% 	} else if (ix<0x3fe00000) {	/* |x|<0.5 */

>>> Couldn't you do this instead?
>>> 
>>> #include <fenv.h>
>>> 
>>> feraiseexcept(FE_INEXACT)
>> 
>> I haven't checked, but I suspect you're looking at a speed
>> issue.  It's faster to let the hardware raise the flag.
>> It seems that libm only uses the above in the fuse-multiple-add
>> code:

Expect feraiseexcept() to be 100-200 times as slow as fdlibm methods.

>> laptop:kargl[206] grep feraise src/*c
>> src/s_fma.c:            feraiseexcept(FE_INEXACT);
>> src/s_fma.c:                    feraiseexcept(FE_UNDERFLOW);
>> src/s_fmal.c:           feraiseexcept(FE_INEXACT);
>> src/s_fmal.c:                   feraiseexcept(FE_UNDERFLOW);
>> src/s_lround.c:         feraiseexcept(FE_INVALID);

FreeBSD (now non-fdlibm parts) generally uses fenv stuff iff the standard
explictly requires particular rounding or exceptions.  This normally makes
the functions that use it unusable.  The fma() function is unusable even
when it is in hardware.  I get silly results like the following trying
to optimize things using fma() on ia64:

- fma instruction in hardware takes about 1 cycle.  The compiler generates
   it if you write x*y + z
- portable code that wants a correctly rounded result can't use x*y + z.
   My multi-precision code for one case takes 10-20 cycles (it depends on
   x, y and z being in a restrict range or constant)
- fma(x, y, z) is portable, but takes about 50 cycles (49 for function
   call overhead and 1 to do the work).  This is when fma is in hardware,
   and should be fixable by inlining fma.
- if fma() is emulated, then it probably takes hundreds or thousands of
   cycles.  Hundreds for each feraiseexcept(), though it probably doesn't
   always call that.  My use doesn't care about inexact, and the others
   can't happen.

> Still, I think I will use the feraiseexcept function in clog, because speed 
> isn't an issue when nans are involved.  And it does make the code less 
> obscure.

It's OK for development.  Not so OK for testing.  My tests cover NaNs too,
and try not to have special knowledge of exceptional cases, so if the
NaN case takes many of times longer than the usual case it will slow down
the tests significantly.

Bruce

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120719164458.G1927>

Header And Logo

Peripheral Links

Site Navigation