Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 16 Sep 2012 05:00:41 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Stephen Montgomery-Smith <stephen@missouri.edu>
Cc:        freebsd-numerics@FreeBSD.org
Subject:   Re: Complex arg-trig functions
Message-ID:  <20120916041132.D6344@besplex.bde.org>
In-Reply-To: <5054C200.7090307@missouri.edu>
References:  <5017111E.6060003@missouri.edu> <20120814003614.H3692@besplex.bde.org> <50295F5C.6010800@missouri.edu> <20120814072946.S5260@besplex.bde.org> <50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu> <20120814201105.T934@besplex.bde.org> <502A780B.2010106@missouri.edu> <20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu> <20120906221028.O1542@besplex.bde.org> <5048D00B.8010401@missouri.edu> <504D3CCD.2050006@missouri.edu> <504FF726.9060001@missouri.edu> <20120912191556.F1078@besplex.bde.org> <20120912225847.J1771@besplex.bde.org> <50511B40.3070009@missouri.edu> <20120913204808.T1964@besplex.bde.org> <5051F59C.6000603@missouri.edu> <20120914014208.I2862@besplex.bde.org> <50526050.2070303@missouri.edu> <20120914212403.H1983@besplex.bde.org> <50538E28.6050400@missouri.edu> <20120915231032.C2669@besplex.bde.org> <50548E15.3010405@missouri.edu> <5054C027.2040008@missouri.edu> <5054C200.7090307@missouri.edu>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 15 Sep 2012, Stephen Montgomery-Smith wrote:

> On 09/15/2012 12:51 PM, Stephen Montgomery-Smith wrote:
>> On 09/15/2012 09:17 AM, Stephen Montgomery-Smith wrote:
>>> On 09/15/2012 08:45 AM, Bruce Evans wrote:
>>> 
>>> I had a hard time understanding what exactly the style fixes here were.
>>>   Probably the mail clients (either mine or yours) clobber these.  They
>>> seem like they are to do with the spacing, but I am not sure.
>>> 
>>>> % +B_crossover =        0.6417,        /* suggested by Hull et al */
>
> I am going to guess that you wanted a tab instead of a space.

Right.  This one was a tab and another one was a trailing space.  indent(1)
would have fixed the latter, but it mangles the constant table declarations
completely.

> So, I think I have implemented all of the new changes.
>
> One more thing - my intent is that when they are committed, that catrig*.c 
> should be split into to two files.  One contains casin(h) and cacos(h).  The 
> other contains catan(h).  And then some of those headings surrounded by boxes 
> can be removed.
>
> But that is up to you guys.

Maybe even 3 files.  fdlibm uses one entry point per file to a fault, but
for log* I think I want all 4 standard entry points for all 3 precisions
in 1 file so that tables can be shared.

I think I figured out why cacosf() is faster on i386(core2) when it just
calls cacos().  There are lots of (non-spurious) underflows, for at least
atan2(y, x) where x > 0 and y/x underflows.  I test with logarithmically
uniformly distributed y and x, so the average difference of exponents is
large and this underflow happens in a significant fraction of all cases.
The result always underflows, but with catrigf.c, especially on i386, there
are several more steps with the underflowing y/x: from e_atan2f.c:

% 	...
% 	else if(k<-26&&hx<0) z=0.0; 	/* 0 > |y|/x > -2**-26 */

In this case, the result will be near +-pi, and we must avoid underflow,
which we do by setting z to 0.

% 	else z=atanf(fabsf(y/x));	/* safe to do y/x */

When k<-26 but x>0, the result will be near 0, and may underflow.
On i386, y/x is normally evaluated in extra precision so it doesn't
normally underflow.  Strictly, it should be converted to float
before passing it to fabsf() and underflow then, but fabsf() is
normally a builtin and compiler bugfeatures usually result in no
conversion.  But atanf() is always extern, so the ABI forces a
conversion, giving the exception.  Then if the underflow is gradual,
atanf() sees a denormal value and may give further slowness (but it
should just return this value, so there should be no slowness from it
for underflow, but there might be on some hardware for denormals.
When y/x underflows, atanf() on fabsf() of it is just |y|/x, so the
above could be optimized to avoid the extra conversions and operations.

% 	switch (m) {
% 	    case 0: return       z  ;	/* atan(+,+) */
% 	    case 1: return      -z  ;	/* atan(-,+) */

These 2 cases are when x > 0.  When y < 0, we negate the |y|/x produced
above, having done extra work to mess up the sign of y/x to merge the
cases.  This operation may cause further underflow or denormal
exceptions.  Even returning z unchanged may cause exceptions, depending
on optimizations.  z when it is returned by atanf() is normally in an i387
register on i386 and it will not be denormal there; when fully optimized,
this will be returned unchanged, but if z is stored to memory then there
will be denormal and underflow exceptions for conversions that are part
of the store.

% 	    case 2: return  pi-(z-pi_lo);/* atan(+,-) */
% 	    default: /* case 3 */
% 	    	    return  (z-pi_lo)-pi;/* atan(-,-) */
% 	}

On i386(athlon64), underflows don't cost speed and direct cacosf() is
faster, but amd64(athlon64) acts strangely instead: on amd64(core2),
direct catrigf.c is significantly faster, but on amd64(athlon64) there
is little difference.

On amd64, because it doesn't use extended precision, denormals inherently
cost less because they stay denormal and don't cause underflow exceptions
on conversion back to themselves (or nearby) after load/modify/store,
provied the modification doesn't cause an exception.  I think fabsf()
on a float denormal doesn't cause an exception in SSE or i387, but it
does on i386 iff the result is stored to memory.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120916041132.D6344>