Date:      Mon, 17 Sep 2012 06:29:20 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Stephen Montgomery-Smith <stephen@missouri.edu>
Cc:        freebsd-numerics@freebsd.org
Subject:   Re: Complex arg-trig functions
Message-ID:  <20120917060116.G3825@besplex.bde.org>
In-Reply-To: <50562213.9020400@missouri.edu>
References:  <5017111E.6060003@missouri.edu> <502A780B.2010106@missouri.edu> <20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu> <20120906221028.O1542@besplex.bde.org> <5048D00B.8010401@missouri.edu> <504D3CCD.2050006@missouri.edu> <504FF726.9060001@missouri.edu> <20120912191556.F1078@besplex.bde.org> <20120912225847.J1771@besplex.bde.org> <50511B40.3070009@missouri.edu> <20120913204808.T1964@besplex.bde.org> <5051F59C.6000603@missouri.edu> <20120914014208.I2862@besplex.bde.org> <50526050.2070303@missouri.edu> <20120914212403.H1983@besplex.bde.org> <50538E28.6050400@missouri.edu> <20120915231032.C2669@besplex.bde.org> <50548E15.3010405@missouri.edu> <5054C027.2040008@missouri.edu> <5054C200.7090307@missouri.edu> <20120916041132.D6344@besplex.bde.org> <50553424.2080902@missouri.edu> <20120916134730.Y957@besplex.bde.org> <5055ECA8.2080008@missouri.edu> <20120917022614.R2943@besplex.bde.org> <50562213.9020400@missouri.edu>

On Sun, 16 Sep 2012, Stephen Montgomery-Smith wrote:

> On 09/16/2012 11:51 AM, Bruce Evans wrote:
>> 
>> I don't like that.  It will be much slower on almost 1/4 of arg space.
>> The only reason to consider not doing it is that the args that it
>> applies to are not very likely, and optimizing for them may pessimize
>> the usual case.
>
> The pessimization when |z| is not small is tiny.  It takes no time at all to 
> check that |z| is small.

Not necessarily on out-of-order machines (most x86).  The CPU executes
multiple paths speculatively and concurrently.  If it does more on an
unused path, then it might do less on the used path.  It may mispredict
the branch on the size of |z| and thus misguess which path to do more
on.  (I don't know many details of this.  For example, does it do
anything at all on paths predicted to be not taken?)  Losses from this
are usually described as branch mispredictions.  They might cost 20
(50? 100?) cycles after taking about 2 cycles to actually check |z|
(2 cycles pipelined but more like <length of pipe> + 8 in real time,
and it is the latter time that you lose by backing out).

The only sure way to avoid branch mispredictions is to not have any,
and catrig is too complicated for that.

> On the other hand let me go through the code and see what happens when |x| is 
> small or |y| is small.  There are actually specific formulas that work well 
> in these two cases, and they are probably not that much slower than the 
> formulas I decided to remove.  And when you chase through all the logic and 
> "if" statements, you may find that you didn't use up a whole bunch of time 
> for these very special cases of |z| small - most of the extra time merely 
> being the decisions invoked by the "if" statements.

But all general cases end up going through an extern function like
acos() or atan2(), and just calling another function is a significant
overhead.  When |z| is small, the arg(s) to the other function will
probably be a special case for it (e.g., acos(small)).  The other
function should optimize this and not take as long as an average call.
However, since it is special, it may cause branch mispredictions for
other uses of the function.

>> I just found a related optimization for atan2().  For x > 0 and
>> |y|/x < 2**-(MANT_DIG+afew), atan2(y, x) is evaluated as essentially
>> sign(y) * atan(|y|/x).  But in this case, its value is simply y/x
>> with inexact.  Again the optimization applies to almost 1/4 of arg
>> space.  It gains more than the normal overhead of an atan() call by
>> avoiding secondary underflows when y/x underflows.
>
> You see, that is exactly where I don't want to do special optimization in my 
> code.  In my opinion, it is the atan function itself that should realize that 
> |y|/x is small, and hence it is that function that simply returns |y|/x.  Or 
> if you want to implement it at a higher level, atan2 should make this 
> realization, and simply return y/x.

I'm thinking of going the other way and using atan(y/x) instead of atan2()
:-).  This is safe iff we know that y/x is not very special.

> Similarly, I would expect log1p(x) to simply return x (inexactly) for x 
> small.  And if the compiler is really good, I would hope that the two codes:
> log1p(x);
> (fabs(x) < DBL_EPSILON) ? x + set_tiny() : log1p(x);
> would be equivalent.  (But I am rather sure that gcc isn't that good.)
>
> Furthermore, casinh etc are not commonly used functions.  Putting huge 
> amounts of effort looking at special cases to speed it up a little somehow 
> feels wrong to me.  In fact, if the programmer knows that he will be wanting 
> casinh, and evaluated very fast, then he should be motivated enough to try 
> out using z in the case when |z| is small, and see if that really speeds 
> things up.

True.  Now I mainly want it to be fast so that I can test more cases.

Bruce


