From owner-freebsd-numerics@FreeBSD.ORG Mon Sep 17 17:15:57 2012
Date: Tue, 18 Sep 2012 03:07:00 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Stephen Montgomery-Smith
In-Reply-To: <50563C57.60806@missouri.edu>
Message-ID: <20120918012459.V5094@besplex.bde.org>
Cc: freebsd-numerics@freebsd.org
Subject: Re: Complex arg-trig functions
List-Id: "Discussions of high quality implementation of libm functions."

On Sun, 16 Sep 2012, Stephen Montgomery-Smith wrote:

> On 09/16/2012 03:29 PM, Bruce Evans wrote:
>> On Sun, 16 Sep 2012, Stephen Montgomery-Smith wrote:
>> ...
>> The only sure way to avoid branch mispredictions is to not have any,
>> and catrig is too complicated for that.
>
> Yes, but I did a time test. And in my case the test was almost always
> failing.

I test different data, with an over-emphasis on exceptional cases :-).

>>> On the other hand let me go through the code and see what happens when
>>> |x| is small or |y| is small. There are actually specific formulas
>>> that work well in these two cases, and they are probably not that much
>>> slower than the formulas I decided to remove. And when you chase

I checked a few cases and didn't see any problems, but noticed some more
things that could be handled by general code, giving the following minor
optimizations (only done for float precision).

>>> through all the logic and "if" statements, you may find that you
>>> didn't use up a whole bunch of time for these very special cases of
>>> |z| small - most of the extra time merely being the decisions invoked
>>> by the "if" statements.

Branch prediction is working very well, but I would prefer not to stress
it unnecessarily. The data in my tests is also too uniformly ordered to
stress the branch prediction.
@ --- catrigf.c~	2012-09-17 02:05:43.000000000 +0000
@ +++ catrigf.c	2012-09-17 15:21:59.560420000 +0000
@ @@ -157,12 +157,19 @@
@  	}
@ 
@ -	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) {
@ -			if (signbit(x) == 0)
@ -				w = clog_for_large_values(z) + m_ln2;
@ -			else
@ -				w = clog_for_large_values(-z) + m_ln2;
@ -			return (cpackf(copysignf(crealf(w), x), copysignf(cimagf(w), y)));
@ -		}
@ +	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON) {
@ +		/* clog...() will raise inexact unless x or y is infinite */
@ +		if (signbit(x) == 0)
@ +			w = clog_for_large_values(z) + m_ln2;
@ +		else
@ +			w = clog_for_large_values(-z) + m_ln2;
@ +		return (cpackf(copysignf(crealf(w), x), copysignf(cimagf(w), y)));
@ +	}

Trust the general code (clog()) to raise inexact appropriately. A
previous version of this raised inexact by adding `tiny' to w in the
correct order. realf(w) is large or infinite, so the expression
(realf(w) + tiny + m_ln2) has the same value as (realf(w) + m_ln2) and
raises inexact iff realf(w) != +Inf. But this addition is unnecessary.

@ +
@ +#if 0
@ +	/* XXX the numbers are related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < 2048 * FLT_EPSILON && ay < 2048 * FLT_EPSILON)
@ +		if ((int)ax==0 && (int)ay==0)
@ +			return (z);
@ +#endif

Previous optimization turned off for debugging.

@ 
@  	do_hard_work(ax, ay, &rx, &B_is_usable, &B, &sqrt_A2my2, &new_y);
@ @@ -205,13 +212,19 @@
@  	}
@ 
@ -	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) {
@ -			w = clog_for_large_values(z);
@ -			rx = fabsf(cimagf(w));
@ -			ry = crealf(w) + m_ln2;
@ -			if (sy == 0)
@ -				ry = -ry;
@ -			return (cpackf(rx, ry));
@ -		}
@ +	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON) {
@ +		/* clog...() will raise inexact unless x or y is infinite */
@ +		w = clog_for_large_values(z);
@ +		rx = fabsf(cimagf(w));
@ +		ry = crealf(w) + m_ln2;
@ +		if (sy == 0)
@ +			ry = -ry;
@ +		return (cpackf(rx, ry));
@ +	}

As above.
@ +
@ +#if 0
@ +	/* XXX the number for ay is related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < FLT_EPSILON / 8 && ay < 2048 * FLT_EPSILON)
@ +		return (cpackf(m_pi_2 + tiny, -y));
@ +#endif

Not quite the previous optimization, again turned off for debugging. It
now raises inexact unconditionally by adding tiny to m_pi_2. This seems
to actually be a minor pessimization, but I prefer it since it takes
less code. The version using (int)(1+tiny) has the advantage that its
result is not normally used, while the above is often used; the above
does an extra operation in the often-used path.

@ 
@  	do_hard_work(ay, ax, &ry, &B_is_usable, &B, &sqrt_A2mx2, &new_x);
@ @@ -321,30 +334,28 @@
@  	}
@ 
@ -	if (isinf(x) || isinf(y))
@ -		return (cpackf(copysignf(0, x), copysignf(m_pi_2 + tiny, y)));
@ +	/* Raise inexact unless z == 0; return for z == 0 as a side effect. */
@ +	if ((x == 0 && y == 0) || (int)(1 + tiny) != 1)
@ +		return (z);

Larger optimizations, only done for catanhf(). First, the above removes
the special code for handling infinities. These will be handled by the
"large" case later. Second, it raises inexact up front, except in the
one remaining case (z == 0) where the result is exact (all the other
exact cases involve NaNs; note that cases involving Infs return m_pi_2
for the imaginary part, so they are never exact).

This patch doesn't show the removal of the code for raising inexact in
sum_squares(). cacos*() and casin*() should benefit even more from an
up-front raising of inexact, since do_hard_work() has 7 magic statements
to raise inexact where sum_squares() has only 1.

@ 
@  	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if ((int)(1+tiny)==1)
@ -			return (cpackf(copysignf(real_part_reciprocal(ax, ay), x), copysignf(m_pi_2, y)));
@ +		return (cpackf(copysignf(real_part_reciprocal(ax, ay), x), copysignf(m_pi_2, y)));

Depend on inexact being raised up-front. There are no magic expressions
(int)(1+tiny) left except the new up-front one. There are still
not-so-magic expressions (m_pi_2 + tiny).
BTW, most or all of the recent fixes that use the latter expression
don't have a comment about raising inexact in catrig.c, while most or
all of the older expressions for setting inexact do have such a comment.

A previous version of this optimization raised inexact by adding tiny to
m_pi_2.

@ +
@ +	/* XXX the numbers are related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < 2048 * FLT_EPSILON && ay < 2048 * FLT_EPSILON)
@ +		return (z);

Previous optimization, not turned off for debugging. It is simpler now
that it can depend on inexact being raised up-front.

@ 
@  	if (ax == 1 && ay < FLT_EPSILON) {
@ -		if ((int)ay==0) {
@ -			if (ilogbf(ay) > FLT_MIN_EXP)
@ -				rx = - logf(ay/2) / 2;
@ -			else
@ -				rx = - (logf(ay) - m_ln2) / 2;
@ -		}
@ +		if (ilogbf(ay) > FLT_MIN_EXP)
@ +			rx = - logf(ay/2) / 2;
@ +		else
@ +			rx = - (logf(ay) - m_ln2) / 2;

Depend on inexact being raised up-front. A previous version of this
optimization depended instead on logf() raising inexact appropriately
(since the arg is never 1, the result is always inexact).

@  	} else
@  		rx = log1pf(4*ax / sum_squares(ax-1, ay)) / 4;
@ 
@ -	if (ax == 1) {
@ -		if (ay==0)
@ -			ry = 0;
@ -		else
@ -			ry = atan2f(2, -ay) / 2;
@ -	} else if (ay < FOUR_SQRT_MIN) {
@ -		if ((int)ay==0)
@ -			ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
@ -	} else
@ +	if (ax == 1)
@ +		ry = atan2f(2, ay) / 2;
@ +	else if (ay < FOUR_SQRT_MIN)
@ +		ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
@ +	else
@  		ry = atan2f(2*ay, (1-ax)*(1+ax) - ay*ay) / 2;
@ 

Remove the special case for (ax == 1, ay == 0). The general case gives
the same result. The correctness of this probably depends on the sign
considerations for the next change, and pointed to that change (here the
old code seems to use +0 when y == 0 but -ay otherwise).

Remove the negation of ay for ax == 1. The sign will be copied into the
result later for all cases, so it doesn't matter in the arg. I didn't
check the branch cut details for this, but runtime tests passed. Since
the sign doesn't matter, we could pass y instead of ay.
I don't understand the threshold of FOUR_SQRT_MIN. ay*ay starts
underflowing at SQRT_MIN. FOUR_SQRT_MIN seems to work, and has
efficiency advantages. But large multiples of FOUR_SQRT_MIN also seem to
work, and have larger efficiency advantages ...

I now understand what the threshold should be. You have filtered out
ax == 1. This makes 1 - ax*ax at least ~2*EPSILON, so ay*ay can be
dropped if ay is less than sqrt(2*EPSILON*EPSILON) * 2**-GUARD_DIGITS =
EPSILON * 2**-5, say. SQRT_MIN is way smaller than that, so
FOUR_SQRT_MIN works too. We should use a larger threshold for
efficiency, or avoid the special case for ax == 1. Testing shows that
this analysis is off by a factor of about sqrt(EPSILON), since a
threshold of EPSILON * 2**7 is optimal. The optimization made no
difference to speed; it is just an optimization for understanding.

Maybe the special case for ax == 1 can be avoided, or folded together
with the same special case for the evaluation of the real part. This
special case is similar to the one in clog(), but easier.

Further optimization: in sum_squares(), y is always ay >= 0, so there is
no need to apply fabs*() to it. I think the compiler does this
optimization. It can see that y == ay via the inline. BTW,
do_hard_work() is usually not inlined, so the compiler wouldn't be able
to do such optimizations. However, declaring it as __always_inline
didn't improve the speed.

Bruce