From owner-freebsd-numerics@FreeBSD.ORG Mon Sep 17 17:15:57 2012
Date: Tue, 18 Sep 2012 03:07:00 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Stephen Montgomery-Smith
In-Reply-To: <50563C57.60806@missouri.edu>
Message-ID: <20120918012459.V5094@besplex.bde.org>
Cc: freebsd-numerics@freebsd.org
Subject: Re: Complex arg-trig functions
List-Id: "Discussions of high quality implementation of libm functions."

On Sun, 16 Sep 2012, Stephen Montgomery-Smith wrote:

> On 09/16/2012 03:29 PM, Bruce Evans wrote:
>> On Sun, 16 Sep 2012, Stephen Montgomery-Smith wrote:
>> ...
>> The only sure way to avoid branch mispredictions is to not have any,
>> and catrig is too complicated for that.
>
> Yes, but I did a time test. And in my case the test was almost always
> failing.

I test different data, with an over-emphasis on exceptional cases :-).

>>> On the other hand let me go through the code and see what happens when
>>> |x| is small or |y| is small. There are actually specific formulas
>>> that work well in these two cases, and they are probably not that much
>>> slower than the formulas I decided to remove. And when you chase

I checked a few cases and didn't see any problems, but noticed some more
things that could be handled by general code, giving the following minor
optimizations (only done for float precision).

>>> through all the logic and "if" statements, you may find that you
>>> didn't use up a whole bunch of time for these very special cases of
>>> |z| small - most of the extra time merely being the decisions invoked
>>> by the "if" statements.

Branch prediction is working very well, but I would prefer not to stress
it unnecessarily. The data in my tests is also too uniformly ordered to
stress the branch prediction.
@ --- catrigf.c~	2012-09-17 02:05:43.000000000 +0000
@ +++ catrigf.c	2012-09-17 15:21:59.560420000 +0000
@ @@ -157,12 +157,19 @@
@  	}
@ 
@ -	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) {
@ -			if (signbit(x) == 0)
@ -				w = clog_for_large_values(z) + m_ln2;
@ -			else
@ -				w = clog_for_large_values(-z) + m_ln2;
@ -			return (cpackf(copysignf(crealf(w), x), copysignf(cimagf(w), y)));
@ -		}
@ +	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON) {
@ +		/* clog...() will raise inexact unless x or y is infinite */
@ +		if (signbit(x) == 0)
@ +			w = clog_for_large_values(z) + m_ln2;
@ +		else
@ +			w = clog_for_large_values(-z) + m_ln2;
@ +		return (cpackf(copysignf(crealf(w), x), copysignf(cimagf(w), y)));
@ +	}

Trust the general code (clog()) to raise inexact appropriately. A
previous version of this raised inexact by adding `tiny' to w in the
correct order. realf(w) is large or infinite, so the expression
(realf(w) + tiny + m_ln2) has the same value as (realf(w) + m_ln2) and
raises inexact iff realf(w) != +Inf. But this addition is unnecessary.

@ +
@ +#if 0
@ +	/* XXX the numbers are related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < 2048 * FLT_EPSILON && ay < 2048 * FLT_EPSILON)
@ +		if ((int)ax==0 && (int)ay==0)
@ +			return (z);
@ +#endif

Previous optimization turned off for debugging.

@ 
@  	do_hard_work(ax, ay, &rx, &B_is_usable, &B, &sqrt_A2my2, &new_y);
@ @@ -205,13 +212,19 @@
@  	}
@ 
@ -	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if (isinf(x) || isinf(y) || (int)(1+tiny)==1) {
@ -			w = clog_for_large_values(z);
@ -			rx = fabsf(cimagf(w));
@ -			ry = crealf(w) + m_ln2;
@ -			if (sy == 0)
@ -				ry = -ry;
@ -			return (cpackf(rx, ry));
@ -		}
@ +	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON) {
@ +		/* clog...() will raise inexact unless x or y is infinite */
@ +		w = clog_for_large_values(z);
@ +		rx = fabsf(cimagf(w));
@ +		ry = crealf(w) + m_ln2;
@ +		if (sy == 0)
@ +			ry = -ry;
@ +		return (cpackf(rx, ry));
@ +	}

As above.
@ +
@ +#if 0
@ +	/* XXX the number for ay is related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < FLT_EPSILON / 8 && ay < 2048 * FLT_EPSILON)
@ +		return (cpackf(m_pi_2 + tiny, -y));
@ +#endif

Not quite the previous optimization, again turned off for debugging. It
now raises inexact unconditionally by adding tiny to m_pi_2. This seems
to actually be a minor pessimization, but I prefer it since it takes
less code. The version using (int)(1+tiny) has the advantage that its
result is not normally used, while the above is often used; the above
does an extra operation in the often-used path.

@ 
@  	do_hard_work(ay, ax, &ry, &B_is_usable, &B, &sqrt_A2mx2, &new_x);
@ @@ -321,30 +334,28 @@
@  	}
@ 
@ -	if (isinf(x) || isinf(y))
@ -		return (cpackf(copysignf(0, x), copysignf(m_pi_2 + tiny, y)));
@ +	/* Raise inexact unless z == 0; return for z == 0 as a side effect. */
@ +	if ((x == 0 && y == 0) || (int)(1 + tiny) != 1)
@ +		return (z);

Larger optimizations, only done for catanhf(). First, the above removes
the special code for handling infinities. These will be handled by the
"large" case later. Second, it raises inexact up front, except in the
one remaining case (z == 0) where the result is exact (all the other
exact cases involve NaNs; note that cases involving Infs return m_pi_2
for the imaginary part, so they are never exact).

This patch doesn't show the removal of the code for raising inexact in
sum_squares(). cacos*() and casin*() should benefit even more from an
up-front raising of inexact, since do_hard_work() has 7 magic statements
to raise inexact where sum_squares() has only 1.

@ 
@  	if (ax > RECIP_EPSILON || ay > RECIP_EPSILON)
@ -		if ((int)(1+tiny)==1)
@ -			return (cpackf(copysignf(real_part_reciprocal(ax, ay), x), copysignf(m_pi_2, y)));
@ +		return (cpackf(copysignf(real_part_reciprocal(ax, ay), x), copysignf(m_pi_2, y)));

Depend on inexact being raised up-front. There are no magic expressions
(int)(1+tiny) left except the new up-front one. There are still
not-so-magic expressions (m_pi_2 + tiny).
BTW, most or all of the recent fixes that use the latter expression
don't have a comment about raising inexact in catrig.c, while most or
all of the older expressions for setting inexact do have such a comment.

A previous version of this optimization raised inexact by adding tiny to
m_pi_2.

@ +
@ +	/* XXX the numbers are related to sqrt(6 * FLT_EPSILON). */
@ +	if (ax < 2048 * FLT_EPSILON && ay < 2048 * FLT_EPSILON)
@ +		return (z);

Previous optimization, not turned off for debugging. It is simpler now
that it can depend on inexact being raised up-front.

@ 
@  	if (ax == 1 && ay < FLT_EPSILON) {
@ -		if ((int)ay==0) {
@ -			if (ilogbf(ay) > FLT_MIN_EXP)
@ -				rx = - logf(ay/2) / 2;
@ -			else
@ -				rx = - (logf(ay) - m_ln2) / 2;
@ -		}
@ +		if (ilogbf(ay) > FLT_MIN_EXP)
@ +			rx = - logf(ay/2) / 2;
@ +		else
@ +			rx = - (logf(ay) - m_ln2) / 2;

Depend on inexact being raised up-front. A previous version of this
optimization depended instead on logf() raising inexact appropriately
(since the arg is never 1, the result is always inexact).

@  	} else
@  		rx = log1pf(4*ax / sum_squares(ax-1, ay)) / 4;
@ 
@ -	if (ax == 1) {
@ -		if (ay==0)
@ -			ry = 0;
@ -		else
@ -			ry = atan2f(2, -ay) / 2;
@ -	} else if (ay < FOUR_SQRT_MIN) {
@ -		if ((int)ay==0)
@ -			ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
@ -	} else
@ +	if (ax == 1)
@ +		ry = atan2f(2, ay) / 2;
@ +	else if (ay < FOUR_SQRT_MIN)
@ +		ry = atan2f(2*ay, (1-ax)*(1+ax)) / 2;
@ +	else
@  		ry = atan2f(2*ay, (1-ax)*(1+ax) - ay*ay) / 2;
@ 

Remove the special case for (ax == 1, ay == 0). The general case gives
the same result. The correctness of this probably depends on the sign
considerations for the next change, and pointed to that change (here the
old code seems to use +0 when y == 0 but -ay otherwise).

Remove the negation of ay for ax == 1. The sign will be copied into the
result later for all cases, so it doesn't matter in the arg. I didn't
check the branch cut details for this, but runtime tests passed. Since
the sign doesn't matter, we could pass y instead of ay.
I don't understand the threshold of FOUR_SQRT_MIN. ay*ay starts
underflowing at SQRT_MIN. FOUR_SQRT_MIN seems to work, and has
efficiency advantages. But large multiples of FOUR_SQRT_MIN also seem to
work, and have larger efficiency advantages ...

I now understand what the threshold should be. You have filtered out
ax == 1. This makes 1 - ax*ax at least ~2*EPSILON, so ay*ay can be
dropped if ay is less than sqrt(2*EPSILON*EPSILON) * 2**-GUARD_DIGITS =
EPSILON * 2**-5, say. SQRT_MIN is way smaller than that, so
FOUR_SQRT_MIN works too. We should use a larger threshold for
efficiency, or avoid the special case for ax == 1. Testing shows that
this analysis is off by a factor of about sqrt(EPSILON), since a
threshold of EPSILON * 2**7 is optimal. The optimization made no
difference to speed; it is just an optimization for understanding.

Maybe the special case for ax == 1 can be avoided, or folded together
with the same special case for the evaluation of the real part. This
special case is similar to the one in clog(), but easier.

Further optimization: in sum_squares(), y is always ay >= 0, so there is
no need to apply fabs*() to it. I think the compiler does this
optimization. It can see that y == ay via the inline. BTW,
do_hard_work() is usually not inlined, so the compiler wouldn't be able
to do such optimizations. However, declaring it as __always_inline
didn't improve the speed.

Bruce