From owner-freebsd-numerics@FreeBSD.ORG Wed May 29 20:25:41 2013
Date: Thu, 30 May 2013 06:25:31 +1000 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl
Cc: freebsd-numerics@freebsd.org, Bruce Evans
Subject: Re: Patches for s_expl.c
In-Reply-To: <20130529162441.GA58773@troutmask.apl.washington.edu>
Message-ID: <20130530045951.Y4776@besplex.bde.org>
References: <20130528172242.GA51485@troutmask.apl.washington.edu>
 <20130529062437.V4648@besplex.bde.org>
 <20130529162441.GA58773@troutmask.apl.washington.edu>
List-Id: "Discussions of high quality implementation of libm functions."

On Wed, 29 May 2013, Steve Kargl wrote:

> On Wed, May 29, 2013 at 07:39:04AM +1000, Bruce Evans wrote:
>> On Tue, 28 May 2013, Steve Kargl wrote:
>>
>>> Here are two patches for ld80/s_expl.c and ld128/s_expl.c.
>>> Instead of committing the one large patch that I have spent
>>> hours testing, I have split it into two. One patch fixes/updates
>>> expl(). The other patch is the implementation of expm1l().
> ...
>>> * Rename polynomial coefficients P# to A#, which is used in Tang.
>>
>> Almost all the declarations of polynomial coefficients are still
>> formatted in a nonstandard way, but differently than in previous
>> development versions. I keep sending you patches for this.
>
> Hopefully, fixed. All fancy whitespace has been removed, including
> in comments with hex values.

Er, I asked for them to be formatted in a standard way. This has
whitespace for minus signs, since that lines up things better and it's
too hard to avoid it when using printf() to format tables. Removing it
gives much larger diffs than before (although I merged a few of the
regressions, I didn't merge them when the formatting was already
standard).

>>> * Compute expm1l(x) for IEEE 754 128-bit format.
>>
>> There is a fairly large bug in this, from only merging half of the
>> most recent micro-optimization in the development version of the ld80
>> version. This might only be an efficiency bug, but I haven't tested
>> the ld128 version with either the full merge or the half merge.
>>
>> The ld128 version still has excessive optimizations for |x| near 0.
>> It uses a slightly different high-degree polynomial on each side of
>> 0. The ld80 version uses the same poly on each side.
>> Most of the style bugs in the 4 exp[!2]l functions are in the coeffs
>> for the polys on each side. I haven't tried so hard to get you to
>> fix them since I want to remove them.
>
> Hopefully, fixed to the extent that I opened ld80/s_expl.c in one
> nedit window and ld128/s_expl.c in another. I copied everything
> from ld80 to ld128 except, of course, literal constants and
> polynomials that must be different.

Seems to be fixed (matches my version). I have barely started testing
my version of it on sparc64.

>> There are some minor style regressions relative to previous
>> development versions outside of poly coeffs. Patches later.
>
> I'm sure you're going to hate the new patch at the end.

Mainly more whitespace regressions :-). Several non-style regressions
for ld128.

> All coefficients are now formatted with the form:
>
> A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>
> i.e., 1 space before and 1 space after =. The space in the comments
> for the implicit + sign has been removed.

As I mentioned above, this is nonstandard and requires manually editing
(and thus messing up) automatically formatted tables. I used to print
the tables not very carefully and had to do lots of editing to match
the style in the source. I got tired of this and changed the printing
routines to prettyprint in a standard format with all the necessary C
syntax, so that I could copy whole tables to the source file. Signs
may or may not be required, and it is easiest to always leave space for
them in the standard format and never edit this to add or remove spaces
for them.

>> Some patches relative to my version now instead of later:
> ...
>> @ @@ -83,9 +83,9 @@
>> @ * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
>> @ */
>> @ -A2 = 0.5,
>> @ -A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
>> @ -A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
>> @ -A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
>> @ -A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>> @ +A2 = 0.5,
>> @ +A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
>> @ +A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
>> @ +A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
>> @ +A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>> @
>> @ /*
>>
>> Fix regressions relative to a previous development version.
>
> I made this conform to style(9).

style(9) only implicitly says not to use fancy formatting for
assignments. indent(1) cannot preserve fancy formatting for
assignments or even be directed how to format assignments. However,
we were intentionally using fancy formatting, and the above was one of
the few places in s_expl.c where it was done consistently (after
backing out regressions). You got the standard fancy formatting by
copying one of my automatically generated sets of coeffs when the
coeff names were P[2-6].

>> @ #endif
>> @ n2 = (unsigned)n % INTERVALS;
>> @ + /* Depend on the sign bit being propagated: */
>> @ k = n >> LOG2_INTERVALS;
>> @ r1 = x - fn * L1;
>>
>> I think a comment is needed. This micro-optimization was merged from
>> s_exp2*.c, where it is commented on more prominently for the long
>> double versions only.
>
> Ignored adding a comment.

It will be in future diffs.

>> @ @@ -389,4 +409,9 @@
>> @ x4 = x2 * x2;
>> @ q = x4 * (x2 * (x4 *
>> @ + /*
>> @ + * XXX the number of terms is no longer good for
>> @ + * pairwise grouping of all except B3, and the
>> @ + * grouping is no longer from highest down.
>> @ + */
>> @ (x2 * B12 + (x * B11 + B10)) +
>> @ (x2 * (x * B9 + B8) + (x * B7 + B6))) +
>
> I left this as-is with whitespace and did not add the comment.
> This should be the only place where there is a substantial
> deviation from style(9).

The comment is a reminder to fix the grouping of terms.

> After making the changes, current unscientific testing gives
> (best viewed in a 95 column window):
> ...
> expm1l
>
> Timing:
>                                  1M        10M        100M
> i386    [-64.0000:-0.1659]   0.435783   4.342621   43.41397

Hmm, only slow on i386. It's still fast for me. Now tested on
Athlon64 and core2.

> i386    [ -0.1659: 0.1659]   0.082880   0.829142    8.28948
> i386    [  0.1659:11356.0]   0.110590   1.096098   10.96253
> amd64   [-64.0000:-0.1659]   0.066751   0.648734    6.46649
> amd64   [ -0.1659: 0.1659]   0.061531   0.614824    6.14377
> amd64   [  0.1659:11356.0]   0.071677   0.716927    7.16819
> sparc64 [-113.000:-0.1659]  37.84224
> sparc64 [ -0.1659: 0.1659]  66.28533
> sparc64 [  0.1659:11356.0]  41.20714
> ...
> Testing on flame is excruciatingly slow, especially because rdivacky
> is building clang.

I handle the normal slowness by reducing the number of tests by a
factor of 100 for long double precision on sparc64. rdivacky only
gave another factor of 3 slowness :-).

> Yes, the following is one massive patch. Easier to apply that way.
> Index: ld80/s_expl.c
> ...
> Index: ld128/s_expl.c
> ...

Too hard to see or describe regressions in these because they are
relative to an old version. Here are my current diffs for ld80:

@ --- z22/s_expl.c Thu May 30 03:56:37 2013
@ +++ ./s_expl.c Thu May 30 04:15:33 2013
@ @@ -63,5 +63,5 @@
@ /* log(2**16384 - 0.5) rounded towards zero: */
@ /* log(2**16384 - 0.5 + 1) rounded towards zero for expm1l() is the same: */
@ -o_thresholdu = LD80C(0xb17217f7d1cf79ab, 13, 11356.5234062941439488L),
@ +o_thresholdu = LD80C(0xb17217f7d1cf79ab, 13, 11356.5234062941439488L),
@ #define o_threshold (o_thresholdu.e)
@ /* log(2**(-16381-64-1)) rounded towards zero: */
@ @@ -75,6 +75,6 @@
@ * bits zero so that multiplication of it by n is exact.
@ */
@ -INV_L = 1.8466496523378731e+2, /* 0x171547652b82fe.0p-45 */
@ -L1 = 5.4152123484527692e-3, /* 0x162e42ff000000.0p-60 */
@ +INV_L = 1.8466496523378731e+2, /* 0x171547652b82fe.0p-45 */
@ +L1 = 5.4152123484527692e-3, /* 0x162e42ff000000.0p-60 */
@ L2 = -3.2819649005320973e-13, /* -0x1718432a1b0e26.0p-94 */
@ /*
@ @@ -83,9 +83,9 @@
@ * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
@ */
@ -A2 = 0.5,
@ -A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
@ -A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
@ -A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
@ -A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
@ +A2 = 0.5,
@ +A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
@ +A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
@ +A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
@ +A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
@
@ /*

As in previous version, with the diff larger since I didn't merge
whitespace changes that are regressions unless the formatting was
already mostly wrong.

@ @@ -273,4 +281,5 @@
@ #endif
@ n2 = (unsigned)n % INTERVALS;
@ + /* Depend on the sign bit being propagated: */
@ k = n >> LOG2_INTERVALS;
@ r1 = x - fn * L1;

As in previous version.
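To make the new comment concrete, here is a small throwaway sketch of
the index split it refers to. INTERVALS and LOG2_INTERVALS are named
as in s_expl.c (INTERVALS = 128 as implied by the ln2/(2*INTERVALS)
comment above); the loop and printf() are only a test harness, not
libm code. The unsigned remainder and the signed shift split n so
that n == k * INTERVALS + n2 with 0 <= n2 < INTERVALS, but the shift
is a floor division only if right-shifting a negative n propagates the
sign bit, which C leaves implementation-defined; that is what the
comment warns about.

#include <stdio.h>

#define	INTERVALS	128
#define	LOG2_INTERVALS	7

int
main(void)
{
	int k, n, n2;

	for (n = -300; n <= 300; n += 37) {
		/*
		 * The cast makes the remainder non-negative (this works
		 * because INTERVALS is a power of 2); the shift must be
		 * arithmetic (sign bit propagated) to give
		 * floor(n / INTERVALS) for negative n.
		 */
		n2 = (unsigned)n % INTERVALS;
		k = n >> LOG2_INTERVALS;
		printf("n = %4d  k = %3d  n2 = %3d  ok = %d\n",
		    n, k, n2, k * INTERVALS + n2 == n);
	}
	return (0);
}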
@ @@ -323,9 +332,19 @@
@ static const double
@ T1 = -0.1659, /* ~-30.625/128 * log(2) */
@ -T2 = 0.1659; /* ~30.625/128 * log(2) */
@ +T2 = 0.1659; /* ~30.625/128 * log(2) */
@
@ /*
@ - * Domain [-0.1659, 0.1659], range ~[-1.2027e-22, 3.4417e-22]:
@ - * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.2
@ + * Domain [-0.1659, 0.1659], range ~[-2.6155e-22, 2.5507e-23]:
@ + * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.6
@ + *
@ + * XXX the coeffs aren't very carefully rounded, and I get 4.5 more bits,
@ + * but unlike for ld128 we can't drop any terms.
@ + *
@ + * XXX this still isn't in standard format:
@ + * - extra digits in exponents for decimal values
@ + * - no spaces to line up equals signs (a new regression)
@ + * - no space for a (not present) minus sign in either the decimal or hex
@ + *   values (a new regression for the LD80C hex values)
@ + * - perhaps they are impossible for double values
@ */
@ static const union IEEEl2bits

Mostly as in previous version. I merged a lot of whitespace
regressions here and only added comments saying that there is more to
fix now.

@ @@ -387,6 +408,10 @@
@ x2 = x * x;
@ x4 = x2 * x2;
@ -

I didn't merge a new whitespace regression.

@ q = x4 * (x2 * (x4 *
@ + /*
@ + * XXX the number of terms is no longer good for
@ + * pairwise grouping of all except B3, and the
@ + * grouping is no longer from highest down.
@ + */
@ (x2 * B12 + (x * B11 + B10)) +
@ (x2 * (x * B9 + B8) + (x * B7 + B6))) +

As in previous version.

@ @@ -434,22 +459,21 @@
@
@ if (k == 0) {
@ - t = tbl[n2].lo * (r1 + 1) + t * q + tbl[n2].hi * r1 +
@ - (tbl[n2].hi - 1);
@ + t = SUM2P(tbl[n2].hi - 1, tbl[n2].lo * (r1 + 1) + t * q +
@ + tbl[n2].hi * r1);
@ RETURNI(t);
@ }
@ -

You don't want most of this, but there is still an extra blank line
here, as in previous version.

@ if (k == -1) {
@ - t = tbl[n2].lo * (r1 + 1) + t * q + tbl[n2].hi * r1 +
@ - (tbl[n2].hi - 2);
@ + t = SUM2P(tbl[n2].hi - 2, tbl[n2].lo * (r1 + 1) + t * q +
@ + tbl[n2].hi * r1);
@ RETURNI(t / 2);
@ }
@
@ if (k < -7) {
@ - t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo + t * (q + r1));
@ RETURNI(t * twopk - 1);
@ }
@
@ if (k > 2 * LDBL_MANT_DIG - 1) {
@ - t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo + t * (q + r1));
@ if (k == LDBL_MAX_EXP)
@ RETURNI(t * 2 * 0x1p16383L - 1);
@ @@ -459,8 +483,9 @@
@ v.xbits.expsign = BIAS - k;
@ twomk = v.e;
@ +

You don't want most of this, but there is now a missing blank line
here. Apparently the extra blank line above was removed here. (The
initialization of twomk was intentionally separated from its use since
the initialization is somewhat special, although it is not commented
on like the initialization of twopk.)

@ if (k > LDBL_MANT_DIG - 1)
@ - t = tbl[n2].lo - twomk + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo - twomk + t * (q + r1));
@ else
@ - t = tbl[n2].lo + t * (q + r1) + (tbl[n2].hi - twomk);
@ + t = SUM2P(tbl[n2].hi - twomk, tbl[n2].lo + t * (q + r1));
@ RETURNI(t * twopk);
@ }

Summary of my current diffs for ld128 (the full diffs are hard to
untangle). There are a couple of more serious regressions, which
these patches reverse. No comments on formatting. No patches for
things done last year.

% --- z22/s_expl.c Thu May 30 04:21:49 2013
% +++ ./s_expl.c Thu May 30 04:59:06 2013
% ...
% @@ -252,6 +289,7 @@
% /* Reduce x to (k*ln2 + endpoint[n2] + r1 + r2). */
% /* Use a specialized rint() to get fn. Assume round-to-nearest. */
% + /* XXX assume no extra precision for the additions, as for trig fns. */
% + /* XXX this set of comments is now quadruplicated. */
% fn = (double)x * INV_L + 0x1.8p52 - 0x1.8p52;
% - r = x - fn * L1 - fn * L2; /* r = r1 + r2 done independently. */

This undoes a regression to the ld80 version. Initializing r using
extra operations here is an optimization for the ld80 version (really
for x86 and/or out-of-order (OOE) CPUs with fast pipelines). It is a
huge pessimization to do 2 extra long double multiplications on
sparc64, so it was not done in the ld128 version.

% #if defined(HAVE_EFFICIENT_IRINT)
% n = irint(fn);
% @@ -263,6 +301,8 @@
% r1 = x - fn * L1;
% r2 = fn * -L2;
% + r = r1 + r2;

Finish undoing the regression.

%
% /* Prepare scale factors. */
% + /* XXX sparc64 multiplication is so slow that scalbnl() is faster. */
% v.e = 1;
% if (k >= LDBL_MIN_EXP) {

Undo the regression of losing an important optimization hint. The
x86ish optimization of using a multiplication to scale is not as bad
on sparc64 as the one above, but it is still so bad that scalbnl() is
better. The old fdlibm scaling method should be used instead of
either of these (it is a specialized scalbnl() manually inlined).

% @@ -303,11 +343,20 @@
% static const double
% T1 = -0.1659, /* ~-30.625/128 * log(2) */
% -T2 = 0.1659; /* ~30.625/128 * log(2) */
% +T2 = 0.1659; /* ~30.625/128 * log(2) */
%
% /*
% * Split the interval [T1:T2] into two intervals [T1:T3] and [T3:T2].
% - * Setting T3 to 0 would require the |x| < 0x1p-113 condition to appear
% + * Setting T3 to 0 would require the |x| < 0x1p-113 condition to appear
% * in both subintervals, so set T3 = 2**-5, which places the condition
% * into the [T1:T3] interval.
% + *
% + * XXX the above comment has rotted. The condition is now tested for
% + * both subintervals (although with T3 nonzero it is only satisfied for
% + * [T1:T3]). However, it is now even more critical for other reasons
% + * that T3 not be in the middle. We now do this so that the polys
% + * for each side can have almost the same degree. It may be slightly
% + * misplaced, since the C poly has ended up 1 degree higher.
% + *
% + * XXX these micro-optimizations are excessive.
% */
% static const double

I'm not sure if the change to test the condition for both intervals is
good, but it makes the first paragraph of the comment completely
wrong.

Bruce
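A note on the "old fdlibm scaling method" mentioned above: it scales
by 2**k by adding k directly to the stored biased exponent, i.e. a
manually inlined scalbn(), instead of multiplying by a constructed
power of 2 or calling scalbnl(). The sketch below only shows the idea
for plain doubles; it is not the libm code, the function name is made
up, and it omits the overflow/underflow/subnormal handling that a real
implementation needs.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scale a normal x by 2**k by adjusting the exponent field directly. */
static double
scale2k(double x, int k)
{
	uint64_t bits;

	memcpy(&bits, &x, sizeof(bits));
	bits += (uint64_t)k << 52;	/* 11-bit exponent starts at bit 52 */
	memcpy(&x, &bits, sizeof(x));
	return (x);
}

int
main(void)
{

	/* Prints 1536 and 0.00146484 (1.5 scaled by 2**10 and 2**-10). */
	printf("%g %g\n", scale2k(1.5, 10), scale2k(1.5, -10));
	return (0);
}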