From owner-freebsd-numerics@FreeBSD.ORG Wed May 29 20:25:41 2013
Date: Thu, 30 May 2013 06:25:31 +1000 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl
Cc: freebsd-numerics@freebsd.org, Bruce Evans
Subject: Re: Patches for s_expl.c
In-Reply-To: <20130529162441.GA58773@troutmask.apl.washington.edu>
Message-ID: <20130530045951.Y4776@besplex.bde.org>
References: <20130528172242.GA51485@troutmask.apl.washington.edu>
 <20130529062437.V4648@besplex.bde.org>
 <20130529162441.GA58773@troutmask.apl.washington.edu>
List-Id: "Discussions of high quality implementation of libm functions."

On Wed, 29 May 2013, Steve Kargl wrote:

> On Wed, May 29, 2013 at 07:39:04AM +1000, Bruce Evans wrote:
>> On Tue, 28 May 2013, Steve Kargl wrote:
>>
>>> Here are two patches for ld80/s_expl.c and ld128/s_expl.c.
>>> Instead of committing the one large patch that I have spent
>>> hours testing, I have split it into two. One patch fixes/updates
>>> expl(). The other patch is the implementation of expm1l().
> ...
>>> * Rename polynomial coefficients P# to A#, which is used in Tang.
>>
>> Almost all the declarations of polynomial coefficients are still
>> formatted in a nonstandard way, but differently than in previous
>> development versions. I keep sending you patches for this.
>
> Hopefully, fixed. All fancy whitespace has been removed, including
> in comments with hex values.

Er, I asked for them to be formatted in a standard way. This has
whitespace for minus signs, since that lines up things better and it's
too hard to avoid it when using printf() to format tables. Removing it
gives much larger diffs than before (although I merged a few of the
regressions, I didn't merge them when the formatting was already
standard).

>>> * Compute expm1l(x) for IEEE 754 128-bit format.
>>
>> There is a fairly large bug in this, from only merging half of the
>> most recent micro-optimization in the development version of the ld80
>> version. This might only be an efficiency bug, but I haven't tested
>> the ld128 version with either the full merge or the half merge.
>>
>> The ld128 version still has excessive optimizations for |x| near 0.
>> It uses a slightly different high-degree polynomial on each side of
>> 0. The ld80 version uses the same poly on each side.
>> Most of the style bugs in the 4 exp[!2]l functions are in the coeffs
>> for the polys on each side. I haven't tried so hard to get you to
>> fix them since I want to remove them.
>
> Hopefully, fixed to the extent that I opened ld80/s_expl.c in one
> nedit window and ld128/s_expl.c in another. I copied everything
> from ld80 to ld128 except, of course, literal constants and
> polynomials that must be different.

Seems to be fixed (matches my version). I have barely started testing
my version of it on sparc64.

>> There are some minor style regressions relative to previous
>> development versions outside of poly coeffs. Patches later.
>
> I'm sure you're going to hate the new patch at the end.

Mainly more whitespace regressions :-). Several non-style regressions
for ld128.

> All coefficients are now formatted with the form:
>
> A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>
> i.e., 1 space before and 1 space after =. The space in the comments
> for the implicit + sign has been removed.

As I mentioned above, this is nonstandard and requires manually editing
(and thus messing up) automatically formatted tables. I used to print
the tables not very carefully and had to do lots of editing to match
the style in the source. I got tired of this and changed the printing
routines to prettyprint in a standard format with all the necessary C
syntax, so that I could copy whole tables to the source file. Signs
may or may not be required, and it is easiest to always leave space for
them in the standard format and never edit this to add or remove spaces
for them.

>> Some patches relative to my version now instead of later:
> ...
>> @ @@ -83,9 +83,9 @@
>> @ * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
>> @ */
>> @ -A2 = 0.5,
>> @ -A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
>> @ -A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
>> @ -A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
>> @ -A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>> @ +A2 = 0.5,
>> @ +A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
>> @ +A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
>> @ +A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
>> @ +A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
>> @
>> @ /*
>>
>> Fix regressions relative to a previous development version.
>
> I made this conform to style(9).

style(9) only implicitly says not to use fancy formatting for
assignments. indent(1) cannot preserve fancy formatting for
assignments or even be directed how to format assignments. However,
we were intentionally using fancy formatting, and the above was one of
the few places in s_expl.c where it was done consistently (after
backing out regressions). You got the standard fancy formatting by
copying one of my automatically generated sets of coeffs when the
coeff names were P[2-6].

>> @ #endif
>> @ n2 = (unsigned)n % INTERVALS;
>> @ + /* Depend on the sign bit being propagated: */
>> @ k = n >> LOG2_INTERVALS;
>> @ r1 = x - fn * L1;
>>
>> I think a comment is needed. This micro-optimization was merged from
>> s_exp2*.c, where it is commented on more prominently for the long
>> double versions only.
>
> Ignored adding a comment.

It will be in future diffs.

>> @ @@ -389,4 +409,9 @@
>> @ x4 = x2 * x2;
>> @ q = x4 * (x2 * (x4 *
>> @ + /*
>> @ + * XXX the number of terms is no longer good for
>> @ + * pairwise grouping of all except B3, and the
>> @ + * grouping is no longer from highest down.
>> @ + */
>> @ (x2 * B12 + (x * B11 + B10)) +
>> @ (x2 * (x * B9 + B8) + (x * B7 + B6))) +
>
> I left this as-is with whitespace and did not add the comment.
> This should be the only place where there is a substantial
> deviation from style(9).

The comment is a reminder to fix the grouping of terms.

> After making the changes, current unscientific testing gives
> (best viewed in a 95 column window):
> ...
> expm1l
>
> Timing:
>                                  1M        10M        100M
> i386    [-64.0000:-0.1659]   0.435783   4.342621   43.41397

Hmm, only slow on i386. It's still fast for me. Now tested on
Athlon64 and core2.

> i386    [ -0.1659: 0.1659]   0.082880   0.829142    8.28948
> i386    [  0.1659:11356.0]   0.110590   1.096098   10.96253
> amd64   [-64.0000:-0.1659]   0.066751   0.648734    6.46649
> amd64   [ -0.1659: 0.1659]   0.061531   0.614824    6.14377
> amd64   [  0.1659:11356.0]   0.071677   0.716927    7.16819
> sparc64 [-113.000:-0.1659]  37.84224
> sparc64 [ -0.1659: 0.1659]  66.28533
> sparc64 [  0.1659:11356.0]  41.20714
> ...
> Testing on flame is excruciatingly slow, especially because rdivacky
> is building clang.

I handle the normal slowness by reducing the number of tests by a
factor of 100 for long double precision on sparc64. rdivacky only
gave another factor of 3 slowness :-).

> Yes, the following is one massive patch. Easier to apply that way.
> Index: ld80/s_expl.c
> ...
> Index: ld128/s_expl.c
> ...

Too hard to see or describe regressions in these because they are
relative to an old version. Here are my current diffs for ld80:

@ --- z22/s_expl.c Thu May 30 03:56:37 2013
@ +++ ./s_expl.c Thu May 30 04:15:33 2013
@ @@ -63,5 +63,5 @@
@ /* log(2**16384 - 0.5) rounded towards zero: */
@ /* log(2**16384 - 0.5 + 1) rounded towards zero for expm1l() is the same: */
@ -o_thresholdu = LD80C(0xb17217f7d1cf79ab, 13, 11356.5234062941439488L),
@ +o_thresholdu = LD80C(0xb17217f7d1cf79ab, 13, 11356.5234062941439488L),
@ #define o_threshold (o_thresholdu.e)
@ /* log(2**(-16381-64-1)) rounded towards zero: */
@ @@ -75,6 +75,6 @@
@ * bits zero so that multiplication of it by n is exact.
@ */
@ -INV_L = 1.8466496523378731e+2, /* 0x171547652b82fe.0p-45 */
@ -L1 = 5.4152123484527692e-3, /* 0x162e42ff000000.0p-60 */
@ +INV_L = 1.8466496523378731e+2, /* 0x171547652b82fe.0p-45 */
@ +L1 = 5.4152123484527692e-3, /* 0x162e42ff000000.0p-60 */
@ L2 = -3.2819649005320973e-13, /* -0x1718432a1b0e26.0p-94 */
@ /*
@ @@ -83,9 +83,9 @@
@ * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
@ */
@ -A2 = 0.5,
@ -A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
@ -A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
@ -A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
@ -A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
@ +A2 = 0.5,
@ +A3 = 1.6666666666666119e-1, /* 0x15555555555490.0p-55 */
@ +A4 = 4.1666666666665887e-2, /* 0x155555555554e5.0p-57 */
@ +A5 = 8.3333354987869413e-3, /* 0x1111115b789919.0p-59 */
@ +A6 = 1.3888891738560272e-3; /* 0x16c16c651633ae.0p-62 */
@
@ /*

As in previous version, with the diff larger since I didn't merge
whitespace changes that are regressions unless the formatting was
already mostly wrong.

@ @@ -273,4 +281,5 @@
@ #endif
@ n2 = (unsigned)n % INTERVALS;
@ + /* Depend on the sign bit being propagated: */
@ k = n >> LOG2_INTERVALS;
@ r1 = x - fn * L1;

As in previous version.
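To make the new comment concrete, here is a small throwaway sketch of
the index split it refers to. INTERVALS and LOG2_INTERVALS are named
as in s_expl.c (INTERVALS = 128 as implied by the ln2/(2*INTERVALS)
comment above); the loop and printf() are only a test harness, not
libm code. The unsigned remainder and the signed shift split n so
that n == k * INTERVALS + n2 with 0 <= n2 < INTERVALS, but the shift
is a floor division only if right-shifting a negative n propagates the
sign bit, which C leaves implementation-defined; that is what the
comment warns about.

#include <stdio.h>

#define	INTERVALS	128
#define	LOG2_INTERVALS	7

int
main(void)
{
	int k, n, n2;

	for (n = -300; n <= 300; n += 37) {
		/*
		 * The cast makes the remainder non-negative (this works
		 * because INTERVALS is a power of 2); the shift must be
		 * arithmetic (sign bit propagated) to give
		 * floor(n / INTERVALS) for negative n.
		 */
		n2 = (unsigned)n % INTERVALS;
		k = n >> LOG2_INTERVALS;
		printf("n = %4d  k = %3d  n2 = %3d  ok = %d\n",
		    n, k, n2, k * INTERVALS + n2 == n);
	}
	return (0);
}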
@ @@ -323,9 +332,19 @@
@ static const double
@ T1 = -0.1659, /* ~-30.625/128 * log(2) */
@ -T2 = 0.1659; /* ~30.625/128 * log(2) */
@ +T2 = 0.1659; /* ~30.625/128 * log(2) */
@
@ /*
@ - * Domain [-0.1659, 0.1659], range ~[-1.2027e-22, 3.4417e-22]:
@ - * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.2
@ + * Domain [-0.1659, 0.1659], range ~[-2.6155e-22, 2.5507e-23]:
@ + * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.6
@ + *
@ + * XXX the coeffs aren't very carefully rounded, and I get 4.5 more bits,
@ + * but unlike for ld128 we can't drop any terms.
@ + *
@ + * XXX this still isn't in standard format:
@ + * - extra digits in exponents for decimal values
@ + * - no spaces to line up equals signs (a new regression)
@ + * - no space for a (not present) minus sign in either the decimal or hex
@ + *   values (a new regression for the LD80C hex values)
@ + * - perhaps they are impossible for double values
@ */
@ static const union IEEEl2bits

Mostly as in previous version. I merged a lot of whitespace
regressions here and only added comments saying that there is more to
fix now.

@ @@ -387,6 +408,10 @@
@ x2 = x * x;
@ x4 = x2 * x2;
@ -

I didn't merge a new whitespace regression.

@ q = x4 * (x2 * (x4 *
@ + /*
@ + * XXX the number of terms is no longer good for
@ + * pairwise grouping of all except B3, and the
@ + * grouping is no longer from highest down.
@ + */
@ (x2 * B12 + (x * B11 + B10)) +
@ (x2 * (x * B9 + B8) + (x * B7 + B6))) +

As in previous version.

@ @@ -434,22 +459,21 @@
@
@ if (k == 0) {
@ - t = tbl[n2].lo * (r1 + 1) + t * q + tbl[n2].hi * r1 +
@ - (tbl[n2].hi - 1);
@ + t = SUM2P(tbl[n2].hi - 1, tbl[n2].lo * (r1 + 1) + t * q +
@ + tbl[n2].hi * r1);
@ RETURNI(t);
@ }
@ -

You don't want most of this, but there is still an extra blank line
here, as in previous version.

@ if (k == -1) {
@ - t = tbl[n2].lo * (r1 + 1) + t * q + tbl[n2].hi * r1 +
@ - (tbl[n2].hi - 2);
@ + t = SUM2P(tbl[n2].hi - 2, tbl[n2].lo * (r1 + 1) + t * q +
@ + tbl[n2].hi * r1);
@ RETURNI(t / 2);
@ }
@
@ if (k < -7) {
@ - t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo + t * (q + r1));
@ RETURNI(t * twopk - 1);
@ }
@
@ if (k > 2 * LDBL_MANT_DIG - 1) {
@ - t = tbl[n2].lo + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo + t * (q + r1));
@ if (k == LDBL_MAX_EXP)
@ RETURNI(t * 2 * 0x1p16383L - 1);
@ @@ -459,8 +483,9 @@
@ v.xbits.expsign = BIAS - k;
@ twomk = v.e;
@ +

You don't want most of this, but there is now a missing blank line
here. Apparently the extra blank line above was removed here. (The
initialization of twomk was intentionally separated from its use since
the initialization is somewhat special, although it is not commented
on like the initialization of twopk.)

@ if (k > LDBL_MANT_DIG - 1)
@ - t = tbl[n2].lo - twomk + t * (q + r1) + tbl[n2].hi;
@ + t = SUM2P(tbl[n2].hi, tbl[n2].lo - twomk + t * (q + r1));
@ else
@ - t = tbl[n2].lo + t * (q + r1) + (tbl[n2].hi - twomk);
@ + t = SUM2P(tbl[n2].hi - twomk, tbl[n2].lo + t * (q + r1));
@ RETURNI(t * twopk);
@ }

Summary of my current diffs for ld128 (the full diffs are hard to
untangle). There are a couple of more serious regressions, which
these patches reverse. No comments on formatting. No patches for
things done last year.

% --- z22/s_expl.c Thu May 30 04:21:49 2013
% +++ ./s_expl.c Thu May 30 04:59:06 2013
% ...
% @@ -252,6 +289,7 @@
% /* Reduce x to (k*ln2 + endpoint[n2] + r1 + r2). */
% /* Use a specialized rint() to get fn. Assume round-to-nearest. */
% + /* XXX assume no extra precision for the additions, as for trig fns. */
% + /* XXX this set of comments is now quadruplicated. */
% fn = (double)x * INV_L + 0x1.8p52 - 0x1.8p52;
% - r = x - fn * L1 - fn * L2; /* r = r1 + r2 done independently. */

This undoes a regression to the ld80 version. Initializing r using
extra operations here is an optimization for the ld80 version (really
for x86 and/or out-of-order (OOE) CPUs with fast pipelines). It is a
huge pessimization to do 2 extra long double multiplications on
sparc64, so it was not done in the ld128 version.

% #if defined(HAVE_EFFICIENT_IRINT)
% n = irint(fn);
% @@ -263,6 +301,8 @@
% r1 = x - fn * L1;
% r2 = fn * -L2;
% + r = r1 + r2;

Finish undoing the regression.

%
% /* Prepare scale factors. */
% + /* XXX sparc64 multiplication is so slow that scalbnl() is faster. */
% v.e = 1;
% if (k >= LDBL_MIN_EXP) {

Undo the regression of losing an important optimization hint. The
x86ish optimization of using a multiplication to scale is not as bad
on sparc64 as the one above, but it is still so bad that scalbnl() is
better. The old fdlibm scaling method should be used instead of
either of these (it is a specialized scalbnl() manually inlined).

% @@ -303,11 +343,20 @@
% static const double
% T1 = -0.1659, /* ~-30.625/128 * log(2) */
% -T2 = 0.1659; /* ~30.625/128 * log(2) */
% +T2 = 0.1659; /* ~30.625/128 * log(2) */
%
% /*
% * Split the interval [T1:T2] into two intervals [T1:T3] and [T3:T2].
% - * Setting T3 to 0 would require the |x| < 0x1p-113 condition to appear
% + * Setting T3 to 0 would require the |x| < 0x1p-113 condition to appear
% * in both subintervals, so set T3 = 2**-5, which places the condition
% * into the [T1:T3] interval.
% + *
% + * XXX the above comment has rotted. The condition is now tested for
% + * both subintervals (although with T3 nonzero it is only satisfied for
% + * [T1:T3]). However, it is now even more critical for other reasons
% + * that T3 not be in the middle. We now do this so that the polys
% + * for each side can have almost the same degree. It may be slightly
% + * misplaced, since the C poly has ended up 1 degree higher.
% + *
% + * XXX these micro-optimizations are excessive.
% */
% static const double

I'm not sure if the change to test the condition for both intervals is
good, but it makes the first paragraph of the comment completely
wrong.

Bruce
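A note on the "old fdlibm scaling method" mentioned above: it scales
by 2**k by adding k directly to the stored biased exponent, i.e. a
manually inlined scalbn(), instead of multiplying by a constructed
power of 2 or calling scalbnl(). The sketch below only shows the idea
for plain doubles; it is not the libm code, the function name is made
up, and it omits the overflow/underflow/subnormal handling that a real
implementation needs.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scale a normal x by 2**k by adjusting the exponent field directly. */
static double
scale2k(double x, int k)
{
	uint64_t bits;

	memcpy(&bits, &x, sizeof(bits));
	bits += (uint64_t)k << 52;	/* 11-bit exponent starts at bit 52 */
	memcpy(&x, &bits, sizeof(x));
	return (x);
}

int
main(void)
{

	/* Prints 1536 and 0.00146484 (1.5 scaled by 2**10 and 2**-10). */
	printf("%g %g\n", scale2k(1.5, 10), scale2k(1.5, -10));
	return (0);
}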