From owner-freebsd-numerics@FreeBSD.ORG  Tue May 28 21:58:38 2013
Return-Path: <owner-freebsd-numerics@FreeBSD.ORG>
Delivered-To: freebsd-numerics@freebsd.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
 by hub.freebsd.org (Postfix) with ESMTP id 11750D17
 for <freebsd-numerics@freebsd.org>; Tue, 28 May 2013 21:58:38 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au
 [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 8FE9FD30
 for <freebsd-numerics@freebsd.org>; Tue, 28 May 2013 21:58:37 +0000 (UTC)
Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au
 (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23])
 by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 8A74B122EDD;
 Wed, 29 May 2013 07:39:12 +1000 (EST)
Date: Wed, 29 May 2013 07:39:04 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
Subject: Re: Patches for s_expl.c
In-Reply-To: <20130528172242.GA51485@troutmask.apl.washington.edu>
Message-ID: <20130529062437.V4648@besplex.bde.org>
References: <20130528172242.GA51485@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.0 cv=e/de0tV/ c=1 sm=1 a=kj9zAlcOel0A:10
 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=SI7jfN9uMIUA:10
 a=enA2T3gqEfefmBwEoGAA:9 a=CjuIK1q_8ugA:10 a=tJtbpcaLiRwA:10
 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117
Cc: freebsd-numerics@freebsd.org
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 May 2013 21:58:38 -0000

On Tue, 28 May 2013, Steve Kargl wrote:

> Here are two patches for ld80/s_expl.c and ld128/s_expl.c.
> Instead of committing the one large patch that I have spent
> hours testing, I have split it into two.  One patch fixes/updates
> expl().  The other patch is the implementation of expm1l().
>
> My commit messages will be:
>
> Patch 1:
>
>   ld80/s_expl.c:
>
>   * Use the LOG2_INTERVALS macro instead of hardcoding 7.

The use of LOG2_INTERVALS isn't merged into the ld128 version.  Patch 2
merges its use for expm1l() only.

>   * Use LD80C to set overflow and underflow thresholds, and then use
>     #defines to access the .e component to reduce diffs with ld128 version.
>   * Rename polynomial coefficients P# to A#, which is used in Tang.

Almost all the declarations polynomial coefficients are still formatted
in a nonstandard way, but differently than in previous development
versions.  I keep sending you patches for this.

>   * Remove the use of intermediate results t23 and t45.
>   * Micro-optimization: remove access to u.xbits.man.

On the same line(s) that LOG2_INTERVALS is used, there is a more
important micro-optimization than this one.

>   * Fix an off-by-one in the underflow case.
>   * Replace a factor the long double constant 2.0L by the integer 2.  Let
>     the compiler to the conversion.
>
>   ld128/s_expl.c:
>
>   * Adjust Copyright years to reflect when bits of the code were actually
>     written.
>   * Reduce diff between the ld80 and ld128 versions.
>
> Patch 2:
>
>   ld80/s_expl.c:
>
>   * Compute expm1l(x) for Intel 80-bit format.
>
>   ld128/s_expl.c:
>
>   * Compute expm1l(x) for IEEE 754 128-bit format.

There is a fairly large bug in this, from only merging half of the
most recent micro-optimization in the development version of the ld80
version.  This might only be an efficiency bug, but I haven't tested
the ld128 version with either the full merge or the half merge.

The ld128 version still has excessive optimizations for |x| near 0.
It uses a slightly different high-degree polynomial on each side of
0.  The ld80 version uses the same poly on each side.  Most of the
style bugs in the 4 exp[!2]l functions are in the coeffs for the
polys on each side.  I haven't tried so hard to get you to fix them
since I want to remove them.

>
>   These are based on:
>
>   PTP Tang, "Table-driven implementation of the Expm1 function
>   in IEEE floating-point arithmetic," ACM Trans. Math. Soft., 18,
>   211-222 (1992).
>
> These commit logs may be too terse for some, but quite frankly after
> 2 or 3 years of submitting and resubmitting diffs, I've forgotten
> why some changes have or have not been made.
>
> expm1l() resides in s_expl.c because she shares the same table,
> polynomial coefficients, and some numerical constants with expl().

There are some minor style regressions relative to previous development
versions outside of poly coeffs.  Patches later.

> Index: ld80/s_expl.c
> ===================================================================
> --- ld80/s_expl.c	(revision 251062)
> +++ ld80/s_expl.c	(working copy)
> ...
> @@ -78,11 +82,11 @@
>  * |exp(x) - p(x)| < 2**-77.2
>  * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
>  */
> -P2 =  0.5,
> -P3 =  1.6666666666666119e-1,		/*  0x15555555555490.0p-55 */
> -P4 =  4.1666666666665887e-2,		/*  0x155555555554e5.0p-57 */
> -P5 =  8.3333354987869413e-3,		/*  0x1111115b789919.0p-59 */
> -P6 =  1.3888891738560272e-3;		/*  0x16c16c651633ae.0p-62 */
> +A2  = 0.5,
> +A3  = 1.6666666666666119e-1,		/*  0x15555555555490.0p-55 */
> +A4  = 4.1666666666665887e-2,		/*  0x155555555554e5.0p-57 */
> +A5  = 8.3333354987869413e-3,		/*  0x1111115b789919.0p-59 */
> +A6  = 1.3888891738560272e-3;		/*  0x16c16c651633ae.0p-62 */

Example of a formatting regression.  The extra space that was before the
values is for a possible minus sign.  This space is still there for the
hex values.  The extra space before the equals sign is used for fancy
formatting to line up the values when the variable names reach A10.  Since
thee variable names only reach A6, this is not needed.

> ...
> @@ -242,23 +247,21 @@
> 	ix = hx & 0x7fff;
> 	if (ix >= BIAS + 13) {		/* |x| >= 8192 or x is NaN */
> 		if (ix == BIAS + LDBL_MAX_EXP) {
> -			if (hx & 0x8000 && u.xbits.man == 1ULL << 63)
> -				return (0.0L);	/* x is -Inf */
> -			return (x + x); /* x is +Inf, NaN or unsupported */
> +			if (hx & 0x8000)  /* x is -Inf, -NaN or unsupported */
> +				return (-1 / x);

Micro-optimization here.

> ...
> @@ -270,12 +273,12 @@
> 	n  = (int)fn;
> #endif
> 	n2 = (unsigned)n % INTERVALS;
> -	k = (n - n2) / INTERVALS;
> +	k = n >> LOG2_INTERVALS;
> 	r1 = x - fn * L1;
> -	r2 = -fn * L2;
> +	r2 = fn * -L2;

2 micro-optimizations.

> Index: ld128/s_expl.c
> ===================================================================
> --- ld128/s_expl.c	(revision 251062)
> +++ ld128/s_expl.c	(working copy)
> ...
> @@ -38,34 +40,56 @@
> #include "math_private.h"
>
> #define	INTERVALS	128
> +#define	LOG2_INTERVALS	7

Not used.

> ...
> 	n2 = (unsigned)n % INTERVALS;
> 	k = (n - n2) / INTERVALS;
> 	r1 = x - fn * L1;
> -	r2 = -fn * L2;
> +	r2 = fn * -L2;
> +	r = r1 + r2;

1 micro-optimization (that uses LOG2_INTERVALS) not merrged here.

> @@ -244,18 +276,19 @@
> 		twopkp10000 = v.e;
> 	}
>
> -	r = r1 + r2;
> -	q = r * r * (P2 + r * (P3 + r * (P4 + r * (P5 + r * (P6 + r * (P7 +
> -	    r * (P8 + r * (P9 + r * (P10 + r * P11)))))))));
> +	/* Evaluate expl(endpoint[n2] + r1 + r2) = s[n2] * expl(r1 + r2). */
> +	dr = r;
> +	q = r2 + r * r * (A2 + r * (A3 + r * (A4 + r * (A5 + r * (A6 +
> +	    dr * (A7 + dr * (A8 + dr * (A9 + dr * A10))))))));

Macro-optimizations here.  Quite different from the ld80 ones.  The grouping
of terms was already quite different.  This merges a macro-optimization
technique from das's old work on the ld128 logl -- evaluate terms in double
precision if possible, since long double precision is so slow on sparc64
(about 1000 times slower than long double precision on x86.  Only hundreds
of times slower than double precision on sparc64).

> Patch 2:
>
> --- ld80/s_expl.c	2013-05-28 09:36:27.000000000 -0700
> +++ ld80/s_expl.c.all	2013-05-28 09:34:41.000000000 -0700
> @@ -302,3 +302,166 @@
> ...
> +	if (k == 0) {
> +		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
> +		    (s[n2].hi - 1);
> +		RETURNI(t);
> +	}
> +
> +	if (k == -1) {
> +		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
> +		    (s[n2].hi - 2);
> +		RETURNI(t / 2);
> +	}

Some cases are optimized here.

> ...
> +	if (k > LDBL_MANT_DIG - 1)
> +		t = s[n2].lo - twomk + t * (q + r1) + s[n2].hi;
> +	else
> +		t = s[n2].lo + t * (q + r1)  + (s[n2].hi - twomk);

The last statement isn't accurate enough for k = 0 and k = -1, so
handling of those cases were moved earlier so that this statement
could be optimized to what it is now.  The ld128 version is missing
this.

> ...
> --- ld128/s_expl.c	2013-05-28 09:36:11.000000000 -0700
> +++ ld128/s_expl.c.all	2013-05-28 09:34:52.000000000 -0700
> ...
> +	if (k == 0) {
> +		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
> +		    (s[n2].hi - 1);
> +		RETURNI(t);
> +	}
> +
> +	if (k == -1) {
> +		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
> +		    (s[n2].hi - 2);
> +		RETURNI(t / 2);
> +	}
> +
> +

Same as for ld808, except for 2 style bugs instead of 1 (1 more extra
blank line).

> +	if (k > LDBL_MANT_DIG - 1)
> +		t = s[n2].lo - twomk + t * (q + r1) + s[n2].hi;
> +	else if (k < 1)
> +		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
> +		   (s[n2].hi - twomk);
> +	else
> +		t = s[n2].lo * (q + r1 + 1) + s[n2].hi * (q + r1) +
> +		    (s[n2].hi - twomk);

Not the same as for ld128.  Still has the old slower code, so it probably
still works, but even more slowly than before except for k == 0 and k == -1,
since there are extra branches to filter out those values.

Some patches relative to my version now instead of later:

@ --- z22/s_expl.c	Wed May 29 04:48:10 2013
@ +++ ./s_expl.c	Wed May 29 06:16:29 2013
@ @@ -30,5 +30,5 @@
@  __FBSDID("$FreeBSD: src/lib/msun/ld80/s_expl.c,v 1.10 2012/10/13 19:53:11 kargl Exp $");
@ 
@ -/*-
@ +/**
@   * Compute the exponential of x for Intel 80-bit format.  This is based on:
@   *

This ugliness is now required by style(9) :-(.  You only made this change in
some places places.

The indent protection '/*-' was subverted to mean a copyright markup.  Its
previously-KNF use for non-copyrights was purged in some places but not
all.  It is still used extensively for non-copyrights in kern/kern_prot.c.

@ @@ -83,9 +83,9 @@
@   * (0.002708 is ln2/(2*INTERVALS) rounded up a little).
@   */
@ -A2  = 0.5,
@ -A3  = 1.6666666666666119e-1,		/*  0x15555555555490.0p-55 */
@ -A4  = 4.1666666666665887e-2,		/*  0x155555555554e5.0p-57 */
@ -A5  = 8.3333354987869413e-3,		/*  0x1111115b789919.0p-59 */
@ -A6  = 1.3888891738560272e-3;		/*  0x16c16c651633ae.0p-62 */
@ +A2 =  0.5,
@ +A3 =  1.6666666666666119e-1,		/*  0x15555555555490.0p-55 */
@ +A4 =  4.1666666666665887e-2,		/*  0x155555555554e5.0p-57 */
@ +A5 =  8.3333354987869413e-3,		/*  0x1111115b789919.0p-59 */
@ +A6 =  1.3888891738560272e-3;		/*  0x16c16c651633ae.0p-62 */
@ 
@  /*

Fix regressions relative to a previous development version.

@ @@ -267,11 +275,12 @@
@  	r = x - fn * L1 - fn * L2;	/* r = r1 + r2 done independently. */
@  #if defined(HAVE_EFFICIENT_IRINTL)
@ -	n  = irintl(fn);
@ +	n = irintl(fn);
@  #elif defined(HAVE_EFFICIENT_IRINT)
@ -	n  = irint(fn);
@ +	n = irint(fn);
@  #else
@ -	n  = (int)fn;
@ +	n = (int)fn;

Fix more regressions.

@  #endif
@  	n2 = (unsigned)n % INTERVALS;
@ +	/* Depend on the sign bit being propagated: */
@  	k = n >> LOG2_INTERVALS;
@  	r1 = x - fn * L1;

I think a comment is needed.  This micro-optimization was merged from
s_exp2*.c, where it is commented on more prominently for the long
double versions only.

@ @@ -327,6 +336,15 @@
@ 
@  /*
@ - * Domain [-0.1659, 0.1659], range ~[-1.2027e-22, 3.4417e-22]:
@ - * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.2
@ + * Domain [-0.1659, 0.1659], range ~[-2.6155e-22, 2.5507e-23]:
@ + * |(exp(x)-1-x-x**2/2)/x - p(x)| < 2**-71.6

The coeffs were improved a little, but the comment wasn't updated to match.

@ + *
@ + * XXX the coeffs aren't very carefully rounded, and I get 4.5 more bits,
@ + * but unlike for ld128 we can't drop any terms.
@ + *
@ + * XXX this still isn't in standard format:
@ + * - extra digits in exponents for decimal values
@ + * - no space for a (not present) minus sign in either the decimal or hex
@ + *   values
@ + * - perhaps they are impossible for double values
@   */
@  static const union IEEEl2bits

The coeffs have lots of style bugs, though not as many as for ld128.

I'm not sure where the latest set of B coeffs came from.  Looks like
you improved your generation of them.  You still seem to minimize the
absolute error.  This gives larger than necessary relative errors,
especially near the endpoints.  I think I wrote the new and old
versions of the comment about the domain and range.  I take a proposed
set of coeffs and plot the relative error of the function given by them,
then copy the results to the comment.

@ @@ -389,4 +409,9 @@
@  		x4 = x2 * x2;
@  		q = x4 * (x2 * (x4 *
@ +		    /*
@ +		     * XXX the number of terms is no longer good for
@ +		     * pairwise grouping of all except B3, and the
@ +		     * grouping is no longer from highest down.
@ +		     */
@  		    (x2 *            B12  + (x * B11 + B10)) +
@  		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
@ @@ -407,9 +432,9 @@
@  	fn = x * INV_L + 0x1.8p63 - 0x1.8p63;
@  #if defined(HAVE_EFFICIENT_IRINTL)
@ -	n  = irintl(fn);
@ +	n = irintl(fn);
@  #elif defined(HAVE_EFFICIENT_IRINT)
@ -	n  = irint(fn);
@ +	n = irint(fn);
@  #else
@ -	n  = (int)fn;
@ +	n = (int)fn;
@  #endif
@  	n2 = (unsigned)n % INTERVALS;
@ @@ -434,22 +459,21 @@
@ 
@  	if (k == 0) {
@ -		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 +
@ -		    (s[n2].hi - 1);
@ +		t = SUM2P(s[n2].hi - 1, s[n2].lo * (r1 + 1) + t * q +
@ +		    s[n2].hi * r1);
@  		RETURNI(t);
@  	}
@ -

Style bug (extra blank line between related statements).

@  	if (k == -1) {
@ -		t = s[n2].lo * (r1 + 1) + t * q + s[n2].hi * r1 + 
@ -		    (s[n2].hi - 2);
@ +		t = SUM2P(s[n2].hi - 2, s[n2].lo * (r1 + 1) + t * q +
@ +		    s[n2].hi * r1);
@  		RETURNI(t / 2);
@  	}
@

This blank line is correct since the statements are unrelated -- the
evaluation method changes significantly.  For k = 0 and k = -1, the
evaluation is the same but we repeat it all to avoid using a variable
for (k - 1) for the 2 values of k.

@  	if (k < -7) {
@ -		t = s[n2].lo + t * (q + r1) + s[n2].hi;
@ +		t = SUM2P(s[n2].hi, s[n2].lo + t * (q + r1));
@  		RETURNI(t * twopk - 1);
@  	}
@ 
@  	if (k > 2 * LDBL_MANT_DIG - 1) {
@ -		t = s[n2].lo + t * (q + r1) + s[n2].hi;
@ +		t = SUM2P(s[n2].hi, s[n2].lo + t * (q + r1));
@  		if (k == LDBL_MAX_EXP)
@  			RETURNI(t * 2 * 0x1p16383L - 1);

Ignore all the other changes in this hunk.

Bruce