From owner-freebsd-numerics@freebsd.org  Thu May 18 06:35:26 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id C5B41D72F03
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Thu, 18 May 2017 06:35:26 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 71E551DC2
 for <freebsd-numerics@freebsd.org>; Thu, 18 May 2017 06:35:26 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 36E473C9655;
 Thu, 18 May 2017 16:35:22 +1000 (AEST)
Date: Thu, 18 May 2017 16:34:57 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170518014226.GA63819@troutmask.apl.washington.edu>
Message-ID: <20170518154820.A8280@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428175851.A1386@besplex.bde.org>
 <20170516224618.GA40855@troutmask.apl.washington.edu>
 <20170517094848.A52247@besplex.bde.org>
 <20170517180927.GA54431@troutmask.apl.washington.edu>
 <20170518072636.E951@besplex.bde.org>
 <20170518014226.GA63819@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=WvBbCZXv c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=AT3K7O-BTnsdAHa8EAYA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 18 May 2017 06:35:26 -0000

On Wed, 17 May 2017, Steve Kargl wrote:

> On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote:
>> On Wed, 17 May 2017, Steve Kargl wrote:
> ...
>>> As such, I've added a comment
>>>
>>> /*
>>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi.
>>> */
>>
>> Why 56?  53 + 56 = 109.
>
> This is ld128.

ld128 has precision 113.

> static const long double
> pi_hi = 3.14159265358979322702026593105983920e+00L,
> pi_lo = 1.14423774522196636802434264184180742e-17L;
>
> pi_hi has 113/2 = 56 bits.
> pi_lo has 113 bit.
>
> 56 + 113 = 169

But hi is set intentionally sloppily by converting to double, so it has
53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up).

>> My improvements to lgamma()'s sin_pi*() are buggy.  I forgot the original
>> target of avoiding inexact better:
>
> I'll need to wait until the weekend to study your improved sin_pi.

I now have closer to production version.  sinpi() runs more than 2
times faster that your version, without doing much different except
not having so much EXTRACT/INSERT or special cases.  (66 cycles instead
of 150 on Haswell-i386, and the 66 can be reduced to 58 by removing
the STRICT_ASSIGN() which breaks only i386 with non-default extra
precision.  The difference would be smaller on amd64 or better compilers
, but Haswell-i386 has much the same slownesses as old Athlon from the
EXTRACTs and INSERTs.

However, both versions fail tests.  Min has a max error of 2048 ulps
and an everage error of 928 ulps and yours has a maximum error of many
giga-ups for the sign bit wrong and an average error of 561 ulps.  The
enormous errors are probably for denormals or NaNs.

XX double
XX ysinpi(double x)
XX {
XX 	double s,y,z;
XX 	int32_t hx,ix;
XX 	int n;
XX 
XX 	GET_HIGH_WORD(hx,x);
XX 	ix = hx & 0x7fffffff;
XX 
XX 	if (ix < 0x3fd00000) {		/* |x| < 0.25 */
XX 	    if (ix < 0x3e200000)	/* |x| < 2**-29 */
XX 		/* XXX: extra precision would generate spurious denormals. */
XX 		if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */

Note a problem with denormals.  The hi+lo decomposition on values up
to about 2**53 times the largest denormal creates denormals and thus
spurious denormal and underflow exceptions and possibly slowness.  I'm
not sure if it gives correct results either.  Your version has a special
case for calculating Pi*x, but doesn't handle this.  I think it would
work to scale up into non-denormal range for the calculation.  I just
multiply by M_PI.  I suspects this gives an error of nearly 1 ulp and
will test to see if it is strictly below 1 ulp (the details depend on
how closely M_PI approximates pi).

XX 	    return __kernel_sinpi(x);
XX 	}
XX 
XX 	if (ix>=0x7ff00000) return x-x;	/* Inf or NaN -> NaN */
XX

This classification was cloned from sin().  Then adjust constants as in
your version.  Not doing so much here probably gives half of the 150:66
speed improvement.

Don't bother with special cases for |x| <= 2 or 2.25 as done by sinf().
We don't really want the special case for |x| <= 0.25, but need it to
avoid denormals and underflow.  BTW, I just noticed that I broke this
case in lgamma*().  I moved the check for tiny x out of the kernels into
the sin*() and cos*() callers (or kept the duplicate of it there), but
didn't add it for the lgamma*() callers.  I broke it for sinf() or sin()
too.  You broke it for sinl().

XX 	if (ix >= 0x43400000) return 0;	/* even integer -> 0 */
XX 
XX 	y = fabs(x);
XX 
XX 	STRICT_ASSIGN(double,z,y+0x1p50);
XX 	GET_LOW_WORD(n,z);		/* bits for rounded y (units 0.25) */
XX 	z -= 0x1p50;			/* y rounded to a multiple of 0.25 */
XX 	if (z > y) {
XX 	    z -= 0.25;			/* adjust to round down */
XX 	    n--;
XX 	}
XX 	n &= 7;				/* octant of y mod 2 */
XX 	y -= z;				/* y mod 0.25 */

This is from old optimizations in lgamma() with minor improvements:
- remove extra reduction step to classify integers
- use STRICT_ASSIGN() instead of hard-coded volatile to avoid pessimizing
   arches without extra precision

XX 
XX 	if (n >= 4) {
XX 	    s = -1;
XX 	    n -= 4;
XX 	} else
XX 	    s = 1;
XX 	if (n == 0)
XX 	    return s*__kernel_sinpi(y);	/* this handles poles right */

Oops, the poles are specific to gamma().

This handles integers right.  y is 0 and the kernel returns +=0 without
setting inexact (depend on knowing its internals for the latter).  Don't
be careful about the sign.  I think it is always +0 with the default rounding
mode.

XX 	else if (n == 1)
XX 	    return s*__kernel_cospi(y-0.25);
XX 	else if (n == 2)
XX 	    return s*__kernel_cospi(y);

This handles half-integers right.  No problem with the sign since the kernel
always returns 1.

XX 	else
XX 	    return s*__kernel_sinpi(y-0.25);
XX }

This is from old optimizations in lgamma with minor improvements (but
larger code changes):
- don't use a case statement
- don't add n * 0.25 to y only to have to a multiple of 0.25 from y in more
   cases later.

Bruce