From owner-freebsd-numerics@freebsd.org  Sat May 20 00:58:14 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 36C77D738D7
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Sat, 20 May 2017 00:58:14 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 95D5D18AD
 for <freebsd-numerics@freebsd.org>; Sat, 20 May 2017 00:58:13 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id B8DA8104973C;
 Sat, 20 May 2017 10:58:04 +1000 (AEST)
Date: Sat, 20 May 2017 10:58:03 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
cc: Steve Kargl <sgk@troutmask.apl.washington.edu>, 
 freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170519153326.A20099@besplex.bde.org>
Message-ID: <20170520105748.X1017@besplex.bde.org>
References: <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428175851.A1386@besplex.bde.org>
 <20170516224618.GA40855@troutmask.apl.washington.edu>
 <20170517094848.A52247@besplex.bde.org>
 <20170517180927.GA54431@troutmask.apl.washington.edu>
 <20170518072636.E951@besplex.bde.org>
 <20170518014226.GA63819@troutmask.apl.washington.edu>
 <20170518154820.A8280@besplex.bde.org>
 <20170518175434.GA74453@troutmask.apl.washington.edu>
 <20170519074512.M884@besplex.bde.org>
 <20170518234722.GB77471@troutmask.apl.washington.edu>
 <20170519153326.A20099@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=WvBbCZXv c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=rYtKBivQexIEMgcjFssA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 20 May 2017 00:58:14 -0000

On Fri, 19 May 2017, Bruce Evans wrote:

> I fixed all bugs found by my tests:
> ...
> - sinpi(), cospi() and tanpi() were broken on i386 with extra precision.
>  Now they have a smaller error than sin(), etc. (about 0.55 ulps; the
>  bug gave 1.5 ulps).
> ...
>
> XX diff -u2 k_cospi.c~ k_cospi.c
> XX --- k_cospi.c~	Fri May 19 09:05:14 2017
> XX +++ k_cospi.c	Fri May 19 12:02:24 2017
> XX @@ -35,5 +35,8 @@
> XX  __kernel_cospi(double x)
> XX  {
> XX -	double hi, lo;
> XX +	/* XXX: volatile hack to abuse _2sumF(): */
> XX +	volatile double hi;
> XX +	double lo;
> XX +
> XX  	hi = (float)x;
> XX  	lo = x - hi;
>
> This hack is sufficent to get strict assignment to hi.  It is a bit slow.
> It costs about 30% on i386 (70 cycles increased to 90) in my version.

I fixed this more cleanly in the infrastructure:

XX diff -u2 math_private.h~ math_private.h
XX --- math_private.h~	Tue Jun  4 21:39:14 2013
XX +++ math_private.h	Sun Nov 27 17:58:57 2005
XX @@ -21,4 +21,6 @@
XX  #include <machine/endian.h>
XX 
XX +#include <float.h>		/* XXX: pollution for STRICT_ASSIGN() */
XX +
XX  /*
XX   * The original fdlibm code used statements like:
XX @@ -322,5 +537,5 @@
XX  	__typeof(a) __s, __w;	\
XX  				\
XX -	__w = (a) + (b);	\
XX +	STRICT_ASSIGN(__typeof(a), __w, (a) + (b)); \
XX  	__s = __w - (a);	\
XX  	(b) = ((a) - (__w - __s)) + ((b) - __s); \
XX @@ -355,4 +570,8 @@
XX   * reduce their own extra-precision and efficiciency problems.  In
XX   * particular, they shouldn't convert back and forth just to call here.
XX + *
XX + * Update: the 2sum algorithms now use STRICT_ASSIGN(), so they should work
XX + * with any FP type, and only be slow if that type is not float_t, double_t
XX + * or long double.
XX   */
XX  #ifdef DEBUG
XX @@ -365,5 +584,5 @@
XX  	assert(__ia == 0 || fabsl(__ia) >= fabsl(__ib));	\
XX  							\
XX -	__w = (a) + (b);				\
XX +	STRICT_ASSIGN(__typeof(a), __w, (a) + (b));	\
XX  	(b) = ((a) - __w) + (b);			\
XX  	(a) = __w;					\
XX @@ -380,5 +599,5 @@
XX  	__typeof(a) __w;	\
XX  				\
XX -	__w = (a) + (b);	\
XX +	STRICT_ASSIGN(__typeof(a), __w, (a) + (b)); \
XX  	(b) = ((a) - __w) + (b); \
XX  	(a) = __w;		\

Other fixes: my tests only passed because they were on i386 with extra
precision for doubles.  On amd64 and on i386 with the default precision
for doubles, the problem with the hi+lo decomposition for denormals
showed up as errors of about 1.5 ulps for sinpif(), tanpif(), sinpi()
and tanpi().  Previously-discussed fixes for this worked.  I rewrote
them from scratch (not very cleanly -- they should be in a header file,
as should be the kernels.  1 header file can keep kernels and functions
to multiply by pi for all precisions):

XX diff -u2 s_sinpi.c~ s_sinpi.c
XX --- s_sinpi.c~	Fri May 19 09:05:16 2017
XX +++ s_sinpi.c	Fri May 19 19:52:07 2017
XX @@ -77,6 +77,5 @@
XX  sinpi(double x)
XX  {
XX -	volatile static const double vzero = 0;
XX -	double ax, s;
XX +	double ax, hi, lo, s;
XX  	uint32_t hx, ix, j0, lx;
XX 
XX @@ -87,12 +86,15 @@
XX  	if (ix < 0x3ff00000) {			/* |x| < 1 */
XX  		if (ix < 0x3fd00000) {		/* |x| < 0.25 */
XX -			if (ix < 0x3e200000) {	/* |x| < 0x1p-29 */
XX +			if (ix < 0x3de00000) {	/* |x| < 2**-33 */
XX  				if (x == 0)
XX  					return (x);
XX -				INSERT_WORDS(ax, hx, 0);
XX -				x -= ax;
XX -				s = pilo * x + pihi * x + pilo * ax
XX -				    + pihi * ax;
XX -				return (s);
XX +				if ((int)x == 1) /* gen. inexact if x != 0 */
XX +					return (x);	/* NOTREACHED */
XX +				hi = x;
XX +				SET_LOW_WORD(hi, 0);
XX +				hi *= 0x1p1000;
XX +				lo = 0x1p1000 * x  - hi;
XX +				return ((0x1p1000 * pilo * x + pihi * lo +
XX +				    pihi * hi) * 0x1p-1000);
XX  			}
XX

This also fixes the style bugs of clobbering x and ax to hold lo and hi,
and the pessimization of using INSER_WORDS instead of SET_LOW_WORD(),
and the pessimization of not combining low terms, and the bug of not
always setting inexact for nonzero x (e.g., for x = 2**-34), and the
style bug and pessimization of handling the sign explicitly (rounding
to nearest gives oddness automatically exept for x = 0), and has 2
older fixes.

Multiplication by pi is remarkably complicated.  I noticed that you
already avoided the trap of using the usual method of truncating x
by converting to float -- this fails for denormals and other small
values since floats don't have enough range.  Converting long doubles
fails similarly.

So 2 different methods for multiplying by pi are needed:
- a slower general method that works for small values (use bit-fiddling
     to truncate), and scale up
- the usual method of casting to float for medium-sized values
- mercifully, a third method that works on large values is not needed.
These should be in a header file and not duplicated.  But I used the
quick fix of duplicating the above for tanpi().

XX diff -u2 s_sinpif.c~ s_sinpif.c
XX --- s_sinpif.c~	Fri May 19 09:05:16 2017
XX +++ s_sinpif.c	Fri May 19 19:49:33 2017
XX @@ -46,5 +46,4 @@
XX  sinpif(float x)
XX  {
XX -	volatile static const float vzero = 0;
XX  	float ax, s;
XX  	uint32_t hx, ix, j0;
XX @@ -56,14 +55,7 @@
XX  	if (ix < 0x3f800000) {			/* |x| < 1 */
XX  		if (ix < 0x3e800000) {		/* |x| < 0.25 */
XX -	 		if (ix < 0x38800000) {	/* |x| < 0x1p-14 */
XX -				if (x == 0)
XX -					return (x);
XX -				SET_FLOAT_WORD(ax, hx & 0xffff0000);
XX -				x -= ax;
XX -				s = pilo * x + pihi * x + pilo * ax
XX -				    + pihi * ax;
XX -				return (s);
XX -			}
XX -
XX +	 		if (ix < 0x34800000)	/* |x| < 2**-22 */
XX +				if ((int)x == 0) /* gen. inexact if x != 0 */
XX +					return (M_PI * x);
XX  			s = __kernel_sinpif(ax);
XX  			return ((hx & 0x80000000) ? -s : s);

This is simpler because it can just use double precision.  Also fix missing
setting of inexact and 1 style bug.  This can use essentially the same code
as sinf().  It doesn't need a special case for x == 0 since multiplication
if +-0 by M_PI preserves the sign.  I didn't fix the style bug and pessimal
sign handling for non-small x.  This should be the same as for sinf() too.
The kernel preserves oddness in the default rounding mode except for x == 0
which is not handled by the kernel.

My sinpi() needed similar fixes for the scaling.  I changed all of its
formatting away from fdlibm towards KNF:

XX #include <float.h>
XX 
XX #include "math.h"
XX #include "math_private.h"
XX 
XX static const double
XX pi_hi = 3.1415926218032837e+00,	/* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08;	/* 0x3e6110b4 0x611a5f14 */
XX 
XX static double
XX __kernel_cospi(double x)
XX {
XX 	double hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return (__kernel_cos(hi, lo));
XX }
XX 
XX static double
XX __kernel_sinpi(double x)
XX {
XX 	double hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return (__kernel_sin(hi, lo, 1));
XX }
XX 
XX static inline double
XX __kernel_mpi(double x)
XX {
XX 	double hi, lo;
XX 	int32_t hx, lx;
XX 
XX 	EXTRACT_WORDS(hx, lx, x);
XX 	hi = x;
XX 	SET_LOW_WORD(hi, 0);
XX 	hi *= 0x1p1000;
XX 	lo = 0x1p1000 * x  - hi;
XX 	return ((0x1p1000 * pi_lo * x + pi_hi * lo + pi_hi * hi) * 0x1p-1000);
XX }

This is cleaner, but not yet suitable for a kernel in a header file since
it does a pessimal second extraction (which should be only of the high word
again).  Since this is inline, perhaps the compile can optimize away the
second extraction.  It can do this easily only if the extractions are the
same.  If the compiler can't do this (perhaps because this isn't inlined),
then hx (and lx if available) should be passed to the kernel.

The above isn't a general multiplication kernel as originally intended,
since it has to scale up to handle small values, so doesn't work for
large values.  Actually, it would work up to 0x1p53, but be slower.

The amount of scaling up is tricky to determine.  0x1p1000 is about
half-way in the range of ~2048 exponents, to scale up denormals to
nearly 1 so that they obviously aren't denormals.

XX 
XX double
XX ysinpi(double x)
XX {
XX 	double s, y, z;
XX 	int32_t hx, ix;
XX 	int n;
XX 
XX 	GET_HIGH_WORD(hx, x);
XX 	ix = hx & 0x7fffffff;
XX 
XX 	if (ix < 0x3fd00000) {		/* |x| < 0.25 */
XX 		if (ix < 0x3de00000) {	/* |x| < 2**-33 */
XX 			if (x == 0)
XX 				return (x);
XX 			if ((int)x == 0) /* generate inexact if x != 0 */
XX 				return __kernel_mpi(x);
XX 		}
XX 		return (__kernel_sinpi(x));
XX 	}
XX 	if (ix >= 0x7ff00000)		/* Inf or NaN -> NaN */
XX 		return (x - x);
XX 	if (ix >= 0x43300000)		/* 2**52 <= |x| (integer) -> +-0 */
XX 		return (x < 0 ? -0.0 : 0);
XX 	if (ix >= 0x43200000) {		/* 2**51 <= |x| < 2**52 (int/2) */
XX 		GET_LOW_WORD(n, x);
XX 		n = (n & 3) * 2;
XX 		y = 0;
XX 	} else if (ix >= 0x43100000) {	/* 2**50 <= |x| < 2**51 (int/4) */
XX 		GET_LOW_WORD(n, x);
XX 		n &= 7;
XX 		y = 0;
XX 	} else {
XX 		y = fabs(x);
XX 		STRICT_ASSIGN(double, z, y + 0x1p50);
XX 		GET_LOW_WORD(n, z);	/* bits for rounded y (units 0.25) */
XX 		z -= 0x1p50;		/* y rounded to a multiple of 0.25 */
XX 		if (z > y) {
XX 			z -= 0.25;	/* adjust to round down */
XX 			n--;
XX 		}
XX 		n &= 7;			/* octant of y mod 2 */
XX 		y -= z;			/* y mod 0.25 */
XX 	}
XX 
XX 	s = (n >= 3 && n < 7) ? -1 : 1;
XX 	if (x < 0)
XX 		s = -s;
XX 	n &= 3;
XX 	if (y == 0 && n == 0)
XX 		return (x < 0 ? -0.0 : 0);
XX 	if (n == 0)
XX 		return (s * __kernel_sinpi(y));
XX 	if (n == 1)
XX 		return (s * __kernel_cospi(y - 0.25));
XX 	if (n == 2)
XX 		return (s * __kernel_cospi(y));
XX 	return (s * __kernel_sinpi(y - 0.25));
XX }

With better-designed kernels, there would be no kernels for cospi or sinpi,
and the code would look something like:

   		return (s * __k_cos(k_mpi(y)));

where:
- __k_cos() is essentially __kernel_cos() with the following fixes:
     - it must take double_t args to work optimally and simply with _2sumF()
     - these args should be packaged in a struct for convenience
     - rename it because its API changed.  Fix its verbose name.  I'd like to
       remove the leading underscores too, but inlining many copies of it
       would be too much.  It isn't even inlined now, and inlining it tends
       to be a pessimizaition.
- k_mpi() is the first part of __kernel_cospi() with the hi+lo decomposition
     packaged as a struct for convenience.  It takes a fairly smart compiler
     to not pessimize such structs (gcc-3.3.3 pessimizes them, but gcc-4.2.1
     mostly handles them optimally, and clang always handles them optimally).
     k_mpi() should always be inline so it doesn't need underscores.

The kernels can't quite be type-generic without complications.  The
truncation part depends on the precision.

The fix for _2sumF() is needed mainly because the current kernels don't
support float_t or double_t (except the internal one in logl(), and
my uncommitted ones for exp() and expf()).  Requiring use types without
extra precision for _2sumF() was intentional to promote use of double_t,
etc., but I forgot that the old kernels didn't support it.

Bruce