From owner-freebsd-numerics@freebsd.org Sat May 20 00:58:14 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 36C77D738D7 for ; Sat, 20 May 2017 00:58:14 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 95D5D18AD for ; Sat, 20 May 2017 00:58:13 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id B8DA8104973C; Sat, 20 May 2017 10:58:04 +1000 (AEST) Date: Sat, 20 May 2017 10:58:03 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Steve Kargl , freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170519153326.A20099@besplex.bde.org> Message-ID: <20170520105748.X1017@besplex.bde.org> References: <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu> <20170517094848.A52247@besplex.bde.org> <20170517180927.GA54431@troutmask.apl.washington.edu> <20170518072636.E951@besplex.bde.org> <20170518014226.GA63819@troutmask.apl.washington.edu> <20170518154820.A8280@besplex.bde.org> <20170518175434.GA74453@troutmask.apl.washington.edu> <20170519074512.M884@besplex.bde.org> <20170518234722.GB77471@troutmask.apl.washington.edu> <20170519153326.A20099@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=WvBbCZXv c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=rYtKBivQexIEMgcjFssA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 20 May 2017 00:58:14 -0000 On Fri, 19 May 2017, Bruce Evans wrote: > I fixed all bugs found by my tests: > ... > - sinpi(), cospi() and tanpi() were broken on i386 with extra precision. > Now they have a smaller error than sin(), etc. (about 0.55 ulps; the > bug gave 1.5 ulps). > ... > > XX diff -u2 k_cospi.c~ k_cospi.c > XX --- k_cospi.c~ Fri May 19 09:05:14 2017 > XX +++ k_cospi.c Fri May 19 12:02:24 2017 > XX @@ -35,5 +35,8 @@ > XX __kernel_cospi(double x) > XX { > XX - double hi, lo; > XX + /* XXX: volatile hack to abuse _2sumF(): */ > XX + volatile double hi; > XX + double lo; > XX + > XX hi = (float)x; > XX lo = x - hi; > > This hack is sufficent to get strict assignment to hi. It is a bit slow. > It costs about 30% on i386 (70 cycles increased to 90) in my version. I fixed this more cleanly in the infrastructure: XX diff -u2 math_private.h~ math_private.h XX --- math_private.h~ Tue Jun 4 21:39:14 2013 XX +++ math_private.h Sun Nov 27 17:58:57 2005 XX @@ -21,4 +21,6 @@ XX #include XX XX +#include /* XXX: pollution for STRICT_ASSIGN() */ XX + XX /* XX * The original fdlibm code used statements like: XX @@ -322,5 +537,5 @@ XX __typeof(a) __s, __w; \ XX \ XX - __w = (a) + (b); \ XX + STRICT_ASSIGN(__typeof(a), __w, (a) + (b)); \ XX __s = __w - (a); \ XX (b) = ((a) - (__w - __s)) + ((b) - __s); \ XX @@ -355,4 +570,8 @@ XX * reduce their own extra-precision and efficiciency problems. In XX * particular, they shouldn't convert back and forth just to call here. XX + * XX + * Update: the 2sum algorithms now use STRICT_ASSIGN(), so they should work XX + * with any FP type, and only be slow if that type is not float_t, double_t XX + * or long double. XX */ XX #ifdef DEBUG XX @@ -365,5 +584,5 @@ XX assert(__ia == 0 || fabsl(__ia) >= fabsl(__ib)); \ XX \ XX - __w = (a) + (b); \ XX + STRICT_ASSIGN(__typeof(a), __w, (a) + (b)); \ XX (b) = ((a) - __w) + (b); \ XX (a) = __w; \ XX @@ -380,5 +599,5 @@ XX __typeof(a) __w; \ XX \ XX - __w = (a) + (b); \ XX + STRICT_ASSIGN(__typeof(a), __w, (a) + (b)); \ XX (b) = ((a) - __w) + (b); \ XX (a) = __w; \ Other fixes: my tests only passed because they were on i386 with extra precision for doubles. On amd64 and on i386 with the default precision for doubles, the problem with the hi+lo decomposition for denormals showed up as errors of about 1.5 ulps for sinpif(), tanpif(), sinpi() and tanpi(). Previously-discussed fixes for this worked. I rewrote them from scratch (not very cleanly -- they should be in a header file, as should be the kernels. 1 header file can keep kernels and functions to multiply by pi for all precisions): XX diff -u2 s_sinpi.c~ s_sinpi.c XX --- s_sinpi.c~ Fri May 19 09:05:16 2017 XX +++ s_sinpi.c Fri May 19 19:52:07 2017 XX @@ -77,6 +77,5 @@ XX sinpi(double x) XX { XX - volatile static const double vzero = 0; XX - double ax, s; XX + double ax, hi, lo, s; XX uint32_t hx, ix, j0, lx; XX XX @@ -87,12 +86,15 @@ XX if (ix < 0x3ff00000) { /* |x| < 1 */ XX if (ix < 0x3fd00000) { /* |x| < 0.25 */ XX - if (ix < 0x3e200000) { /* |x| < 0x1p-29 */ XX + if (ix < 0x3de00000) { /* |x| < 2**-33 */ XX if (x == 0) XX return (x); XX - INSERT_WORDS(ax, hx, 0); XX - x -= ax; XX - s = pilo * x + pihi * x + pilo * ax XX - + pihi * ax; XX - return (s); XX + if ((int)x == 1) /* gen. inexact if x != 0 */ XX + return (x); /* NOTREACHED */ XX + hi = x; XX + SET_LOW_WORD(hi, 0); XX + hi *= 0x1p1000; XX + lo = 0x1p1000 * x - hi; XX + return ((0x1p1000 * pilo * x + pihi * lo + XX + pihi * hi) * 0x1p-1000); XX } XX This also fixes the style bugs of clobbering x and ax to hold lo and hi, and the pessimization of using INSER_WORDS instead of SET_LOW_WORD(), and the pessimization of not combining low terms, and the bug of not always setting inexact for nonzero x (e.g., for x = 2**-34), and the style bug and pessimization of handling the sign explicitly (rounding to nearest gives oddness automatically exept for x = 0), and has 2 older fixes. Multiplication by pi is remarkably complicated. I noticed that you already avoided the trap of using the usual method of truncating x by converting to float -- this fails for denormals and other small values since floats don't have enough range. Converting long doubles fails similarly. So 2 different methods for multiplying by pi are needed: - a slower general method that works for small values (use bit-fiddling to truncate), and scale up - the usual method of casting to float for medium-sized values - mercifully, a third method that works on large values is not needed. These should be in a header file and not duplicated. But I used the quick fix of duplicating the above for tanpi(). XX diff -u2 s_sinpif.c~ s_sinpif.c XX --- s_sinpif.c~ Fri May 19 09:05:16 2017 XX +++ s_sinpif.c Fri May 19 19:49:33 2017 XX @@ -46,5 +46,4 @@ XX sinpif(float x) XX { XX - volatile static const float vzero = 0; XX float ax, s; XX uint32_t hx, ix, j0; XX @@ -56,14 +55,7 @@ XX if (ix < 0x3f800000) { /* |x| < 1 */ XX if (ix < 0x3e800000) { /* |x| < 0.25 */ XX - if (ix < 0x38800000) { /* |x| < 0x1p-14 */ XX - if (x == 0) XX - return (x); XX - SET_FLOAT_WORD(ax, hx & 0xffff0000); XX - x -= ax; XX - s = pilo * x + pihi * x + pilo * ax XX - + pihi * ax; XX - return (s); XX - } XX - XX + if (ix < 0x34800000) /* |x| < 2**-22 */ XX + if ((int)x == 0) /* gen. inexact if x != 0 */ XX + return (M_PI * x); XX s = __kernel_sinpif(ax); XX return ((hx & 0x80000000) ? -s : s); This is simpler because it can just use double precision. Also fix missing setting of inexact and 1 style bug. This can use essentially the same code as sinf(). It doesn't need a special case for x == 0 since multiplication if +-0 by M_PI preserves the sign. I didn't fix the style bug and pessimal sign handling for non-small x. This should be the same as for sinf() too. The kernel preserves oddness in the default rounding mode except for x == 0 which is not handled by the kernel. My sinpi() needed similar fixes for the scaling. I changed all of its formatting away from fdlibm towards KNF: XX #include XX XX #include "math.h" XX #include "math_private.h" XX XX static const double XX pi_hi = 3.1415926218032837e+00, /* 0x400921fb 0x50000000 */ XX pi_lo = 3.1786509547050787e-08; /* 0x3e6110b4 0x611a5f14 */ XX XX static double XX __kernel_cospi(double x) XX { XX double hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; XX hi = pi_hi * hi; XX _2sumF(hi, lo); XX return (__kernel_cos(hi, lo)); XX } XX XX static double XX __kernel_sinpi(double x) XX { XX double hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; XX hi = pi_hi * hi; XX _2sumF(hi, lo); XX return (__kernel_sin(hi, lo, 1)); XX } XX XX static inline double XX __kernel_mpi(double x) XX { XX double hi, lo; XX int32_t hx, lx; XX XX EXTRACT_WORDS(hx, lx, x); XX hi = x; XX SET_LOW_WORD(hi, 0); XX hi *= 0x1p1000; XX lo = 0x1p1000 * x - hi; XX return ((0x1p1000 * pi_lo * x + pi_hi * lo + pi_hi * hi) * 0x1p-1000); XX } This is cleaner, but not yet suitable for a kernel in a header file since it does a pessimal second extraction (which should be only of the high word again). Since this is inline, perhaps the compile can optimize away the second extraction. It can do this easily only if the extractions are the same. If the compiler can't do this (perhaps because this isn't inlined), then hx (and lx if available) should be passed to the kernel. The above isn't a general multiplication kernel as originally intended, since it has to scale up to handle small values, so doesn't work for large values. Actually, it would work up to 0x1p53, but be slower. The amount of scaling up is tricky to determine. 0x1p1000 is about half-way in the range of ~2048 exponents, to scale up denormals to nearly 1 so that they obviously aren't denormals. XX XX double XX ysinpi(double x) XX { XX double s, y, z; XX int32_t hx, ix; XX int n; XX XX GET_HIGH_WORD(hx, x); XX ix = hx & 0x7fffffff; XX XX if (ix < 0x3fd00000) { /* |x| < 0.25 */ XX if (ix < 0x3de00000) { /* |x| < 2**-33 */ XX if (x == 0) XX return (x); XX if ((int)x == 0) /* generate inexact if x != 0 */ XX return __kernel_mpi(x); XX } XX return (__kernel_sinpi(x)); XX } XX if (ix >= 0x7ff00000) /* Inf or NaN -> NaN */ XX return (x - x); XX if (ix >= 0x43300000) /* 2**52 <= |x| (integer) -> +-0 */ XX return (x < 0 ? -0.0 : 0); XX if (ix >= 0x43200000) { /* 2**51 <= |x| < 2**52 (int/2) */ XX GET_LOW_WORD(n, x); XX n = (n & 3) * 2; XX y = 0; XX } else if (ix >= 0x43100000) { /* 2**50 <= |x| < 2**51 (int/4) */ XX GET_LOW_WORD(n, x); XX n &= 7; XX y = 0; XX } else { XX y = fabs(x); XX STRICT_ASSIGN(double, z, y + 0x1p50); XX GET_LOW_WORD(n, z); /* bits for rounded y (units 0.25) */ XX z -= 0x1p50; /* y rounded to a multiple of 0.25 */ XX if (z > y) { XX z -= 0.25; /* adjust to round down */ XX n--; XX } XX n &= 7; /* octant of y mod 2 */ XX y -= z; /* y mod 0.25 */ XX } XX XX s = (n >= 3 && n < 7) ? -1 : 1; XX if (x < 0) XX s = -s; XX n &= 3; XX if (y == 0 && n == 0) XX return (x < 0 ? -0.0 : 0); XX if (n == 0) XX return (s * __kernel_sinpi(y)); XX if (n == 1) XX return (s * __kernel_cospi(y - 0.25)); XX if (n == 2) XX return (s * __kernel_cospi(y)); XX return (s * __kernel_sinpi(y - 0.25)); XX } With better-designed kernels, there would be no kernels for cospi or sinpi, and the code would look something like: return (s * __k_cos(k_mpi(y))); where: - __k_cos() is essentially __kernel_cos() with the following fixes: - it must take double_t args to work optimally and simply with _2sumF() - these args should be packaged in a struct for convenience - rename it because its API changed. Fix its verbose name. I'd like to remove the leading underscores too, but inlining many copies of it would be too much. It isn't even inlined now, and inlining it tends to be a pessimizaition. - k_mpi() is the first part of __kernel_cospi() with the hi+lo decomposition packaged as a struct for convenience. It takes a fairly smart compiler to not pessimize such structs (gcc-3.3.3 pessimizes them, but gcc-4.2.1 mostly handles them optimally, and clang always handles them optimally). k_mpi() should always be inline so it doesn't need underscores. The kernels can't quite be type-generic without complications. The truncation part depends on the precision. The fix for _2sumF() is needed mainly because the current kernels don't support float_t or double_t (except the internal one in logl(), and my uncommitted ones for exp() and expf()). Requiring use types without extra precision for _2sumF() was intentional to promote use of double_t, etc., but I forgot that the old kernels didn't support it. Bruce