From owner-freebsd-hackers@freebsd.org Fri Apr 28 23:36:00 2017
Date: Fri, 28 Apr 2017 16:35:52 -0700 (PDT)
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans
Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428233552.GA34580@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
In-Reply-To: <20170429070036.A4005@besplex.bde.org>

On Sat, Apr 29, 2017 at 08:09:39AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
> > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> >>> ...
> >>> The slowness comes from handling the extra precision arithmetic.
> >>> For sinpi(x) = x * p(x), p(x) = s0 + x2 * S(x2) is the polynomial
> >>> approximation and x2 = x * x.  One needs to take care in the
> >>> ordering of the evaluation where x and x2 are split into high and
> >>> low parts.  See the source code posted in response to your other
> >>> email.
> >> ...
> >> Calculating the polynomial as x * p(x) seems wrong.
> >
> > It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
> > point, one must do sinpi(x) = sin(pi*x) = x * p(x).
>
> I forgot that the linear term needs to be multiplied by Pi.
>
> Actually, the problem is only in the notation and the above description.
> s0 looks like a constant, but it is actually Pi*x in extra precision,
> and no extra precision is needed or done for x2.

I was just backtracking with __kernel_sinpif.  This gets a max ULP < 0.61.
static const float
s0hi =  3.14062500e+00f,	/* 0x40490000 */
s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
s2   =  2.55016255e+00f,	/* 0x402335dd */
s3   = -5.99202096e-01f,	/* 0xbf19654f */
s4   =  8.10018554e-02f;	/* 0x3da5e44d */

static inline float
__kernel_sinpif(float x)
{
	double s;
	float p, x2;

	x2 = x * x;
	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
	s = x * (x2 * (double)p + s0hi + s0lo);
	return ((float)s);
}

x2 and p are computed in float, but the final result is accumulated in
double.  Using a higher precision works for float, but double and long
double become problematic.

> __kernel_cospi() seems to be buggy.  It splits x2, but doesn't use
> extra precision for calculating x2.  Is extra precision really needed
> for the x2 term?

Yes.  The last part of __kernel_cospif() is a fused multiply-add.  For
small |x2|, we can do a normal summation.  That's this block of code:

	GET_FLOAT_WORD(ix, x2);
	if (ix < 0x3c800000)		/* |x2| < 0x1p-6 */
		return (c0 + c * x2);

If |x2| > 0x1p-6, then c0 + c * x2 is computed with an FMA, but it's a
little trickier.  This splits c into clo and chi, and adds in a
correction for c1:

	GET_FLOAT_WORD(ix, c);
	SET_FLOAT_WORD(chi, (ix >> 14) << 14);
	clo = c - chi + c1lo;

In the final expression,

	c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);

chi * x2hi is exact, and the parentheses are required to achieve max
ULP < ~0.88.  The grouping of the other terms does not seem to matter.

> It is like the x3 term for sin*(), with the critical
> difference that the coefficient is much larger (-1/2 instead of -1/6
> before multiplication by pi).  This requires extra accuracy for cos()
> too, but not for calculating the term.  I optimized this by changing
> a slow fdlibm method to essentially 2sum.  The x2/2 term is inaccurate
> but no significant further inaccuracies occur for adding terms.  You
> have Pi*Pi*x2/2, which is more inaccurate.  I guess the extra half an
> ulp accuracy here is too much.
> You avoid this half an ulp using extra precision, and seem to use
> lots of extra precision instead of the final 2sum step.  Probably
> much slower.
>
> Here is my modified method:
>
> 	x2 = x*x;	/* hopefully really don't need extra prec */
> 	(hi, lo) = split(x2);
> 	resid = S(x2);
> 	hi = C2_hi * hi;
> 	lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
> 	lo = lo + resid;
> #ifdef hopefully_not_needed
> 	Re-split hi+lo if lo is too large.
> #endif
> 	return 3sum(1, hi, lo)
>
> 3sum is in math_private.h and is supposed to be used instead of
> magic-looking open code.

I completely missed 3sum in math_private.h.  I did try using your 2sum
and 2sumF macros, but did not recall seeing any difference from my
hand-rolled code.  I probably didn't spend enough time looking into
whether I was using 2sum[F] properly.  I'll see if I can adapt my code
to use these macros.

> Here it would reduce to the same expressions as in __kernel_sin().
>
> 	1+hi = hi_1 + lo_1;		/* by hi-lo decomp. */
> 	return (lo + lo_1 + hi_1);	/* safely add the lo's and "carry" */
>
> > One could do, as is done with sin(x),
> >
> > 	f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
> >
> > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds the
> > complication that pi*x must be computed in extra precision.
>
> Isn't that what is done now?  Of course pi*x must be evaluated in
> extra precision, but hopefully no more is needed.  The x3 term
> would be much harder to do all in extra precision than the x2
> term for cospi().

Well, yes, I suppose that's one way to look at it.  In the
approximation sinpif(x) = x * (s0 + x2 * S(x2)), the s0 term is
nearly pi.  The s1 coefficient, as determined by Remes, is
s1 = -5.16771269, which happens to be close to -pi**3/6 = -5.16771278
(i.e., the next term in the Maclaurin series sin(pi*x) = pi*x -
(pi**3/6)*x**3 + ...).  So, the Remes coefficients have absorbed the
factors of pi.

> >> That is too inaccurate even for sin(x), and probably slower too.
> >
> > I'm not sure why you're claiming that it is too inaccurate.
>
> Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision
> gives at best an error of half an ulp for the part in parentheses
> and another half an ulp for the multiplication.  The final step
> must be an addition of a large term with a term at least 2 times
> smaller so that the error can be controlled (this depends on 1
> of the terms being exact).

Ah, okay, I understand what you're getting at.  I played a number of
games with grouping intermediate results and trying to accumulate the
high and low parts.  You're seeing the best that I can do.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow