From owner-freebsd-numerics@freebsd.org Thu May 18 06:35:26 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C5B41D72F03 for ; Thu, 18 May 2017 06:35:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 71E551DC2 for ; Thu, 18 May 2017 06:35:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 36E473C9655; Thu, 18 May 2017 16:35:22 +1000 (AEST) Date: Thu, 18 May 2017 16:34:57 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170518014226.GA63819@troutmask.apl.washington.edu> Message-ID: <20170518154820.A8280@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu> <20170517094848.A52247@besplex.bde.org> <20170517180927.GA54431@troutmask.apl.washington.edu> <20170518072636.E951@besplex.bde.org> <20170518014226.GA63819@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=WvBbCZXv c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=AT3K7O-BTnsdAHa8EAYA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 18 May 2017 06:35:26 -0000 On Wed, 17 May 2017, Steve Kargl wrote: > On Thu, May 18, 2017 at 09:25:58AM +1000, Bruce Evans wrote: >> On Wed, 17 May 2017, Steve Kargl wrote: > ... >>> As such, I've added a comment >>> >>> /* >>> * pi_hi contains the leading 56 bits of a 169 bit approximation for pi. >>> */ >> >> Why 56? 53 + 56 = 109. > > This is ld128. ld128 has precision 113. > static const long double > pi_hi = 3.14159265358979322702026593105983920e+00L, > pi_lo = 1.14423774522196636802434264184180742e-17L; > > pi_hi has 113/2 = 56 bits. > pi_lo has 113 bit. > > 56 + 113 = 169 But hi is set intentionally sloppily by converting to double, so it has 53 bits instead of the 57 that 56 would go with (57 = 113/2 rounded up). >> My improvements to lgamma()'s sin_pi*() are buggy. I forgot the original >> target of avoiding inexact better: > > I'll need to wait until the weekend to study your improved sin_pi. I now have closer to production version. sinpi() runs more than 2 times faster that your version, without doing much different except not having so much EXTRACT/INSERT or special cases. (66 cycles instead of 150 on Haswell-i386, and the 66 can be reduced to 58 by removing the STRICT_ASSIGN() which breaks only i386 with non-default extra precision. The difference would be smaller on amd64 or better compilers , but Haswell-i386 has much the same slownesses as old Athlon from the EXTRACTs and INSERTs. However, both versions fail tests. Min has a max error of 2048 ulps and an everage error of 928 ulps and yours has a maximum error of many giga-ups for the sign bit wrong and an average error of 561 ulps. The enormous errors are probably for denormals or NaNs. XX double XX ysinpi(double x) XX { XX double s,y,z; XX int32_t hx,ix; XX int n; XX XX GET_HIGH_WORD(hx,x); XX ix = hx & 0x7fffffff; XX XX if (ix < 0x3fd00000) { /* |x| < 0.25 */ XX if (ix < 0x3e200000) /* |x| < 2**-29 */ XX /* XXX: extra precision would generate spurious denormals. */ XX if(((int)x)==0) return M_PI*x; /* pi*x with inexact if x != 0 */ Note a problem with denormals. The hi+lo decomposition on values up to about 2**53 times the largest denormal creates denormals and thus spurious denormal and underflow exceptions and possibly slowness. I'm not sure if it gives correct results either. Your version has a special case for calculating Pi*x, but doesn't handle this. I think it would work to scale up into non-denormal range for the calculation. I just multiply by M_PI. I suspects this gives an error of nearly 1 ulp and will test to see if it is strictly below 1 ulp (the details depend on how closely M_PI approximates pi). XX return __kernel_sinpi(x); XX } XX XX if (ix>=0x7ff00000) return x-x; /* Inf or NaN -> NaN */ XX This classification was cloned from sin(). Then adjust constants as in your version. Not doing so much here probably gives half of the 150:66 speed improvement. Don't bother with special cases for |x| <= 2 or 2.25 as done by sinf(). We don't really want the special case for |x| <= 0.25, but need it to avoid denormals and underflow. BTW, I just noticed that I broke this case in lgamma*(). I moved the check for tiny x out of the kernels into the sin*() and cos*() callers (or kept the duplicate of it there), but didn't add it for the lgamma*() callers. I broke it for sinf() or sin() too. You broke it for sinl(). XX if (ix >= 0x43400000) return 0; /* even integer -> 0 */ XX XX y = fabs(x); XX XX STRICT_ASSIGN(double,z,y+0x1p50); XX GET_LOW_WORD(n,z); /* bits for rounded y (units 0.25) */ XX z -= 0x1p50; /* y rounded to a multiple of 0.25 */ XX if (z > y) { XX z -= 0.25; /* adjust to round down */ XX n--; XX } XX n &= 7; /* octant of y mod 2 */ XX y -= z; /* y mod 0.25 */ This is from old optimizations in lgamma() with minor improvements: - remove extra reduction step to classify integers - use STRICT_ASSIGN() instead of hard-coded volatile to avoid pessimizing arches without extra precision XX XX if (n >= 4) { XX s = -1; XX n -= 4; XX } else XX s = 1; XX if (n == 0) XX return s*__kernel_sinpi(y); /* this handles poles right */ Oops, the poles are specific to gamma(). This handles integers right. y is 0 and the kernel returns +=0 without setting inexact (depend on knowing its internals for the latter). Don't be careful about the sign. I think it is always +0 with the default rounding mode. XX else if (n == 1) XX return s*__kernel_cospi(y-0.25); XX else if (n == 2) XX return s*__kernel_cospi(y); This handles half-integers right. No problem with the sign since the kernel always returns 1. XX else XX return s*__kernel_sinpi(y-0.25); XX } This is from old optimizations in lgamma with minor improvements (but larger code changes): - don't use a case statement - don't add n * 0.25 to y only to have to a multiple of 0.25 from y in more cases later. Bruce