Date:        Thu, 9 Mar 2017 11:51:34 -0800
From:        Steve Kargl <sgk@troutmask.apl.washington.edu>
To:          Bruce Evans <brde@optusnet.com.au>
Cc:          freebsd-numerics@freebsd.org
Subject:     Re: Bit twiddling question
Message-ID:  <20170309195134.GA36213@troutmask.apl.washington.edu>
In-Reply-To: <20170310025417.U3723@besplex.bde.org>
References:  <20170308202417.GA23103@troutmask.apl.washington.edu>
 <20170309173152.F1269@besplex.bde.org>
 <20170309075236.GA30199@troutmask.apl.washington.edu>
 <20170309152307.GA32781@troutmask.apl.washington.edu>
 <20170310025417.U3723@besplex.bde.org>
On Fri, Mar 10, 2017 at 05:02:13AM +1100, Bruce Evans wrote:
> On Thu, 9 Mar 2017, Steve Kargl wrote:
>
> > On Wed, Mar 08, 2017 at 11:52:36PM -0800, Steve Kargl wrote:
> >> To give a hint at what I have been working on, I have
> >> implementations for sinpif(x), sinpi(x), cospif(x), and cospi(x).
> >> For sinpif(x) and cospif(x) I have kernels that give correctly
> >> rounded results for FLT_MIN <= x < 0x1p23 (at least on x86_64).
> >> I also have slightly
>
> I didn't see the original mail.

Odd.  Maybe munched by a spam filter.

> I'd prefer you to finish the long double versions of cosl(), sinl()
> and tanl() (maybe das is responsible for the latter).  These are
> still too slow and have minor bugs for at least sinl() on small x,
> partly by not copying the lower-precision versions enough.

sinl, cosl, and tanl are good enough for my purposes at the moment.
tgammal and powl are higher on the list of things to fix.  But
sinpi[fl] and company are high on the list because: (1) I can use
those in my actual job; and (2) they have the desirable property that
sinpi(x) = 0 for integer x, whereas sin(pi*x) = 0.mumbleE-17 (or even
E-16).

> >> faster kernels that give max ULPs of about 0.5008.  For sinpi(x)
> >> and
>
> Similar to sinf() and cosf().  The maximum for those is 0.5009, using
> double precision calculations which I now know to be a mistake, and a
> polynomial of degree higher than necessary for good enough rounding
> in float precision and lower than necessary for perfect rounding in
> float precision.

I have three polynomials for __kernel_sinpif() over x in [0,1/4].  All
are accumulated in double precision.  With a 6th-order polynomial
(12 double FOP), I get a max ULP of 0.5.  With a 5th-order polynomial
(10 double FOP), I get a max ULP of 0.50000006.  With the 4th-order
polynomial (8 double FOP), the max ULP is 0.50007051.  Note: in my
Remez algorithm I used f(x) = sin(pi*x) / x.  I still haven't found a
way to round 20*53-bit coefficients to a desired precision, which here
is 53 bits.  (Well, I have thought about a recursive Remez algorithm,
but it requires much more work/time than I can commit.)

I also have three polynomials for __kernel_cospif() over the same
interval.  The 6th order gives a max ULP of 0.5, the 5th order gives
0.50000089, and the 4th order gives 0.5008257.

> >> cospi(x), my kernels currently give about 1.25 ULP for the
> >> worst-case error if I accumulate the polynomials in double
> >> precision.  If I
>
> Not too good.  I think the problem is that the coeffs for the lowest
> terms are no longer exactly representable.  The linear function
> f(x) = a*x is harder to evaluate accurately than exp(x) = 1 + x + ...,
> unless a is exactly representable and also has few digits (like 1 or
> -0.5 for early coeffs of sin() and cos()).

I agree.  I'm still working on the double routines and starting to
contemplate the ld80 and ld128 ones.

> >> accumulate the polynomials in long double precision, then I get
> >> correctly rounded results.  To complete the set, I was hoping to
>
> Don't do that.  It is much slower, and not available starting with
> long double precision, or even starting with double precision on
> arches without real long doubles.  It is also a mistake to use it for
> float precision like I did for trig functions and an uncommitted
> version of logf().
> I managed to make the specialized logf() run 50% faster than the less
> specialized log() (also uncommitted), by using double precision and
> manually scheduling for Athlon-XP, but modern CPUs and compilers
> handle the less specialized log() so well that the specialized logf()
> is only a couple of cycles faster.  Similarly but less pronounced for
> trig functions.

Don't worry.  I have no intention of using long double to accumulate
the polynomials.  I was just curious to see if the extra 11 bits would
reduce the max ULP from 1.25 to something below 1.  I was surprised at
how well those 11 bits worked.  I also tried fma() in my Horner's rule
evaluation of the polynomial, but that did not help.

> >> work out ld80 and ld128 versions.  ld128 is going to be
> >> problematic due to the absence of int128_t.
> >>
> >> I'll send you what I have in a few days.
> >
> > It is sometimes amazing what happens when I sleep.  For sinpi[fl](x)
> > and the method I came up with, one needs to do argument reduction
> > where x = n + r.  Here n is an integer and 0 < r < 1.  If n is even
> > then one is in the positive half-cycle of sin(x), and if n is odd
> > then one has the negative half-cycle.  For sinpif(x) this looks like
> >
> > 	if (ix < 0x4b000000) {	/* 1 <= |x| < 0x1p23 */
> > /* 1 */		ix = (uint32_t)ax;
> > /* 2 */		ax = ax == ix ? zero : __compute_sinpif(ax - ix);
> > /* 3 */		if (ix & 1) ax = -ax;
> > /* 4 */		return ((hx & 0x80000000) ? -ax : ax);
> > 	}
> >
> > Line 1 determines n.  Line 2 computes either sinpi(n) exactly or
> > sinpi(r).  Line 3 uses n odd to set the sign for the half-cycle.
> > Line 4 is used to set the sign from sin(-x) = -sin(x).
> >
> > For double and ld80 one can use int64_t instead of uint32_t.  Last
> > night I realized that I don't need to use int64_t, and more
> > importantly I don't need int128_t.  The argument reduction can use
> > uint32_t everywhere.  For double, I currently have
> >
> > 	if (ix < 0x43300000) {	/* 1 <= |x| < 0x1p52 */
> > 		int64_t n;
> > 		n = (int64_t)ax;
> > 		ax = ax == n ? zero : __compute_sinpi(ax - n);
> > 		if (n & 1) ax = -ax;
> > 		return ((hx & 0x80000000) ? -ax : ax);
> > 	}
> >
> > which can be written (off the top of my head and not tested)
> >
> > 	if (ix < 0x43300000) {	/* 1 <= |x| < 0x1p52 */
> > 		double volatile vx;
> > 		uint32_t n;
> > 		vx = ax + 0x1p52;
> > 		vx = vx - 0x1p52;
> > 		if (vx == ax) return(0);
> > 		if (vx > UINT32_MAX) {	/* ax = m + n + r with m + n */
> > 			vx -= UINT32_MAX;  /* m = UINT32_MAX.  n is in range */
> > 			ax -= UINT32_MAX;
> > 		}
> > 		n = (uint32_t)vx;
> > 		ax = __compute_sinpi(ax - n);
> > 		if (n & 1) ax = -ax;
> > 		return ((hx & 0x80000000) ? -ax : ax);
> > 	}
> >
> > Something similar can be applied to ld128, but reduction may take
> > two rounds (i.e., a comparison with UINT64_MAX and then UINT32_MAX).
>
> I don't like that much.  The 0x1p52 magic is tricky and probably
> buggy.  s_rint.c uses the same magic but needs many complications to
> avoid double rounding.

Yep.  Seems to have some issue.  From my working copy of s_sinpif.c,
I tried

#if 0
	ix = (uint32_t)ax;
	ax = ax == ix ? zero : __compute_sinpif(ax - ix);
	if (ix & 1) ax = -ax;
	return ((hx & 0x80000000) ? -ax : ax);
#else
	volatile float vx;
	float y;
	vx = ax + 0x1p23F;
	y = vx - 0x1p23F;
	ix = (uint32_t)y;
	ax = ax == y ? zero : __compute_sinpif(ax - ix);
	if (ix & 1) ax = -ax;
	return ((hx & 0x80000000) ? -ax : ax);
#endif

My max ULP went from 0.5 to 0.50398505 for exhaustive testing in the
interval [1,100].
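Pulling the magic-constant reduction out into a tiny standalone helper
makes it easier to poke at in isolation.  This is only a sketch (the
helper name is made up): it assumes the default round-to-nearest mode
and |ax| < 0x1p23, and note that it rounds to the nearest integer,
while the plain (uint32_t) cast truncates toward zero, so the two
branches above do not always pick the same n.

	#include <stdint.h>

	/*
	 * Sketch only: round a float with |ax| < 0x1p23 to the nearest
	 * integer using the 0x1p23F magic constant.  Under the default
	 * round-to-nearest mode, ax + 0x1p23F lands where the float
	 * spacing is 1, so the add/subtract pair leaves ax rounded to
	 * the nearest integer (ties to even).  The volatile keeps the
	 * compiler from folding the pair away.  Unlike the (uint32_t)
	 * cast, this rounds to nearest rather than truncating.
	 */
	static float
	round_to_int_magic(float ax, uint32_t *n)
	{
		volatile float vx;
		float y;

		vx = ax + 0x1p23F;
		y = vx - 0x1p23F;
		*n = (uint32_t)y;	/* y is an integer that fits in 32 bits */
		return (y);
	}

With this, the #else branch above is essentially
y = round_to_int_magic(ax, &ix) followed by the same kernel call and
sign fixups.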
If I grab the worst-case x and use the '#if 1' code (i.e., the plain
cast), I see

	./testf -S -a 1.50121641e+00f
	a         =  1.50121641e+00f,	/* 0x3fc027dc */
	sinpif(a) = -9.99992669e-01f,	/* 0xbf7fff85 */
	sin(pi*a) = -9.99992669e-01f,	/* 0xbf7fff85 */  MPFR 96-bits rounded 24-bits.
	ULP: 0.49601495

The '#if 0' code (i.e., the magic-constant branch) gives

	./testf -S -a 1.50121641e+00f
	a         =  1.50121641e+00f,	/* 0x3fc027dc */
	sinpif(a) = -9.99992728e-01f,	/* 0xbf7fff86 */
	sin(pi*a) = -9.99992669e-01f,	/* 0xbf7fff85 */  MPFR 96-bits rounded 24-bits.
	ULP: 0.50398505

> Similar-looking magic for trig reduction in radians uses the magic
> value of 0x1.8p52 to bias rounding errors in a safe way, and arranges
> so that rounding errors don't matter then.  Perhaps you can do the
> same here -- do a fixup of +-UINT32_MAX if necessary.  This seems to
> be no harder than using rint(), round() or floor().  Except for
> floor(), it is hard to control whether the rounded result is too high
> or too low.  Using floor:
>
> 	prec_t n = floor(x);
> 	r = x - n;
>
> but we want to classify quadrants, and still don't have an n that
> will fit in an integer, so first scale by 2 (4 quadrants of size
> 1/2):
>
> 	y = x * 0.5 - floor(x * 0.5);	/* 4 quads in [2n, 2n+2) -> [0, 1) */
> 	y = 4 * y;			/* -> [0, 4) */
> 	int n = y;			/* quadrant */
> 	r = (y - n) / 4;		/* offset in quadrant */
>
> This is a bit slow, and too much like lgamma's method.

I did look over what you did in lgamma[f], but I forgot the logic
behind the reduction into octants.  I suppose that I'm more concerned
with 'make it work', 'make it correct', and then I will/might worry
about 'make it fast'.  But I don't want it to be too slow, so I'm
trying to avoid function calls and reclassification of x as +-inf or
nan.  Heck, r = remquof(ax, vx, &n) should work (a rough sketch is
below), but there's that speed issue.

--
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
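For what it's worth, a rough, untested sketch of the remquof() route
mentioned above, using 1.0F as the divisor (the call to the kernel and
the handling of a negative remainder are guesses here, not code from
the working copy):

	#include <math.h>
	#include <stdint.h>

	float __compute_sinpif(float);	/* the kernel discussed above */

	/*
	 * Sketch: remquof(ax, 1.0F, &n) returns r = ax - n with
	 * |r| <= 1/2 and stores at least the low 3 bits of the rounded
	 * quotient n; only bit 0 (the half-cycle parity) is needed.
	 * sinpi(n + r) = (-1)^n * sinpi(r) and sinpi(-r) = -sinpi(r),
	 * so the sign flips when n is odd or r is negative, but not both.
	 */
	static float
	sinpif_by_remquo(float ax, uint32_t hx)
	{
		float r;
		int n;

		r = remquof(ax, 1.0F, &n);
		if (r == 0)
			ax = 0;			/* sinpi of an integer is 0 */
		else
			ax = __compute_sinpif(fabsf(r));
		if ((n & 1) ^ (r < 0))
			ax = -ax;
		return ((hx & 0x80000000) ? -ax : ax);
	}

The speed issue stands, of course: remquof() is a libm call that does
its own classification of the argument, which is exactly the overhead
being avoided above.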