Date: Fri, 10 Mar 2017 05:02:13 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc: Bruce Evans <brde@optusnet.com.au>, freebsd-numerics@freebsd.org
Subject: Re: Bit twiddling question
Message-ID: <20170310025417.U3723@besplex.bde.org>
In-Reply-To: <20170309152307.GA32781@troutmask.apl.washington.edu>
References: <20170308202417.GA23103@troutmask.apl.washington.edu> <20170309173152.F1269@besplex.bde.org> <20170309075236.GA30199@troutmask.apl.washington.edu> <20170309152307.GA32781@troutmask.apl.washington.edu>
On Thu, 9 Mar 2017, Steve Kargl wrote:

> On Wed, Mar 08, 2017 at 11:52:36PM -0800, Steve Kargl wrote:
>> To give a hint at what I have been working on, I have implementations
>> for sinpif(x), sinpi(x), cospif(x), and cospi(x).  For sinpif(x) and
>> cospif(x) I have kernels that give correctly rounded results for
>> FLT_MIN <= x < 0x1p23 (at least on x86_64).  I also have slightly

I didn't see the original mail.

I'd prefer you to finish the long double versions of cosl(), sinl() and
tanl() (maybe das is responsible for the latter).  These are still too
slow, and there are minor bugs for at least sinl() on small x, partly
from not copying the lower-precision versions enough.

>> faster kernels that give max ULPs of about 0.5008.  For sinpi(x) and

Similar to sinf() and cosf().  The maximum for those is 0.5009, using
double precision calculations, which I now know to be a mistake, and a
polynomial of degree higher than necessary for good enough rounding in
float precision and lower than necessary for perfect rounding in float
precision.

>> cospi(x), my kernels currently give about 1.25 ULP for the worst-case
>> error if I accumulate the polynomials in double precision.  If I

Not too good.  I think the problem is that the coeffs for the lowest
terms are no longer exactly representable.  The linear function
f(x) = a*x is harder to evaluate accurately than exp(x) = 1 + x + ...,
unless a is exactly representable and also has few digits (like 1 or
-0.5 for early coeffs of sin() and cos()).

>> accumulate the polynomials in long double precision, then I get
>> correctly rounded results.  To complete the set, I was hoping to

Don't do that.  It is much slower, and not available starting with long
double precision, or even starting with double precision on arches
without real long doubles.  It is also a mistake to use it for float
precision, as I did for the trig functions and an uncommitted version
of logf().
I managed to make the specialized logf() run 50% faster than the less
specialized log() (also uncommitted), by using double precision and
manually scheduling for Athlon-XP, but modern CPUs and compilers handle
the less specialized log() so well that the specialized logf() is only
a couple of cycles faster.  Similarly, but less pronounced, for the
trig functions.

>> work out ld80 and ld128 versions.  ld128 is going to be problematic
>> due to the absence of int128_t.
>>
>> I'll send you what I have in a few days.
>
> It is sometimes amazing what happens when I sleep.  For sinpi[fl](x)
> and the method I came up with, one needs to do argument reduction
> where x = n + r.  Here n is an integer and 0 < r < 1.  If n is even
> then one is in the positive half-cycle of sin(x) and if n is odd then
> one has the negative half-cycle.  For sinpif(x) this looks like
>
>	if (ix < 0x4b000000) {		/* 1 <= |x| < 0x1p23 */
> /* 1 */	ix = (uint32_t)ax;
> /* 2 */	ax = ax == ix ? zero : __compute_sinpif(ax - ix);
> /* 3 */	if (ix & 1) ax = -ax;
> /* 4 */	return ((hx & 0x80000000) ? -ax : ax);
>	}
>
> Line 1 determines n.  Line 2 computes either sinpi(n) exactly or
> sinpi(r).  Line 3 uses n odd to set the sign for the half-cycle.
> Line 4 is used to set the sign from sin(-x) = -sin(x).
>
> For double and ld80 one can use int64_t instead of uint32_t.  Last
> night I realized that I don't need to use int64_t, and more
> importantly I don't need int128_t.  The argument reduction can use
> uint32_t everywhere.  For double, I currently have
>
>	if (ix < 0x43300000) {		/* 1 <= |x| < 0x1p52 */
>		int64_t n;
>		n = (int64_t)ax;
>		ax = ax == n ? zero : __compute_sinpi(ax - n);
>		if (n & 1) ax = -ax;
>		return ((hx & 0x80000000) ?
>		    -ax : ax);
>	}
>
> which can be written (off the top of my head and not tested)
>
>	if (ix < 0x43300000) {		/* 1 <= |x| < 0x1p52 */
>		double volatile vx;
>		uint32_t n;
>		vx = ax + 0x1p52;
>		vx = vx - 0x1p52;
>		if (vx == ax) return (0);
>		if (vx > UINT32_MAX) {	/* ax = m + n + r with m + n */
>			vx -= UINT32_MAX; /* m = UINT32_MAX.  n is in range */
>			ax -= UINT32_MAX;
>		}
>		n = (uint32_t)vx;
>		ax = __compute_sinpi(ax - n);
>		if (n & 1) ax = -ax;
>		return ((hx & 0x80000000) ? -ax : ax);
>	}
>
> Something similar can be applied to ld128, but reduction may take
> two rounds (i.e., a comparison with UINT64_MAX and then UINT32_MAX).

I don't like that much.  The 0x1p52 magic is tricky and probably buggy.
s_rint.c uses the same magic, but needs many complications to avoid
double rounding.  Similar-looking magic for trig reduction in radians
uses the magic value 0x1.8p52 to bias rounding errors in a safe way,
and arranges so that rounding errors don't matter then.  Perhaps you
can do the same here -- do a fixup of +-UINT32_MAX if necessary.  This
seems to be no harder than using rint(), round() or floor().

Except for floor(), it is hard to control whether the rounded result
is too high or too low.  Using floor():

	prec_t n = floor(x);
	r = x - n;

but we want to classify quadrants, and still don't have an n that will
fit in an integer, so first scale by 2 (4 quadrants of size 1/2):

	y = x * 0.5 - floor(x * 0.5);	/* 4 quads in [2n, 2n+2) -> [0, 1) */
	y = 4 * y;			/* -> [0, 4) */
	int n = y;			/* quadrant */
	r = (y - n) / 4;		/* offset in quadrant */

This is a bit slow, and too much like lgamma's method.  frexp() can be
used instead of floor() to slowly extract the fraction.

I think it can be done using the 0x1p52 magic, a bit like you say.
s_rint.c already uses this magic, but has to work hard to handle errors
from double rounding.  s_floor.c uses it via bit fiddling.
If we replace the slow floor(z) by the 0x1p52 calculation on z, then we
get an approximation to floor(z) that is hopefully an integer (possibly
an integer + 0.5) within 1 or 2 of the correct result.  Adjust it until
it is correct.

Bruce