Date:        Wed, 17 May 2017 12:49:45 +1000 (EST)
From:        Bruce Evans <brde@optusnet.com.au>
To:          Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc:          freebsd-numerics@freebsd.org
Subject:     Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170517094848.A52247@besplex.bde.org>
In-Reply-To: <20170516224618.GA40855@troutmask.apl.washington.edu>
References:  <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu>
On Tue, 16 May 2017, Steve Kargl wrote:

> ...
> I have removed my new kernels and written shims to
> use the fdlibm trig kernels.  I know that in some
> cases this leads to slower code, but a small (if not
> tiny) improvement in accuracy.  This also leads to
> a significant reduction in code size.  A couple of
> notes:

Still quite large.

> Index: lib/msun/ld128/k_cospil.c

This is a header file, not a .c file or even a self-sufficient kernel,
since it only compiles when included by C files which declare the
things that it uses.  My kernels for exp are header files and are also
self-sufficient (except they only define static functions and objects,
so are useless unless they are included).

> +static inline long double
> +__kernel_cospil(long double x)
> +{
> +        long double hi, lo;
> +        hi = (double)x;

Style (missing blank line after declarations).

> Index: lib/msun/ld128/k_sinpil.c
> ...
> +        long double hi, lo;
> +        hi = (double)x;

Style.

> Index: lib/msun/ld128/s_cospil.c
> ...
> +static const long double
> +pihi = 3.14159265358979322702026593105983920e+00L,
> +pilo = 1.14423774522196636802434264184180742e-17L;

These are not in normal format, and are hard to read.  I can't see if
pihi has the correct number of zero bits for exact multiplication.

These don't have the normal spelling.  fdlibm never uses pihi or
pio2hi, or pi_hi.  It often uses pio2_hi and other pio2_*.  My
s_sinpi.c uses pi_hi.

> +long double
> +cospil(long double x)
> +{
> +        volatile static const double vzero = 0;
> +        long double ax, c, xf;
> +        uint32_t ix;
> +
> +        ax = fabsl(x);
> +

Style (lots of extra blank lines).

> +        if (ax < 0x1p112) {
> +
> +                xf = floorl(ax);
> +
> +                ax -= xf;
> +

These extra blank lines are especially bad, since they separate
related code.

> +                if (xf > 0x1p50) xf -= 0x1p50;
> +                if (xf > 0x1p30) xf -= 0x1p30;

Style.

> Index: lib/msun/ld128/s_sinpil.c

Similarly.

> Index: lib/msun/ld128/s_tanpil.c

Similarly.

> + * FIXME: This should use a polynomial approximation in [0,0.25], but the
> + * FIXME: increase in max ULP is probably small, so who cares.

Doesn't it already use the standard kernel, which uses a poly approx?

Splitting up [0, Pi/4] doesn't work very well for reducing the degree
of the poly.  The current poly has degree about 56 (maybe 52 or 54),
and I couldn't reduce it to below 40 using a reasonably large interval.
I think it can be reduced to degree about 10 using intervals of length
1/128 as for logl(), but unlike for logl() this obvious way of doing
this needs about 128 different polys.

> +static inline long double
> +__kernel_tanpi(long double x)
> +{
> +        long double hi, lo, t;
> +
> +        if (x < 0.25) {
> +                hi = (double)x;
> +                lo = x - hi;
> +                lo = lo * (pilo + pihi) + hi * pilo;
> +                hi *= pihi;
> +                _2sumF(hi, lo);
> +                t = __kernel_tanl(hi, lo, -1);

Oops, I forgot that this 0.25 is really Pi/4 when scaled.  It is
impossible to use 1 poly over the whole range, since the poly would
need to have even higher degree than 50+, and would still have bad
numeric properties, so __kernel_tan*() uses special methods.
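To make the extra-precision multiplication by Pi concrete, here is a
minimal double-precision sketch (the names, the float-based split and
the derived constants are mine for illustration; the patch's ld128
kernel uses a wider split and different constants):

#include <math.h>

/*
 * pi_hi gets the high 24 bits of Pi, so hi * pi_hi below is a
 * 24-bit * 24-bit product that is exact in 53-bit double precision.
 * pi_lo is exactly M_PI - pi_hi, so pi_lo + pi_hi rounds to M_PI.
 */
static const double pi_hi = (float)M_PI;
static const double pi_lo = M_PI - (float)M_PI;

static double
mul_pi_extra(double x)          /* assumes x in [0, 0.25] */
{
        double hi, lo;

        hi = (float)x;          /* high 24 bits of x */
        lo = x - hi;            /* exact: the remaining low bits */
        lo = lo * (pi_lo + pi_hi) + hi * pi_lo;
        hi *= pi_hi;            /* exact product of the high parts */
        return (hi + lo);       /* Pi * x, rounded once at the end */
}

The real kernels keep hi and lo separate (via _2sumF) and pass both
to __kernel_tanl() etc. instead of adding them, which is what
preserves the extra precision.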
> +long double
> +tanpil(long double x)
> +{
> +        volatile static const double vzero = 0;
> +        long double ax, xf;
> +        uint32_t ix;
> +
> +        ax = fabsl(ax);
> +
> +        if (ax < 1) {
> +                if (ax < 0.5) {
> +                        if (ax < 0x1p-60) {
> +                                if (x == 0)
> +                                        return (x);
> +                                ax = (double)x;

Style (unnecessary cast).

> +                                x -= ax;
> +                                t = pilo * x + pihi * x + pilo * ax
> +                                        + pihi * ax;

Style:
- unnecessary line splitting
- gnu style for placement of the operator on a new line.  If not
  following fdlibm style, use strict KNF.

> +                                return (t);
> +                        }
> +                        t = __kernel_tanpil(ax);

For most long double functions, we put everything except the kernels
in the same files, with ifdefs for LDBL_MANT_DIG.  Here even the hi+lo
decomposition and multiplication by Pi can be in a kernel.

The style bugs are especially painful to read when sextuplicated.
(The ld80 variants are quite different due to being better optimized,
but this is a different style bug.)

> Index: lib/msun/ld80/k_cospil.c

No more comments on sextuplicated style bugs.

> Index: lib/msun/ld80/s_cospil.c
> ...
> +        if (ix < 0x3ffe)                        /* |x| < 0.5 */
> +                c = __kernel_sinpil(0.5 - ax);
> +        else if (lx < 0xc000000000000000ull) {  /* |x| < 0.75 */

Style bug (spelling of the long long abomination using "ull").  The
explicit abomination is unfortunately needed by old versions of gcc
with CFLAGS asking for nearly C90.

> +                if (ax == 0.5)
> +                        RETURNI(0);

Perhaps pessimal to not test for 0.5 in bits.  It is pessimal to avoid
generating inexact for 0.5 at all.  I don't see a much better way.

> +        if (ix < 0x403e) {              /* 1 <= |x| < 0x1p63 */
> +                /* Determine integer part of ax. */
> +                j0 = ix - 0x3fff + 1;
> +                if (j0 < 32) {
> +                        lx = (lx >> 32) << 32;
> +                        lx &= ~(((lx << 32)-1) >> j0);
> +                } else {
> +                        uint64_t m = (uint64_t)-1 >> (j0 + 1);
> +                        if (lx & m) lx &= ~m;

Style:
- nested declaration
- initialization in declaration
- half the code in the initialization
- no newline for statement after 'if'

> +                }
> +                INSERT_LDBL80_WORDS(x, ix, lx);
> +
> +                ax -= x;
> +                EXTRACT_LDBL80_WORDS(ix, lx, ax);

I don't know if this is a good method.  INSERT/EXTRACT are slow, so I
doubt that it is.  In e_lgamma*.c:sin_pi*(), we use only a single GET
of the high word.  This function already used EXTRACT/INSERT to
classify, and 2 more here.  (e_lgamma*.c used GET to classify before
calling sin_pi*(), then 1 more GET.)  INSERT is especially slow.

Care is needed near the 0x1p63 boundary, since the fast methods
require adding approx 0x1p63.  At or above the boundary, these methods
don't obviously work.  Perhaps they give the correct result starting
at 2 times the boundary, except for setting inexact.  Ugh.  Do you
avoid the fast methods since they set inexact for most half-integers?
That costs more than avoiding inexact for 0.5.  tanpi(0.25) is exact
too, so quarter-integers need special treatment too if you really want
to avoid all spurious inexacts.

I now see that some of the above complications are from inlining
floor().  floor() can't use the fast method since it is careful about
inexact, and the spec may require this.  It actually sets inexact
whenever the result is not an integer.  So to use it for sinpi()
without getting spurious inexact, the arg must be multiplied by 2
first, and for tanpi(), multiply by 4 first.  This is excessive.
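To make the multiply-before-floor() idea concrete, a minimal sketch
for sinpi() in double precision (the name and the range assumption
are mine):

#include <math.h>

/*
 * Reduce ax = |x| to [0, 0.5) without a spurious inexact at
 * half-integers, where sinpi() is exact (0 or +-1).  floor() raises
 * inexact whenever its result differs from its arg, so apply it to
 * 2 * ax (an exact scaling) instead of ax.  For tanpi(), scale by 4,
 * since quarter-integers are exact there.
 */
static double
reduce_half(double ax)          /* assumes 0 <= ax < 0x1p52 */
{
        double n;

        n = floor(2 * ax);      /* no inexact when 2 * ax is an integer */
        return (ax - 0.5 * n);  /* exact; result is in [0, 0.5) */
}

The caller still needs the low bits of n to select the quadrant and
the sign; that part is omitted here.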
> Index: lib/msun/ld80/s_sinpil.c

Similarly.

> Index: lib/msun/ld80/s_tanpil.c

Similarly.

> Index: lib/msun/src/k_cospi.c
> ...
> +/*
> + * The basic kernel for x in [0,0.25].  To exploit the kernel for cos(x),
> + * the argument to __kernel_cospi() must be multiply by pi.  As pi is an
> + * irrational number, this allows the splitting of pi*x into high and low
> + * parts.
> + */

Allows?

> +
> +static inline double
> +__kernel_cospi(double x)
> +{
> +        double hi, lo;
> +        hi = (float)x;

_2sumF is extensively documented to not work with double.  It requires
double_t here.

Many style bugs are actually duodecimally (?) tuplicated (3 copies for
ld128, 3 for ld80, 3 for double, 3 for float).
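For reference, _2sumF is essentially Dekker's Fast2Sum.  A sketch of
the algorithm (mine, not msun's actual macro) shows why the evaluation
type matters:

#include <math.h>

/*
 * For |a| >= |b|, compute the rounded sum s = a + b and its rounding
 * error t exactly.  The algebra is only valid when each operation
 * rounds to the precision of the operands, which is why double_t
 * (the evaluation type) is required: on i387, plain double
 * expressions are evaluated in extra precision, so the "error"
 * computed here would be the error of a long double sum, not of the
 * double sum that is stored.
 */
static void
fast2sum(double_t a, double_t b, double_t *s, double_t *t)
{
        double_t w;

        w = a + b;              /* rounded sum */
        *t = b - (w - a);       /* exact rounding error of w */
        *s = w;
}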
> Index: lib/msun/src/k_sinpi.c

Similarly.

> Index: lib/msun/src/math.h
> ===================================================================
> --- lib/msun/src/math.h (revision 318383)
> +++ lib/msun/src/math.h (working copy)
> @@ -500,6 +500,15 @@
>
>  #if __BSD_VISIBLE
>  long double lgammal_r(long double, int *);
> +double cospi(double);
> +float cospif(float);
> +long double cospil(long double);
> +double sinpi(double);
> +float sinpif(float);
> +long double sinpil(long double);
> +double tanpi(double);
> +float tanpif(float);
> +long double tanpil(long double);
>  #endif

Should be sorted into the double, float and long double sections.

> Index: lib/msun/src/s_cospi.c
> +/**
> + * cospi(x) computes cos(pi*x) without multiplication by pi (almost).  First,

You mean with multiplication by pi in extra precision.  See s_sinpi.c
for more.

> Index: lib/msun/src/s_cospif.c

Similarly.

> Index: lib/msun/src/s_cospif.c
> +#define  INLINE_KERNEL_SINDF
> +#define  INLINE_KERNEL_COSDF

Possibly not worth inlining.

> +#define  __kernel_cospif(x)   ((float)__kernel_cosdf(M_PI * (x)))
> +#define  __kernel_sinpif(x)   ((float)__kernel_sindf(M_PI * (x)))

Style bugs (not even corrupt tabs, but 2 spaces after #define and 3
spaces instead of a second tab).

Bogus casts.  The kernels already return float, although that is
pessimal and destroys the extra precision that they naturally have.

The cost of avoiding inexact is clearer here.  Returning the
non-kernel sin(M_PI * (x)), etc., with no classification would
probably be faster than the slow reduction done here (except for
large x).  See s_sinpi.c for more.

> Index: lib/msun/src/s_sinpi.c
> ...
> +double
> +sinpi(double x)
> +{
> +        volatile static const double vzero = 0;
> +        double ax, s;
> +        uint32_t hx, ix, j0, lx;
> +
> +        EXTRACT_WORDS(hx, lx, x);
> +        ix = hx & 0x7fffffff;
> +        INSERT_WORDS(ax, ix, lx);

A slow way to set ax to fabs(x).

This is otherwise much slower than sin_pi() for lgamma().  That
started with just a GET of hx, while this EXTRACTs 2 words, like
floor(), apparently to avoid spurious inexact.  Note that lgamma(0.5)
is naturally inexact.  lgamma(integer) should be exact, but I doubt
that it is.  Maybe we broke it.

> +
> +        if (ix < 0x3ff00000) {                  /* |x| < 1 */
> +                if (ix < 0x3fd00000) {          /* |x| < 0.25 */
> +                        if (ix < 0x3e200000) {  /* |x| < 0x1p-29 */
> +                                if (x == 0)
> +                                        return (x);
> +                                INSERT_WORDS(ax, hx, 0);
> +                                x -= ax;

Flipping minus signs to reduce the number of cases for odd functions
is a bad way to do things.  I speeded up several functions by 4-10
cycles (5-20%) by avoiding such flipping.  Here there is a dependency
on ax and a slower than usual calculation of it.  (On modern x86, fabs
would take ~2 cycles latency and ~1/2 cycle throughput, while going
through these extractions and insertions might take 10 cycles latency
at best, and 50 or more with store-to-load mismatch penalties on not
so modern x86.)

Apart from being slower, sign flipping gives different rounding in
non-default modes.  C99 says that sin() is odd, so it may require the
flipping to break the rounding modes intentionally.  If so, then the
bug is in C99.  I believe that it intentionally doesn't require
avoiding spurious inexact because this would be onerous.  It also
doesn't require sin() to be rounded in the current rounding mode.
Requiring sin() to be perfectly odd is similarly onerous.

On Haswell, fdlibm floor() takes about 33 cycles and fdlibm sin()
about 44 cycles below 2*Pi, so any sinpi() that uses floor() or its
internals is going to be about twice as slow as sin(), because its
classification is too slow.

> +        if (ix < 0x43300000) {          /* 1 <= |x| < 0x1p52 */
> +                /* Determine integer part of ax. */
> +                j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
> +                if (j0 < 20) {
> +                        ix &= ~(0x000fffff >> j0);
> +                        lx = 0;
> +                } else {
> +                        lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
> +                }
> +                INSERT_WORDS(x, ix, lx);
> +
> +                ax -= x;
> +                EXTRACT_WORDS(ix, lx, ax);

Much slower than what lgamma()'s sin_pi() does.  After our
optimizations, that does basically rint() inline using
ax + 0x1p52 - 0x1p52, then a further += 0x1p50 to get octants instead
of half-cycles.  This takes approx 16 cycles latency on modern x86,
which is a lot.  I think amd64 can do the above in 10-20 cycles, while
i386 might take 20-40, or much longer with store-to-load mismatch
penalties.

I think we can avoid the spurious inexacts with a little care.  When
ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it
exactly.  We also want to avoid inexact for half-integers for sinpi
and cospi, and quarter-integers for tanpi, and it is good to reduce to
quadrants anyway.  So we might initially multiply by 4 and do the
+= 0x1p52 trick only if 4*ax < 0x1p52 (or whatever the safe threshold
is).  Efficiency doesn't matter for larger ax, so slow methods using
things like floor(16*ax) are adequate, or we can reduce by a few
powers of 2 using special cases:

        v = 4 * ax;
        if (v >= 0x1p54)
                ;       /* ax is an even integer (check factor) */
        else {
                if (v >= 0x1p53)
                        v -= 0x1p53;
                if (v >= 0x1p52)
                        v -= 0x1p52;
        }

I will finish this for lgamma*().  The fdlibm version does
multiplications by 2 and we seem to have removed a bit too much of
that method.
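A sketch of that inline-rint classification for tanpi() (assumptions
mine: round-to-nearest mode, 4 * ax < 0x1p52, and no extra evaluation
precision -- on i387 the additions would need STRICT_ASSIGN or
volatile to force rounding to double):

/*
 * Adding and subtracting 0x1p52 rounds a nonnegative double below
 * 0x1p52 to the nearest integer without a libcall, and raises no
 * inexact when the value is already an integer.  Scaling by 4 first
 * keeps quarter-integers exact, so tanpi()'s exact cases stay exact.
 */
static double
trunc_quarters(double ax)       /* assumes 4 * ax < 0x1p52 */
{
        double v;

        v = 4 * ax;                     /* exact power-of-2 scaling */
        v = v + 0x1p52 - 0x1p52;        /* round to nearest integer */
        if (v > 4 * ax)                 /* turn rint into floor */
                v -= 1;
        return (0.25 * v);              /* quarter-integer part of ax */
}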
> +        /*
> +         * |x| >= 0x1p52 is always an integer, so return +-0.
> +         * FIXME: should this raise FE_INEXACT or FE_INVALID?
> +         */

The result is always +0, exact and valid (maybe -0 in unsupported
unusual rounding modes).  We just did a lot of work to avoid inexact.

> +        if (ax + 1 > 1)
> +                ax = copysign(0, x);

An even slower way of setting the sign than INSERT, especially if the
compiler doesn't inline copysign().  Using the correct +0 fixes this.
We might have classified slightly smaller values as odd integers and
have to select the sign for them.  This is done differently now.

> Index: lib/msun/src/s_sinpif.c
> ...
> Index: lib/msun/src/s_tanpi.c
> ...
> Index: lib/msun/src/s_tanpif.c
> ...

Similarly.

Bruce