Date: Tue, 16 May 2017 21:20:23 -0700 From: Steve Kargl <sgk@troutmask.apl.washington.edu> To: Bruce Evans <brde@optusnet.com.au> Cc: freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170517042023.GA43647@troutmask.apl.washington.edu> In-Reply-To: <20170517094848.A52247@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> <20170516224618.GA40855@troutmask.apl.washington.edu> <20170517094848.A52247@besplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, May 17, 2017 at 12:49:45PM +1000, Bruce Evans wrote: > On Tue, 16 May 2017, Steve Kargl wrote: > > Index: lib/msun/ld128/s_cospil.c > > +long double > > +cospil(long double x) > > +{ > > + volatile static const double vzero = 0; > > + long double ax, c, xf; > > + uint32_t ix; > > + > > + ax = fabsl(x); > > + > > Style (lots of extra blank lines). If there are only style issues, then we are fortunate as I have not even compiled the code. > > + ax -= xf; > > + > > These extra blank lines are especially bad, since they separate related > code. Extra blank lines make things easier to read. > > > + * FIXME: This should use a polynomial approximation in [0,0.25], but the > > + * FIXME: increase in max ULP is probably small, so who cares. > > Doesn't it already use the standard kernel, which uses a poly approx? Whoops forgot to remove the unneeded comment. > > + volatile static const double vzero = 0; > > + long double ax, xf; > > + uint32_t ix; > > + > > + ax = fabsl(ax); > > + > > + if (ax < 1) { > > + if (ax < 0.5) { > > + if (ax < 0x1p-60) { > > + if (x == 0) > > + return (x); > > + ax = (double)x; > > Style (unnecessary cast). Huh? ax is long double. x is long double. The cast should remove 11 bits. > > + x -= ax; > > + t = pilo * x + pihi * x + pilo * ax > > + + pihi * ax; > > Style: > - unnecessary line splitting The is line is 80 characters without splitting (and even longer if one mandates pi_lo and pi_hi). I use an 80 character wide and wrap lines that are 79+ characters. > > Index: lib/msun/ld80/s_cospil.c > > ... > > + if (ix < 0x3ffe) /* |x| < 0.5 */ > > + c = __kernel_sinpil(0.5 - ax); > > + else if (lx < 0xc000000000000000ull) { /* |x| < 0.75 */ > > Style bug (spelling of the long long abomination using "ull"). The > explicit abomination is unfortunately needed by old versions of gcc > with CFLAGS asking for nearly C90. Whoops that one may have slipped through. > > + if (ax == 0.5) > > + RETURNI(0); > > Perhaps pessimal to not test for 0.5 in bits. > > It is pessimal to avoid generating inexact for 0.5 at all. I don't see > a much better way. I tried using bits and floating point. Saw no difference. cospi(n.5) is exact (see n1950.pdf). See comment in s_cospi.c that documents special cases. > > + } > > + INSERT_LDBL80_WORDS(x, ix, lx); > > + > > + ax -= x; > > + EXTRACT_LDBL80_WORDS(ix, lx, ax); > > I don't know if this is a good method. INSERT/EXTRACT are slow, so I > doubt that it is. In e_lgamma*.c:sin_pi*(), we use only a single GET > of the high word. This function already used EXTRACT/INSERT to classify, > and 2 more here. (_e_lgamma*.c used GET to classify before calling > sin_pi*(), then 1 more GET.) INSERT is especially slow. > > Care is needed near the 0x1p63 boundary, since the fast methods require > adding approx 0x1p63. At or above the boundary, these methods don't > obviously work. Perhaps they give the correct result starting at 2 times > the boundary, except for setting inexact. Ugh. Do you avoid the fast > methods since they set inexact for most half-integers? That costs more > than avoiding inexact for 0.5. n1950.pdf says cospi(n.5) is exact. Perhaps, the __kernel_cos() and __kernel_sin() provide an exact result. I did not investigate. > tanpi(0.25 is exact too, so quarter-integers need special treatment too > if you really want to avoid all spurious inexacts. It does. > I now see that some of the above complications are from inlining floor(). Yes, it is an inline of floor(). > > +/* > > + * The basic kernel for x in [0,0.25]. To exploit the kernel for cos(x), > > + * the argument to __kernel_cospi() must be multiply by pi. As pi is an > > + * irrational number, this allows the splitting of pi*x into high and low > > + * parts. > > + */ > > Allows? x = n + r. For double, r has 53 bits. There are no trailing low bits from the argument reduction. pi is irrational, and therefore not exactly representable by 53 bits (or any number of bits). __kernel_cospi() uses __kernel_cos(), and the low bits come about because of pi. I split x into high in low. So, we have lo = pilo * xlo + pilo * xhi + pihi * xlo; /* all the low bits */ hi = pihi * xhi; /* Exact */ _2sumF(hi, lo); /* Maybe superfluous? */ > > + > > +static inline double > > +__kernel_cospi(double x) > > +{ > > + double hi, lo; > > + hi = (float)x; > > _2sumF is extensively documented to not work with double. It requires > double_t here. It seems to work base on testing hundreds of millions of values. :-) > > +#define INLINE_KERNEL_SINDF > > +#define INLINE_KERNEL_COSDF > > Prossibly not worth inlining. > > > +#define __kernel_cospif(x) ((float)__kernel_cosdf(M_PI * (x))) > > +#define __kernel_sinpif(x) ((float)__kernel_sindf(M_PI * (x))) > > Style bugs (not even corrupt tabs, but 2 spaces after #define and 3 spaces > instead of a second tab). Ugh. My bad. I have nedit set to do 3-space tabs. I forgot to reset to use hard tabs. > Bogus casts. The kernels already return float, although that is pessimal > and destroys their the xtra precision that they naturally have. Exhaustive testing shows max ULP < 0.500xyyy where might be 0, but can't recall now. How exact to you want these? > > Index: lib/msun/src/s_sinpi.c > > ... > > +double > > +sinpi(double x) > > +{ > > + volatile static const double vzero = 0; > > + double ax, s; > > + uint32_t hx, ix, j0, lx; > > + > > + EXTRACT_WORDS(hx, lx, x); > > + ix = hx & 0x7fffffff; > > + INSERT_WORDS(ax, ix, lx); > > A slow way to set ax to fabs(x). Make it work. Make it correct. Make it fast. I've undertaken the first two. > This is otherwise much slower than sin_pi() for lgamma(). That started > with just a GET of hx, while this EXTRACTs 2 words, like floor(), apparently > to avoid spurious inexact. > > Note that lgamma(0.5) is naturally inexact. lgamma(integer) should be exact, > but I doubt that it is. Maybe we broke it. See below. > > + if (ix < 0x43300000) { /* 1 <= |x| < 0x1p52 */ > > + /* Determine integer part of ax. */ > > + j0 = ((ix >> 20) & 0x7ff) - 0x3ff; > > + if (j0 < 20) { > > + ix &= ~(0x000fffff >> j0); > > + lx = 0; > > + } else { > > + lx &= ~((uint32_t)0xffffffff >> (j0 - 20)); > > + } > > + INSERT_WORDS(x, ix, lx); > > + > > + ax -= x; > > + EXTRACT_WORDS(ix, lx, ax); > > Much slower than what lgamma()s sin_pi() does. After our optimizations, > that does basically rint() inline using ax + 0x1p52 - 0x1p52, I found a situation where this trick seems to round in the wrong direction. You participated in the discussion. https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000658.html https://lists.freebsd.org/pipermail/freebsd-numerics/2017-March/000660.html > then a > further += 0x1p50 to get octants instead of half-cycles. This takes > approx 16 cycles latency on modern x86, which is lot. I think and64 can > do the above in 10-20 cycles, while i386 might take 20-40 or much > longer with store-to-load mismatch penalties. > > I think we can avoid the spurious inexacts with a little care. When > ax is an integer (and < 0x1p52), we can add at least 0x1p51 to it exactly. > We also want to avoid inexact for half-integers for sinpi and cospi, and > quarter-integers for tanpi, and it is good to reduce to quadrants anyway. > So we might intitially multiply by 4 and do the += 0x1p52 trick only if > 4*ax < 0x1p52 (or whatever the safe threshold is). Efficiency doesn't > matter for larger ax, so slow methods using things like floor(16*ax) > are adequate, or we can reduce by a few powers of 2 using special cases: > > v = 4 * ax; > if (v >= 0x1p54) > ; /* ax is an even integer (check factor) / > else { > if (v >= 0x1p53) > v -= 0x1p53; > if (v >= 0x1p52) > v -= 0x1p52; > } I'll see if I can understand this and modify waht I have. Unfortunately, I'm running out of time to work on this. > The result is always +0, exact and valid (maybe -0 in unsupported > unusual rounding modes). We just did a lot of work to avoid inexact. Not according to n1950.pdf. It states sinpi(+-n) = +-0 for positive integers. > > > + if (ax + 1 > 1) > > + ax = copysign(0, x); > > An even slower way of setting the sign than INSERT, especially if > the compiler doesn't inline copysign(). Using the correct +0 fixes > this. |x| > 0x1p(P-1) is (should be) exceptional. Who cares if it's fast. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170517042023.GA43647>