Date:        Tue, 25 Apr 2017 20:23:50 -0700
From:        Steve Kargl <sgk@troutmask.apl.washington.edu>
To:          Bruce Evans <brde@optusnet.com.au>
Cc:          freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject:     Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170426032350.GA87524@troutmask.apl.washington.edu>
In-Reply-To: <20170425173542.S1401@besplex.bde.org>
References:  <20170409220809.GA25076@troutmask.apl.washington.edu>
             <20170413171248.GA86780@troutmask.apl.washington.edu>
             <20170424232743.GA54621@troutmask.apl.washington.edu>
             <20170425143120.R933@besplex.bde.org>
             <20170425063643.GA45906@troutmask.apl.washington.edu>
             <20170425173542.S1401@besplex.bde.org>
On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
>
> > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> >
> >> [for expl]
> >> This is arranged to look a bit like Horner's method at first (operations
> >> not written in separate statements), but it is basically Estrin's method
> >> with no extensions to more complicated sub-expressions than Estrin's,
> >> but complications for accuracy:
> >> - I think accuracy for the B3 term is unimportant, but for some reason
> >>   (perhaps just the one in the comment) it is grouped separately.  Estrin
> >>   might use x3 * (B3.e + x * B4.e).
> >
> > Yes, it is most likely an accuracy issue.  I found something
> > similar with sinpi/cospi.  My k_sinpi.c has
> >
> > #if 0
> >	double x2;
> >	/* Horner's method. */
> >	x2 = x * x;
> >	sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
> >	    + s2) + s1;
>
> Horner's of course has about as many dependencies as operations (13), so
> on modern x86 this is limited by the latency of 3-5 cycles/instruction,
> or 39-65 cycles.  Multiplications take an extra cycle in more cases, so
> it is better to have more additions, but unfortunately Horner gives 1
> more multiplication.
>
> With no dependencies, the limit would be throughput -- 13 operations at
> 2 instructions/cycle is 6.5 cycles, rounded up to 7; then add latency
> for the last operation to get about 11.
>
> With practical dependencies, it is hard to get more than a 2-fold speedup.
>
> > #else
> >	double x2, x4;
> >	/* Sort of Estrin's method. */
> >	x2 = x * x;
> >	x4 = x2 * x2;
> >	sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
> >	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> > #endif
>
> This looks OK.  The only obvious improvement is to try grouping x4*x4
> so that it can be evaluated separately.  This reduces the length of
> dependency chains by 1, but increases early pressure on multiplication.
> Don't write this as x8 = x4*x4.  Compilers will do the equivalent.
> You can also replace x4 by (x2*x2), but there are many x4's so this
> might be less readable.

The grouping of (x4*x4) and (x2*x4) in other polynomials indeed gives
small improvements.  I've reworked the approximations in sinpi[l],
cospi[l], and tanpi[l].  I did not see any improvement with the float
versions, as the order of the polynomials is fairly small (order of 4).

> > The addition of s1 must occur last, or I get greater than 1 ULP
> > for a number of values.
>
> Indeed.  This prevents using the best order, which is the opposite.
> You used the best order for the other inner terms, but not so obviously
> for the first outer addition.  However, the compiler is free to reorder
> that: consider:
>
>	T2 + T1 + T0
>
> T2 is more complicated and everything depends on it, so should be started
> first, but possibly T1 can be started first because it is simpler.  Here
> it is not significantly simpler.  However, for:
>
>	T3 + T2 + T1 + T0
>
> the natural left-to-right grouping is usually pessimal.  This should be
> written as:
>
>	T3 + (T2 + T1) + T0
>
> Unfortunately, we can't write this naturally and usually optimally as:
>
>	T0 + T1 + T2 + T3
>
> due to the accuracy problem with T0.  Maybe write this as:
>
>	T1 + T2 + T3 + T0	// only T0 in unnatural order
>
> For expl, I tried all orders.

I tried a few different orders, particularly with the long double
functions.  I also managed to break the code a few times.  Thankfully,
I have everything in a subversion repository.

> > My timing on an Intel core2 duo shows
> >
> > /* Horner's method. */
> > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> >    sinpi time: 0.066 usec per call
> >    cospi time: 0.063 usec per call
> >
> > /* Estrin's method. */
> > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> >    sinpi time: 0.059 usec per call
> >    cospi time: 0.055 usec per call
> >
> > Here, -n 22 means to test 2**22 - 1 values in the interval
> > [0,0.25].
>
> The improvement is marginal.  Only worth doing if it doesn't complicate
> the code much.  Older CPUs can't exploit the parallelism so well, so I
> would expect a relatively larger improvement.  Don't measure in usec,
> since they are hard to scale.  0.66 usec at 1.86 GHz is more than 100
> cycles.  This is quite slow.  I get similar times for cos and sin on an
> older Athlon-XP (~40 in float prec, ~160 in double, and ~270 in long
> double).  Newer CPUs are about 4 times faster in double prec (only 2
> times as fast in float prec) in cycles despite having similar nominal
> timing, by unclogging their pipes.

I'm assuming your 0.66 in the above is a typo, as I wrote 0.066.

I also got some improvement by changing the splitting of a few entities
into high and low parts with a simple cast.  Instead of doing

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

I do

	xhi = (float)x;
	xlo = x - xhi;

but this seems to only help in some situations.
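For reference, the two splittings can be compared in isolation.  The
sketch below is mine and untested against the actual kernels; it uses
memcpy instead of the fdlibm EXTRACT_WORDS/INSERT_WORDS macros so that
it compiles stand-alone.  Both variants reconstruct x exactly as
xhi + xlo:

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/*
	 * Mask-based split: zero the low 32 bits of the representation,
	 * leaving a 21-bit significand in xhi.  xlo = x - xhi is exact.
	 */
	static void
	split_words(double x, double *xhi, double *xlo)
	{
		uint64_t bits;

		memcpy(&bits, &x, sizeof(bits));
		bits &= 0xffffffff00000000ULL;	/* clear low word */
		memcpy(xhi, &bits, sizeof(*xhi));
		*xlo = x - *xhi;
	}

	/*
	 * Cast-based split: round to a 24-bit significand.  For |x|
	 * well inside float range the remainder is exactly representable.
	 */
	static void
	split_cast(double x, double *xhi, double *xlo)
	{
		*xhi = (float)x;
		*xlo = x - *xhi;
	}

	int
	main(void)
	{
		double h, l, x = 2.4220192246482908e-01;

		split_words(x, &h, &l);
		printf("words: %a + %a, exact: %d\n", h, l, h + l == x);
		split_cast(x, &h, &l);
		printf("cast:  %a + %a, exact: %d\n", h, l, h + l == x);
		return (0);
	}

The two splits differ in where they cut the significand (21 bits kept
by the mask, 24 by the cast), which may explain why the cast only helps
in some kernels.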
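To make the Horner/Estrin comparison behind the timings below concrete,
here are the two evaluations of the same degree-6 polynomial in x2 as
stand-alone functions.  The coefficients are placeholders (Taylor-like
values, for illustration only), not the actual sinpi coefficients, which
I have not included in this message:

	#include <stdio.h>

	/* Placeholder coefficients -- NOT the sinpi polynomial. */
	static const double
	    s1 =  1.0,
	    s2 = -1.0 / 6,
	    s3 =  1.0 / 120,
	    s4 = -1.0 / 5040,
	    s5 =  1.0 / 362880,
	    s6 = -1.0 / 39916800,
	    s7 =  1.0 / 6227020800.0;

	/* Horner: every operation depends on the previous one. */
	static double
	poly_horner(double x2)
	{

		return (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5)
		    + s4) + s3) + s2) + s1);
	}

	/*
	 * Estrin-ish: the three groups are independent, so a superscalar
	 * CPU can overlap them.  Same polynomial, shorter dependency chain.
	 */
	static double
	poly_estrin(double x2)
	{
		double x4 = x2 * x2;

		return ((s5 + s6 * x2 + s7 * x4) * (x4 * x4)
		    + (s2 + s3 * x2 + s4 * x4) * x2 + s1);
	}

	int
	main(void)
	{
		double x2 = 0.0625;	/* x = 0.25 */

		printf("horner: %a\nestrin: %a\n",
		    poly_horner(x2), poly_estrin(x2));
		return (0);
	}

Up to rounding, the two functions return the same value; only the
evaluation order differs.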
My new timing for x in [0,0.25] on my 1995.05 MHz Core2 duo gives

Horner
   sinpi time: 0.073 usec per call (~146 cycles)
   cospi time: 0.070 usec per call (~140)
   tanpi time: 0.112 usec per call (~224)

Estrin
   sinpi time: 0.064 usec per call (~128)
   cospi time: 0.060 usec per call (~120)
   tanpi time: 0.088 usec per call (~176)

tanpi benefits the most, as it has an order 16 polynomial, which allows
some freedom in grouping terms.

I don't think I can do much better.  For example, the approximation for
sinpi(x) is

	sinpi(x) = x * (s0 + x2 * S(x2))

where S(x2) is the polynomial; I need to split x2 into high and low
parts, and s0 is split.  The kernel with the Estrin method and with the
polynomial coefficients is

static inline double
__kernel_sinpi(double x)
{
	double hi, lo, sm, xhi, xlo;
	uint32_t hx, lx;
	double x2, x4;

	/* Sort of Estrin's method. */
	x2 = x * x;
	x4 = x2 * x2;
	sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;

	/* Split x into a 21-bit high part and its exact remainder. */
	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

	/* x2 * sm, with x2 split as xhi*xhi + xlo*(x + xhi). */
	lo = xlo * (x + xhi) * sm;
	hi = xhi * xhi * sm;

	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;

	return (lo + xhi * s0lo + xhi * s0hi);
}

The order of terms in the last four statements matters.  Testing yields

           a = 2.4220192246482908e-01,	/* 0x3fcf0078, 0xfc01e3f0 */
    sinpi(a) = 6.8957335930938313e-01,	/* 0x3fe610fc, 0x264da72e */
   sin(pi*a) = 6.8957335930938324e-01,	/* 0x3fe610fc, 0x264da72f */
         ULP: 0.79950446
     MAX ULP: 0.79950446

Total tested: 8388606
 1.0 < ULP       : 0
 0.9 < ULP <= 1.0: 0
 0.8 < ULP <= 0.9: 0
 0.7 < ULP <= 0.8: 554
 0.6 < ULP <= 0.7: 11275

> We should also try to use the expression A*x + B more often so that
> things will be faster when this is done by fma, but this happens
> naturally anyway.

Using A*x+B or B+A*x did not seem to matter.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow