From: Steve Kargl <sgk@troutmask.apl.washington.edu>
Date: Tue, 25 Apr 2017 20:23:50 -0700
To: Bruce Evans
Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions

On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
>
> > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> >
> >> [for expl]
> >> This is arranged to look a bit like Horner's method at first (operations
> >> not written in separate statements), but it is basically Estrin's method
> >> with no extensions to more complicated sub-expressions than Estrin's,
> >> but complications for accuracy:
> >> - I think accuracy for the B3 term is unimportant, but for some reason
> >>   (perhaps just the one in the comment) it is grouped separately.  Estrin
> >>   might use x3 * (B3.e + x * B4.e).
> >
> > Yes, it is most likely an accuracy issue.  I found something
> > similar with sinpi/cospi.  My k_sinpi.c has
> >
> > #if 0
> > 	double x2;
> > 	/* Horner's method. */
> > 	x2 = x * x;
> > 	sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
> > 	    + s2) + s1;
>
> Horner's of course has about as many dependencies as operations (13), so
> on modern x86 this is limited by the latency of 3-5 cycles/instruction,
> or 39-65 cycles.  Multiplications take an extra cycle in more cases, so
> it is better to have more additions, but unfortunately Horner gives 1
> more multiplication.
>
> With no dependencies, the limit would be throughput -- 2 instructions/cycle,
> or 6.5 cycles rounded up to 7; then add latency for the last operation to
> get about 11.
>
> With practical dependencies, it is hard to get more than a 2-fold speedup.
>
> > #else
> > 	double x2, x4;
> > 	/* Sort of Estrin's method. */
> > 	x2 = x * x;
> > 	x4 = x2 * x2;
> > 	sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
> > 	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> > #endif
>
> This looks OK.  The only obvious improvement is to try grouping x4*x4
> so that it can be evaluated separately.  This reduces the length of
> dependency chains by 1, but increases early pressure on multiplication.
> Don't write this as x8 = x4*x4.  Compilers will do the equivalent.
> You can also replace x4 by (x2*x2), but there are many x4's, so this
> might be less readable.

The grouping of (x4*x4) and (x2*x4) in other polynomials indeed gives
small improvements.  I've reworked the approximations in sinpi[l],
cospi[l], and tanpi[l].  I did not see any improvement with the float
versions, as the order of the polynomials is fairly small (order 4).
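To make the comparison concrete, here is a standalone sketch of the
two schemes.  The coefficients are truncated Taylor coefficients of
sin(pi*x), not the minimax set from my k_sinpi.c, so this illustrates
the expression shapes rather than the final accuracy:

/* cc horner_estrin.c -lm */
#include <math.h>
#include <stdio.h>

/*
 * sin(pi*x) ~= x * (pi + x2 * S(x2)) with x2 = x*x and
 * s_k = (-1)^k * pi^(2k+1) / (2k+1)!.  Truncated Taylor
 * coefficients; illustration only.
 */
static const double
s1 = -5.16771278e+00,		/* -pi^3/3! */
s2 =  2.55016404e+00,		/*  pi^5/5! */
s3 = -5.99264529e-01,		/* -pi^7/7! */
s4 =  8.21458866e-02,		/*  pi^9/9! */
s5 = -7.37043095e-03,		/* -pi^11/11! */
s6 =  4.66302806e-04,		/*  pi^13/13! */
s7 = -2.19153534e-05;		/* -pi^15/15! */

/* Horner: one dependency chain; every operation waits on the last. */
static double
S_horner(double x2)
{
	return (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4)
	    + s3) + s2) + s1);
}

/* Estrin-ish: three independent groups that the CPU can overlap. */
static double
S_estrin(double x2)
{
	double x4 = x2 * x2;

	return ((s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1);
}

int
main(void)
{
	double x = 0.125, x2 = x * x;

	printf("horner: %.17g\n", x * (M_PI + x2 * S_horner(x2)));
	printf("estrin: %.17g\n", x * (M_PI + x2 * S_estrin(x2)));
	printf("libm:   %.17g\n", sin(M_PI * x));
	return (0);
}

Both functions evaluate the same polynomial; the difference is only in
how much instruction-level parallelism the grouping exposes.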
> > The addition of s1 must occur last, or I get greater than 1 ULP
> > errors for a number of values.
>
> Indeed.  This prevents using the best order, which is the opposite.
> You used the best order for the other inner terms, but not so obviously
> for the first outer addition.  However, the compiler is free to reorder
> that; consider:
>
> 	T2 + T1 + T0
>
> T2 is more complicated and everything depends on it, so it should be
> started first, but possibly T1 can be started first because it is
> simpler.  Here it is not significantly simpler.  However, for:
>
> 	T3 + T2 + T1 + T0
>
> the natural left-to-right grouping is usually pessimal.  This should be
> written as:
>
> 	T3 + (T2 + T1) + T0
>
> Unfortunately, we can't write this naturally and usually optimally as:
>
> 	T0 + T1 + T2 + T3
>
> due to the accuracy problem with T0.  Maybe write this as:
>
> 	T1 + T2 + T3 + T0	// only T0 in unnatural order
>
> For expl, I tried all orders.

I tried a few different orders, particularly with the long double
functions.  I also managed to break the code a few times.  Thankfully,
I have everything in a subversion repository.

> > My timing shows, on an Intel Core2 Duo:
> >
> > /* Horner's method. */
> > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.066 usec per call
> > cospi time: 0.063 usec per call
> >
> > /* Estrin's method. */
> > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.059 usec per call
> > cospi time: 0.055 usec per call
> >
> > Here, -n 22 means to test 2**22 - 1 values in the interval
> > [0,0.25].
>
> The improvement is marginal.  Only worth doing if it doesn't complicate
> the code much.  Older CPUs can't exploit the parallelism so well, so I
> would expect a relatively larger improvement.  Don't measure in usec,
> since they are hard to scale.  0.66 usec at 1.86 GHz is more than 100
> cycles.  This is quite slow.  I get similar times for cos and sin on an
> older Athlon-XP (~40 in float prec, ~160 in double, and ~270 in long
> double).  Newer CPUs are about 4 times faster in double prec (only 2
> times faster in float prec) in cycles despite having similar nominal
> timing, by unclogging their pipes.

I'm assuming your 0.66 in the above is a typo, as I wrote 0.066.

I also got some improvement by changing the splitting of a few entities
into high and low parts by a simple cast.  Instead of doing

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

I do

	xhi = (float)x;
	xlo = x - xhi;

but this seems to only help in some situations.
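For reference, here are both splits in standalone form.
EXTRACT_WORDS/INSERT_WORDS are libm-internal macros from
math_private.h, so memcpy() stands in for them below; this is only a
sketch of the technique:

/* cc split.c */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Split x into a head xhi with few significand bits and an exact
 * tail xlo = x - xhi.
 */

/* Word split: zero the low 32 bits of x, as EXTRACT_WORDS/INSERT_WORDS
 * with a zero low word do.  The head keeps 21 significand bits. */
static double
split_words(double x, double *lo)
{
	uint64_t bits;
	double xhi;

	memcpy(&bits, &x, sizeof(bits));
	bits &= 0xffffffff00000000ULL;
	memcpy(&xhi, &bits, sizeof(xhi));
	*lo = x - xhi;		/* exact */
	return (xhi);
}

/* Cast split: round to float, so the head keeps 24 significand bits.
 * The tail is again exact, and the cast avoids the word shuffling. */
static double
split_cast(double x, double *lo)
{
	double xhi = (float)x;

	*lo = x - xhi;		/* exact */
	return (xhi);
}

int
main(void)
{
	double hi, lo;
	double x = 0.24220192246482908;		/* the worst case reported below */

	hi = split_words(x, &lo);
	printf("words: hi = %a  lo = %a  exact = %d\n", hi, lo, hi + lo == x);
	hi = split_cast(x, &lo);
	printf("cast:  hi = %a  lo = %a  exact = %d\n", hi, lo, hi + lo == x);
	return (0);
}

Both heads are narrow enough that xhi * xhi is exact in double (2*21
and 2*24 bits both fit in 53), which is what the kernel below relies on.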
My new timing for x in [0,0.25] on my 1995.05 MHz Core2 Duo gives

Horner
sinpi time: 0.073 usec per call (~146 cycles)
cospi time: 0.070 usec per call (~140)
tanpi time: 0.112 usec per call (~224)

Estrin
sinpi time: 0.064 usec per call (~128)
cospi time: 0.060 usec per call (~120)
tanpi time: 0.088 usec per call (~176)

tanpi benefits the most, as it has an order-16 polynomial, which allows
some freedom in grouping terms.

I don't think I can do much better.  For example, the approximation for
sinpi(x) is

	sinpi(x) = x * (s0 + x2 * S(x2))

where S(x2) is the polynomial, x must be split into high and low parts,
and s0 is split.  The kernel with the Estrin method (and with the
polynomial coefficients elided) is

static inline double
__kernel_sinpi(double x)
{
	double hi, lo, sm, xhi, xlo;
	uint32_t hx, lx;
	double x2, x4;

	/* Sort of Estrin's method. */
	x2 = x * x;
	x4 = x2 * x2;
	sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;
	lo = xlo * (x + xhi) * sm;
	hi = xhi * xhi * sm;
	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;
	return (lo + xhi * s0lo + xhi * s0hi);
}

The order of terms in the last 5 lines matters.  Testing yields

a         = 2.4220192246482908e-01,	/* 0x3fcf0078, 0xfc01e3f0 */
sinpi(a)  = 6.8957335930938313e-01,	/* 0x3fe610fc, 0x264da72e */
sin(pi*a) = 6.8957335930938324e-01,	/* 0x3fe610fc, 0x264da72f */
ULP: 0.79950446

MAX ULP: 0.79950446
Total tested: 8388606
1.0 < ULP       :     0
0.9 < ULP <= 1.0:     0
0.8 < ULP <= 0.9:     0
0.7 < ULP <= 0.8:   554
0.6 < ULP <= 0.7: 11275

> We should also try to use the expression A*x + B more often so that
> things will be faster when this is done by fma, but this happens
> naturally anyway.

Using A*x+B or B+A*x did not seem to matter.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
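P.S.  In case anyone wants to reproduce the flavor of the ULP numbers:
a rough sketch of the scan is below.  The function under test here is
the naive sin(M_PI*x) rather than my sinpi(), and sinl() with a long
double pi stands in for a proper higher-precision reference (my testd
harness is more careful), so treat its output as indicative only:

/* cc ulpscan.c -lm */
#include <math.h>
#include <stdio.h>

static const long double pi_l = 3.14159265358979323846L;

int
main(void)
{
	double got, max_ulp = 0, u, worst = 0, x;
	long i, n = (1L << 22) - 1;	/* matches -n 22 above */
	long double ref;

	for (i = 1; i <= n; i++) {
		x = 0.25 * (double)i / (double)n;
		got = sin(M_PI * x);			/* under test */
		ref = sinl(pi_l * (long double)x);	/* reference */
		/* Error in units of ulp(got); got > 0 on this range. */
		u = fabsl(ref - (long double)got) /
		    (nextafter(got, INFINITY) - got);
		if (u > max_ulp) {
			max_ulp = u;
			worst = x;
		}
	}
	printf("MAX ULP: %.8f at x = %.17g\n", max_ulp, worst);
	return (0);
}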