From owner-freebsd-standards@freebsd.org Tue Apr 25 08:28:08 2017
Date: Tue, 25 Apr 2017 18:27:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl
cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions
In-Reply-To: <20170425063643.GA45906@troutmask.apl.washington.edu>
Message-ID: <20170425173542.S1401@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
 <20170425063643.GA45906@troutmask.apl.washington.edu>

On Mon, 24 Apr 2017, Steve Kargl wrote:

> On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
>> [for expl]
>> This is arranged to look a bit like Horner's method at first (operations
>> not written in separate statements), but it is basically Estrin's method
>> with no extensions to more complicated sub-expressions than Estrin's,
>> but with complications for accuracy:
>> - I think accuracy for the B3 term is unimportant, but for some reason
>>   (perhaps just the one in the comment) it is grouped separately.  Estrin
>>   might use x3 * (B3.e + x * B4.e).
>
> Yes, it is most likely an accuracy issue.  I found something
> similar with sinpi/cospi.  My k_sinpi.c has
>
> #if 0
> double x2;
> /* Horner's method. */
> x2 = x * x;
> sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
>     + s2) + s1;

Horner's method of course has about as many dependencies as operations
(13), so on modern x86 this is limited by latency: 3-5 cycles per
instruction, or 39-65 cycles in total.  Multiplications take an extra
cycle in more cases, so it is better to have more additions, but
unfortunately Horner gives 1 more multiplication.  With no dependencies,
the limit would be throughput -- 2 instructions/cycle, i.e., 6.5 cycles
rounded up to 7; then add the latency of the last operation to get about
11.  With practical dependencies, it is hard to get more than a 2-fold
speedup.

> #else
> double x2, x4;
> /* Sort of Estrin's method. */
> x2 = x * x;
> x4 = x2 * x2;
> sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
>     + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> #endif

This looks OK.
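To make the comparison concrete, here is a standalone sketch of the two
evaluation orders quoted above.  The coefficients are the Taylor
coefficients of sin(pi*x)/x, standing in as placeholders for the actual
minimax coefficients in Steve's k_sinpi.c, which do not appear in this
thread:

#include <math.h>
#include <stdio.h>

static double s[8];		/* s[1]..s[7], named as in k_sinpi.c */

static void
init_coeffs(void)
{
	/* Taylor placeholder: s[k] = (-1)^(k-1) * pi^(2k-1) / (2k-1)!. */
	double p = M_PI, f = 1;
	int k;

	for (k = 1; k <= 7; k++) {
		s[k] = (k & 1 ? p : -p) / f;
		p *= M_PI * M_PI;
		f *= (2 * k) * (2 * k + 1);
	}
}

static double
sm_horner(double x)
{
	double x2 = x * x;

	/* One long dependency chain of ~13 operations: latency-bound. */
	return (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s[7] + s[6]) + s[5]) +
	    s[4]) + s[3]) + s[2]) + s[1]);
}

static double
sm_estrin(double x)
{
	double x2 = x * x;
	double x4 = x2 * x2;

	/* Independent inner groups; s[1] is still added last for accuracy. */
	return ((s[5] + s[6] * x2 + s[7] * x4) * x4 * x4 +
	    (s[2] + s[3] * x2 + s[4] * x4) * x2 + s[1]);
}

int
main(void)
{
	double x = 0.125;

	init_coeffs();
	/* x * sm(x) approximates sin(pi*x) on [0, 0.25]. */
	printf("horner: %.17g\nestrin: %.17g\nlibm:   %.17g\n",
	    x * sm_horner(x), x * sm_estrin(x), sin(M_PI * x));
	return (0);
}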
The only obvious improvement is to try grouping x4*x4 so that it can be
evaluated separately.  This reduces the length of the dependency chains
by 1, but increases early pressure on multiplication.  Don't write this
as x8 = x4*x4; compilers will do the equivalent.  You can also replace
x4 by (x2*x2), but there are many x4's, so this might be less readable.

> The addition of s1 must occur last, or I get greater than 1 ULP
> for a number of values.

Indeed.  This prevents using the best order, which is the opposite.
You used the best order for the other inner terms, but not so obviously
for the first outer addition.  However, the compiler is free to reorder
that; consider:

	T2 + T1 + T0

T2 is more complicated and everything depends on it, so it should be
started first, but possibly T1 can be started first because it is
simpler.  Here it is not significantly simpler.  However, for:

	T3 + T2 + T1 + T0

the natural left-to-right grouping is usually pessimal.  This should be
written as:

	T3 + (T2 + T1) + T0

Unfortunately, we can't write this naturally and usually optimally as:

	T0 + T1 + T2 + T3

due to the accuracy problem with T0.  Maybe write this as:

	T1 + T2 + T3 + T0	// only T0 in unnatural order

For expl, I tried all orders.

> My timings on an Intel Core 2 Duo show
>
> /* Horner's method. */
> laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.066 usec per call
> cospi time: 0.063 usec per call
>
> /* Estrin's method. */
> laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.059 usec per call
> cospi time: 0.055 usec per call
>
> Here, -n 22 means to test 2**22 - 1 values in the interval
> [0,0.25].

The improvement is marginal.  Only worth doing if it doesn't complicate
the code much.  Older CPUs can't exploit the parallelism so well, so I
would expect a relatively larger improvement there.

Don't measure in usec, since microseconds are hard to scale.  0.066 usec
at 1.86 GHz is more than 100 cycles.  This is quite slow.  I get similar
times for cos and sin on an older Athlon XP (~40 cycles in float
precision, ~160 in double and ~270 in long double).  Newer CPUs are
about 4 times faster in double precision (only 2 times faster in float
precision) in cycles, despite having similar nominal timing, by
unclogging their pipes.

We should also try to use the expression A*x + B more often, so that
things will be faster when this is done by fma, but this happens
naturally anyway.

Bruce
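As a concrete illustration of measuring in cycles rather than usec, a
minimal x86 sketch using the TSC might look like the following.  The
polynomial being timed and its coefficients are arbitrary stand-ins, not
Steve's testd harness, and a constant-rate TSC is assumed:

#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

#define	N	(1 << 22)

static double
poly(double x)
{
	double x2 = x * x;
	double x4 = x2 * x2;

	/* Estrin-style grouping; coefficients are arbitrary here. */
	return ((0.5 + 0.25 * x2 + 0.125 * x4) * x4 * x4 +
	    (1.0 + 0.75 * x2 + 0.625 * x4) * x2 + 1.0);
}

int
main(void)
{
	volatile double sink = 0;
	unsigned long long t0, t1;
	int i;

	t0 = __rdtsc();
	for (i = 0; i < N; i++)
		sink += poly((double)i / N * 0.25);
	t1 = __rdtsc();
	/* Loop and argument-generation overhead are included. */
	printf("%.1f cycles per call\n", (double)(t1 - t0) / N);
	return (0);
}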