Date: Sat, 29 Apr 2017 20:19:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Steve Kargl <sgk@troutmask.apl.washington.edu>, freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429194239.P3294@besplex.bde.org>
In-Reply-To: <20170429151457.F809@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org>
On Sat, 29 Apr 2017, Bruce Evans wrote:

> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
>> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>>
>>> I was just backtracking with __kernel_sinpi.  This gets a max
>>> ULP < 0.61.
>
> Comments on this below.
>
> This is all rather over-engineered.  Optimizing these functions is
> unimportant compared with finishing cosl() and sinl() and optimizing
> all of the standard trig functions better, but we need correctness.
> But I now see many simplifications and improvements:
>
> (1) There is no need for new kernels.  The standard kernels already
> handle extra precision using approximations like:
>
>     sin(x+y) ~= sin(x) + (1-x*x/2)*y.
>
> Simply reduce x and write Pi*x = hi+lo.  Then
>
>     sin(Pi*x) = __kernel_sin(hi, lo, 1).
>
> I now see how to do the extra-precision calculations without any
> multiplications.

But that is over-engineered too.  Using the standard kernels is easy
and works well:

XX #include <float.h>
XX #include <math.h>
XX
XX #include "math_private.h"
XX
XX static const double
XX pi_hi = 3.1415926218032837e+00,	/* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08;	/* 0x3e6110b4 0x611a5f14 */
XX
XX /* Only for |x| <= ~0.25 (missing range reduction). */
XX
XX double
XX cospi(double x)
XX {
XX 	double_t hi, lo;
XX
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_cos(hi, lo);
XX }
XX
XX double
XX sinpi(double x)
XX {
XX 	double_t hi, lo;
XX
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_sin(hi, lo, 1);
XX }
XX
XX double
XX tanpi(double x)
XX {
XX 	double_t hi, lo;
XX
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_tan(hi, lo, 1);
XX }

I only did a sloppy accuracy test for sinpi().
It was 0.03 ulps less accurate than sin() on the range [0, 0.25] for
it and [0, Pi/4] for sin().

Efficiency is very good in some cases, but anomalous in others:
all times in cycles, on i386, on the range [0, 0.25]

        athlon-xp, gcc-3.3    Haswell, gcc-3.3    Haswell, gcc-4.2.1
cos:    61-62                 44                  43
cospi:  69-71 (8-9 extra)     78 (anomalous...)   42 (faster to do more!)
sin:    59-60                 51                  37
sinpi:  67-68 (8 extra)       80                  42
tan:    136-172               93-195              67-94
tanpi:  144-187 (8-15 extra)  145-176             61-189

That was a throughput test.  Latency is not so good.  My latency test
doesn't use serializing instructions, but uses random args and the
partial serialization of making each result depend on the previous one.

        athlon-xp, gcc-3.3    Haswell, gcc-3.3    Haswell, gcc-4.2.1
cos:    84-85                 69                  79
cospi:  103-104 (19-21 extra) 117                 94
sin:    75-76                 89                  77
sinpi:  105-106 (30 extra)    116                 90
tan:    168-170               167-168             147
tanpi:  191-194 (23-24 extra) 191                 154

This also indicates that the longest times for tan in the throughput
test are what happens when the function doesn't run in parallel with
itself.  The high-degree polynomial and other complications in tan()
are too complicated for much cross-function parallelism.

Anyway, it looks like the cost of using the kernel is at most 8-9 in
the parallel case and at most 30 in the serial case.  The extra-
precision code has about 10 dependent instructions, so it is doing OK
to take 30.

Bruce