Date: Sat, 13 May 2017 13:55:17 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170513205517.GA91911@troutmask.apl.washington.edu>
In-Reply-To: <20170429194239.P3294@besplex.bde.org>
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
            <20170428183733.V1497@besplex.bde.org>
            <20170428165658.GA17560@troutmask.apl.washington.edu>
            <20170429035131.E3406@besplex.bde.org>
            <20170428201522.GA32785@troutmask.apl.washington.edu>
            <20170429070036.A4005@besplex.bde.org>
            <20170428233552.GA34580@troutmask.apl.washington.edu>
            <20170429005924.GA37947@troutmask.apl.washington.edu>
            <20170429151457.F809@besplex.bde.org>
            <20170429194239.P3294@besplex.bde.org>
On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
>
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>>
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant compared with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already
> > handle extra precision using approximations like:
> >
> >     sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >     sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
>
> But that is over-engineered too.
>
> Using the standard kernels is easy and works well:

Maybe it works well.  See below.

> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]:
>
>          athlon-xp, gcc-3.3     Haswell, gcc-3.3    Haswell, gcc-4.2.1
> cos:     61-62                  44                  43
> cospi:   69-71 (8-9 extra)      78 (anomalous...)   42 (faster to do more!)
> sin:     59-60                  51                  37
> sinpi:   67-68 (8 extra)        80                  42
> tan:     136-172                93-195              67-94
> tanpi:   144-187 (8-15 extra)   145-176             61-189
>
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
>
>          athlon-xp, gcc-3.3     Haswell, gcc-3.3    Haswell, gcc-4.2.1
> cos:     84-85                  69                  79
> cospi:   103-104 (19-21 extra)  117                 94
> sin:     75-76                  89                  77
> sinpi:   105-106 (30 extra)     116                 90
> tan:     168-170                167-168             147
> tanpi:   191-194 (23-24 extra)  191                 154
>
> This also indicates that the longest times for tan in the throughput
> test are what happens when the function doesn't run in parallel with
> itself.  The high-degree polynomial and other complications in tan()
> are too complicated for much cross-function parallelism.
>
> Anyway, it looks like the cost of using the kernel is at most 8-9
> in the parallel case and at most 30 in the serial case.  The extra-
> precision code has about 10 dependent instructions, so it is
> doing OK to take 30.
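To make the hi+lo reduction described above concrete, a minimal sketch
for the double case might look like the following.  This assumes a
correctly rounded fma() and the usual two-term split of pi; the helper
name __kernel_sinpi_sketch and the fma() route are for illustration
only, and the remark above implies the split can even be done without
the multiplications:

    #include <math.h>

    /* fdlibm kernel: sin(x+y) for |x| < ~pi/4, y a tiny correction term. */
    double __kernel_sin(double x, double y, int iy);

    /* Two-term split: pi_hi + pi_lo ~= Pi to about 107 bits. */
    static const double pi_hi = 3.14159265358979311600e+00;
    static const double pi_lo = 1.22464679914735317722e-16;  /* Pi - pi_hi */

    /*
     * sinpi(x) for x already reduced to [0, 0.25]: write Pi*x = hi + lo,
     * where fma() recovers the rounding error of pi_hi*x exactly and
     * pi_lo*x supplies the tail of Pi, then reuse the standard kernel.
     */
    static double
    __kernel_sinpi_sketch(double x)
    {
            double hi, lo;

            hi = pi_hi * x;
            lo = fma(pi_hi, x, -hi) + pi_lo * x;
            return (__kernel_sin(hi, lo, 1));
    }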
Based on other replies in this email exchange, I have gone back and
looked at improvements to my __kernel_{cos|sin|tan}pi[fl] routines.
The improvements were to both accuracy and speed.  I have tested on
i686 and x86_64 systems with libm built with -O2 -march=native
-mtune=native.

My timing loop is of the form

    float dx, f, x;
    long i, k;

    f = 0;
    k = 1 << 23;
    dx = (xmax - xmin) / (k - 1);   /* xmin, xmax: interval under test */
    time_start();
    for (i = 0; i < k; i++) {
        x = xmin + i * dx;
        f += cospif(x);
    }
    time_end();
    x = (time_cpu() / k) * 1.e6;
    printf("cospif time: %.4f usec per call\n", x);
    if (f == 0) printf("Can't happen!\n");

where time_start(), time_end(), and time_cpu() are my timing helpers,
and the final test of f keeps the compiler from optimizing the
accumulation away.  The assumption here is that the loop overhead is
the same for all tested kernels.

Test intervals for the kernels:

    float:  [0x1p-14, 0.25]
    double: [0x1p-29, 0.25]
    ld80:   [0x1p-34, 0.25]

         Core2 Duo T7250 @ 2.00GHz     ||  AMD FX8350 Eight-Core CPU
         (1995.05-MHz 686-class)       ||  (4018.34-MHz K8-class)
----------------------------------++--------------------------
       | Horner | Estrin | Fdlibm || Horner | Estrin | Fdlibm
-------+--------+--------+--------++--------+--------+--------
cospif | 0.0223 |        | 0.0325 || 0.0112 |        | 0.0085
sinpif | 0.0233 | Note 1 | 0.0309 || 0.0125 |        | 0.0085
tanpif | 0.0340 |        | Note 2 || 0.0222 |        |
-------+--------+--------+--------++--------+--------+--------
cospi  | 0.0641 | 0.0571 | 0.0604 || 0.0157 | 0.0142 | 0.0149
sinpi  | 0.0722 | 0.0626 | 0.0712 || 0.0178 | 0.0161 | 0.0166
tanpi  | 0.1049 | 0.0801 |        || 0.0323 | 0.0238 |
-------+--------+--------+--------++--------+--------+--------
cospil | 0.0817 | 0.0716 | 0.0921 || 0.0558 | 0.0560 | 0.0755
sinpil | 0.0951 | 0.0847 | 0.0994 || 0.0627 | 0.0568 | 0.0768
tanpil | 0.1310 | 0.1004 |        || 0.1005 | 0.0827 |
-------+--------+--------+--------++--------+--------+--------

Times are in usec/call.

Note 1. In re-arranging the polynomials for Estrin's method (see the
sketch below) for float, I found no appreciable benefit, so those
entries are blank.

Note 2. I have been unable to use the tan[fl] kernels to implement
satisfactory kernels for tanpi[fl].  In particular, for x in
[0.25, 0.5], using the tanf kernel leads to 6-digit ULPs near 0.5,
whereas my kernel stays near 2 ULP.
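For reference, the Horner and Estrin columns differ only in the order
of evaluation of the same polynomial.  With illustrative coefficients
S[0..3] (not the ones in my kernels), the two schemes are:

    /* Horner: each step depends on the previous one. */
    static double
    poly_horner(double z, const double S[4])
    {
            return (S[0] + z * (S[1] + z * (S[2] + z * S[3])));
    }

    /*
     * Estrin: z*z and the two inner sums are mutually independent,
     * so an out-of-order core can overlap them; the dependency chain
     * shortens at the cost of one extra multiplication.
     */
    static double
    poly_estrin(double z, const double S[4])
    {
            double z2 = z * z;

            return ((S[0] + z * S[1]) + z2 * (S[2] + z * S[3]));
    }

The actual kernels evaluate higher-degree polynomials in x*x, where
the shorter dependency chain is what shows up as the Estrin advantage
for double and ld80 above.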
-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow