Date: Sun, 14 May 2017 08:30:34 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170514071942.T1084@besplex.bde.org>
In-Reply-To: <20170513205517.GA91911@troutmask.apl.washington.edu>
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429194239.P3294@besplex.bde.org>
 <20170513205517.GA91911@troutmask.apl.washington.edu>
On Sat, 13 May 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
>> ...
>> Using the standard kernels is easy and works well:
>
> Maybe works well.  See below.
>
>> ...
>> Anyway, it looks like the cost of using the kernel is at most 8-9
>> in the parallel case and at most 30 in the serial case.  The extra-
>> precision code has about 10 dependent instructions, so it is
>> doing OK to take 30.

Probably a few more than 8.  I got nowhere using inline versions for
double precision.  Apparently the only large win for inlining is when
it avoids repeating the classification, as it does for e_rem_pio2*.
The kernels don't repeat anything, so the only cost is a function
call, plus a few cycles for testing iy for __kernel_sin() only.

> Based on other replies in this email exchange, I have gone back
> and looked at improvements to my __kernel_{cos|sin|tan}pi[fl]
> routines.  The improvements were for both accuracy and speed.

I really don't want another set of kernels (or more sets for degrees
instead of radians, and sincos).  Improvements to the existing
kernels are welcome, but difficult except for long double precision.

I got nowhere tweaking the polynomial in __kernel_sin().  Every
change that I tried just moved the tradeoff between accuracy and
efficiency.  The one best for efficiency is only about 4 cycles
faster, and increases the error by 0.1 to 0.2 ulps.  This change
involves adding up the terms in a different order.

> I have tested on i686 and x86_64 systems with libm built with
> -O2 -march=native -mtune=native.  My timing loop is of the
> form
>
> float dx, f, x;
> long i, k;
>
> f = 0;
> k = 1 << 23;
> dx = (xmax - xmin) / (k - 1);
> time_start();
> for (i = 0; i < k; i++) {
>         x = xmin + i * dx;

This asks for a conversion from long to double, which tends to be
slow, and a multiplication in the inner loop.  The compiler shouldn't
optimize it to x += dx, since that has different inaccuracy.

My test loop does x += dx in FP and tests that x < limit.  This
sometimes has problems when dx is so small that x + dx == x.  Also,
x, dx and limit are double precision for testing all precisions, so
that the loop overhead is the same for all precisions.  This works
best on i386/i387.  Otherwise, there are larger conversion overheads.
This usually prevents x + dx == x in float precision, but in long
double precision it results in x + dx == x more often.  Double
precision just can't handle a large limit like LDBL_MAX or even
small steps up to DBL_MAX.

>         f += cospif(x);
> };
> time_end();
>
> x = (time_cpu() / k) * 1.e6;
> printf("cospif time: %.4f usec per call\n", x);
>
> if (f == 0)
>         printf("Can't happen!\n");

Otherwise, this is a reasonable throughput test.  But please count
times in cycles if possible.  rdtsc() is very easy to use on x86.

> The assumption here is that loop overhead is the same for
> all tested kernels.

It is probably much larger for long double precision.  I get minimal
times like 9 cycles for float and double precision, but more like 30
for long double on x86.
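For reference, a minimal sketch of such a cycle-counting loop in the
x += dx style described above; the interval, the step and the function
under test (cosf() here) are placeholders, and rdtsc() is the usual
x86 inline asm:

#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Read the x86 time-stamp counter (i386 and amd64, gcc/clang asm). */
static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	/* x, dx and the limit are double for all tested precisions. */
	double f = 0, x, dx = 0x1p-23, lim = 0.25;
	uint64_t t0, t1;
	long k = 0;

	t0 = rdtsc();
	for (x = 0x1p-14; x < lim; x += dx) {	/* beware x + dx == x */
		f += cosf((float)x);		/* function under test */
		k++;
	}
	t1 = rdtsc();
	printf("%.2f cycles/call\n", (double)(t1 - t0) / (double)k);
	if (f == 0)				/* keep the sum live */
		printf("Can't happen!\n");
	return (0);
}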
> Test intervals for kernels:
>
> float:  [0x1p-14, 0.25]
> double: [0x1p-29, 0.25]
> ld80:   [0x1p-34, 0.25]
>
>         Core2 Duo T7250 @ 2.00GHz || AMD FX8350 Eight-Core CPU
>         (1995.05-MHz 686-class)   || (4018.34-MHz K8-class)
> ----------------------------------++--------------------------
>        | Horner | Estrin | Fdlibm || Horner | Estrin | Fdlibm
> -------+--------+--------+--------++--------+--------+--------
> cospif | 0.0223 |        | 0.0325 || 0.0112 |        | 0.0085
> sinpif | 0.0233 | Note 1 | 0.0309 || 0.0125 |        | 0.0085
> tanpif | 0.0340 |        | Note 2 || 0.0222 |        |
> -------+--------+--------+--------++--------+--------+--------

The fdlibm kernels are almost impossible to beat in float precision,
since they use double precision, so the correct way to use them is,
for example, 'cospif: return __kernel_cosdf(M_PI * x);' after
reduction to |x| ~< 0.25 (a sketch of this usage follows the notes
below).  Any pure float precision method is going to take 10-20
cycles longer.

It is interesting that you measured fdlibm to be faster on the newer
system but much slower on the older one.  The latter must be a bug
somewhere.

> -------+--------+--------+--------++--------+--------+--------
> cospi  | 0.0641 | 0.0571 | 0.0604 || 0.0157 | 0.0142 | 0.0149
> sinpi  | 0.0722 | 0.0626 | 0.0712 || 0.0178 | 0.0161 | 0.0166
> tanpi  | 0.1049 | 0.0801 |        || 0.0323 | 0.0238 |
> -------+--------+--------+--------++--------+--------+--------

Now the differences are almost small enough to be noise.

> cospil | 0.0817 | 0.0716 | 0.0921 || 0.0558 | 0.0560 | 0.0755
> sinpil | 0.0951 | 0.0847 | 0.0994 || 0.0627 | 0.0568 | 0.0768
> tanpil | 0.1310 | 0.1004 |        || 0.1005 | 0.0827 |
> -------+--------+--------+--------++--------+--------+--------

Here the difference is that the kernels for long double precision are
unoptimized: they use Horner.  Actually, they do use the optimization
of using double precision constants where possible (but not the
larger optimization for sparc64 of calculating the higher terms in
double precision).

> Time in usec/call.
>
> Note 1. In re-arranging the polynomials for Estrin's method and
> float, I found appreciable benefit.

Do you mean "no appreciable benefit"?  No times are shown.  Short
polynomials benefit less.

There is also the problem that measuring throughput vs. latency is
hard.  If the CPU can execute several functions in parallel, it is
best (iff the load has candidates for such functions, as simple tests
do) to use something like Horner's method to minimise the number of
operations.  Horner's method is only very bad for latency, and on
in-order CPUs.  Some of the timing anomalies are probably explained
by this -- newer CPUs have fewer bottlenecks, so they do better at
executing functions in parallel; this is also easier in float
precision.  A sketch contrasting the two evaluation schemes follows.
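Here is a minimal sketch of the Horner vs. Estrin contrast for a
cosine-like polynomial in z = x*x.  The Taylor coefficients are only
placeholders standing in for fdlibm's minimax ones; Horner uses the
fewest operations but forms one long dependency chain, while Estrin
pairs terms so the independent halves can issue in parallel, at the
cost of the extra z2 multiplication:

#include <stdio.h>

#define	C1	(-1.0 / 2)	/* placeholder Taylor coefficients, */
#define	C2	( 1.0 / 24)	/* not the real minimax ones */
#define	C3	(-1.0 / 720)
#define	C4	( 1.0 / 40320)

/* Horner: 4 mul + 4 add, all serially dependent. */
static double
poly_horner(double z)
{
	return (1.0 + z * (C1 + z * (C2 + z * (C3 + z * C4))));
}

/* Estrin: one extra mul (z2), but two independent halves. */
static double
poly_estrin(double z)
{
	double z2 = z * z;

	return ((1.0 + z * C1) + z2 * ((C2 + z * C3) + z2 * C4));
}

int
main(void)
{
	double z = 0.125;

	printf("%.17g %.17g\n", poly_horner(z), poly_estrin(z));
	return (0);
}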
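And, returning to the float rows of the table, a sketch of the
float-via-double use of the fdlibm kernel mentioned above.  This is
illustration only: __kernel_cosdf() is internal to libm (declared in
its private headers), so it builds only inside the libm source tree,
and the argument reduction is assumed to have been done already:

#include <math.h>

float __kernel_cosdf(double x);	/* libm-internal prototype */

static float
cospif_reduced(float x)		/* assumes |x| <= 0.25 after reduction */
{
	/* One double multiplication by M_PI costs ~0.0001 ulps. */
	return (__kernel_cosdf(M_PI * x));
}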
> Note 2. I have been unable to use the tan[fl] kernels to implement
> satisfactory kernels for tanpi[fl].  In particular, for x in
> [0.25, 0.5], using the tanf kernel leads to errors of 6 digits of
> ULPs near 0.5, whereas my kernel is near 2 ULPs.

The tanf kernel should be very accurate, since it is in double
precision.  But its polynomial is chosen to give an accuracy of only
0.7999 ulps, while the polys for cosf and sinf are chosen to give an
accuracy of 0.5009 ulps, since the high accuracy is only almost free
for the latter.  Any extra error on top of 0.7999 might be too much.
But multiplication by M_PI in double precision shouldn't change the
error by more than 0.0001 ulps.

The tanl kernel has to struggle to get even sub-ulp precision.  Its
degree is too high for efficiency, and I don't trust it to give even
sub-ulp precision, especially for ld128.

I didn't manage to get cospi(x) and sinpi(x) using the kernels as
fast as cos(x) and sin(x), even with |x| restricted to < 0.25 so that
the range-reduction step is null.  The extra-precision operations
just take longer than the range reduction, even when the latter is
not simplified for the reduced range.

Conversion of degrees to multiples of Pi is interesting.  E.g.,
cosd(x) = cos(x * Pi / 180) = cospi(x / 180) in infinite precision.
The natural way to implement it is to convert to cospi() first.  This
is only easy using a remainder operation.  Remainder operations work
for this, unlike for converting radians to a quadrant plus a
remainder, because 180 is exactly representable but Pi isn't.  But
exact remainder operations are slow too.  They are just not as slow
or inexact as the ones for 18000+ digit approximations to Pi.  So
cosd(x) can only be implemented much more efficiently than cos(x) for
the unimportant case of large |x|.
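For concreteness, a minimal sketch of that reduction, assuming a
cospi() is available (C23 specifies one).  fmod() is exact for finite
arguments and 360 is exactly representable, so the remainder step
introduces no error; only the final scaling by 180 rounds:

#include <math.h>

/*
 * Hypothetical cosd() on top of cospi(): the remainder step is exact
 * because 360 is exactly representable and fmod() is exact for
 * finite arguments; the division by 180 rounds once.  This is only
 * the easy reduction, not a correctly rounded cosd().
 */
static double
cosd(double x)
{
	double r = fmod(x, 360.0);	/* exact: r is in (-360, 360) */

	return (cospi(r / 180.0));	/* one rounding in the scaling */
}

Bruce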