Date: Tue, 25 Apr 2017 15:13:44 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170425143120.R933@besplex.bde.org>
In-Reply-To: <20170424232743.GA54621@troutmask.apl.washington.edu>
References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu>
On Mon, 24 Apr 2017, Steve Kargl wrote:

> I have what appears to be working versions of asinpi[fl].  It
> was suggested elsewhere that using an Estrin's method to
> sum the polynomial approximations instead of Horner's method
> may allow modern CPUs to better schedule generated code.
> I have implemented an Estrin's-like method for sinpi[l]
> and cospi[l], and indeed the generated code is faster on
> my Intel core2 duo with only a slight degradation in the
> observed max ULP.  I'll post new versions to bugzilla in
> the near future.

I didn't remember Estrin, but checked Knuth and see that his method is
a subset of what I use routinely.  I might have learned it from Knuth.
I tried all methods in Knuth, but they didn't seem to be any better at
first.  Later I learned more about scheduling.  Estrin's method now
seems routine as a subset of scheduling.  You already used it in expl:
for ld80:

X 	x2 = x * x;
X 	x4 = x2 * x2;

Unlike in Knuth's description of Estrin, we don't try to order the
operations to suit the CPU(s) in the code, but just try to keep them
independent, as required for them to execute independently.  Compilers
will rearrange them anyway, sometimes even correctly, but it makes
little difference on OOE CPUs, which just throw all of the operations
into the pipeline and see how fast something comes out.

X 	q = x4 * (x2 * (x4 *
X 	/*
X 	 * XXX the number of terms is no longer good for
X 	 * pairwise grouping of all except B3, and the
X 	 * grouping is no longer from highest down.
X 	 */
X 	    (x2 * B12 + (x * B11 + B10)) +
X 	    (x2 * (x * B9 + B8) + (x * B7 + B6))) +
X 	    (x * B5 + B4.e)) + x2 * x * B3.e;

This is arranged to look a bit like Horner's method at first (operations
not written in separate statements), but it is basically Estrin's
method, with no extensions to more complicated sub-expressions than
Estrin's, but with complications for accuracy:

- I think accuracy for the B3 term is unimportant, but for some reason
  (perhaps just the one in the comment) it is grouped separately.
  Estrin might use x3 * (B3.e + x * B4.e).
- accuracy is important for the B0 through B2 terms, and we don't use
  Estrin for them.

Efficiency is a matter of running the above in parallel with the
evaluation of the lower terms.  The above shouldn't steal any resources
from the evaluation of the lower terms unless it is the slow part.
Normally you want the N (here 2) independent streams to complete at the
same time, ready for a final addition.

The above also does some power-of-2 grouping, which has been messed up
by changing the number of terms in the polynomial (see the comment).
Generally this costs the time of 1 operation for each missed
optimization.  The above hasn't been rearranged, since it is a lot of
work to keep rearranging it during development, and development stalled
after committing it.

The ld128 case has larger problems from this.  It has 2 sets of
polynomials, both of higher order, with sub-optimal numbers of terms.
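To make the grouping concrete, here is a minimal sketch (not the
committed expl code; the coefficients C0-C7 and the function names are
invented for the example) of the same degree-7 polynomial evaluated by
Horner's rule and by a 2-stream Estrin-style grouping:

	static const double C0 = 1, C1 = 1, C2 = 0.5, C3 = 1.0 / 6,
	    C4 = 1.0 / 24, C5 = 1.0 / 120, C6 = 1.0 / 720,
	    C7 = 1.0 / 5040;

	static double
	poly_horner(double x)
	{

		/* 7 dependent mul+add steps; bound by latency. */
		return (C0 + x * (C1 + x * (C2 + x * (C3 + x * (C4 +
		    x * (C5 + x * (C6 + x * C7)))))));
	}

	static double
	poly_estrin(double x)
	{
		double x2 = x * x;
		double x4 = x2 * x2;
		/* 2 independent streams, joined by a final addition. */
		double lo = (x * C3 + C2) * x2 + (x * C1 + C0);
		double hi = (x * C7 + C6) * x2 + (x * C5 + C4);

		return (hi * x4 + lo);
	}

Here the 4 leading multiply-adds are independent of each other and of
x2, so an OOE CPU can overlap most of the work; the Horner chain allows
no overlap at all.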
Here is my program for testing the results of quadratic grouping on
x86.  It shows little qualitative difference between Athlon-XP and
Haswell.  Haswell is just faster:

XX #include <stdlib.h>
XX
XX static __inline void
XX cpuid(int ax)
XX {
XX 	__asm __volatile("cpuid" : "=a" (ax) : "0" (ax) : "cx", "dx", "bx");
XX }
XX /* puid() expands to nothing; change cpuid(0) to puid(0) to time
XX  * without the serialization. */
XX #define puid(ax)
XX
XX double P0, P1, P2, P3;
XX double P4, P5, P6, P7;
XX double P8, P9, Pa, Pb;
XX double Pc, Pd, Pe, Pf;
XX double Q0, Q1, Q2, Q3;
XX double Q4, Q5, Q6, Q7;
XX double Q8, Q9, Qa, Qb;
XX double Qc, Qd, Qe, Qf;
XX volatile double a, r;
XX
XX #define FREQ 2010168298 /* sysctl -n machdep.tsc_freq */
XX
XX #define NITER (FREQ / 10)
XX
XX int
XX main(int argc, char **argv)
XX {
XX 	double x;
XX 	int i, n;
XX
XX 	n = (argc == 1 ? 0 : atoi(argv[1]));
XX 	/* Negative cases: Horner's method; positive: Estrin-style. */
XX 	switch (n) {
XX 	default:		/* empty loop (as is case 8): overhead */
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 2:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = x*P1+P0;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*P3));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (x*P3+P2)*(x*x)+(x*P1+P0);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 5:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t32 = x*P3+P2;
XX 			double t10 = x*P1+P0;
XX 			double x2 = x*x;
XX 			r = t32*x2+t10;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -8:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*P7))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 6:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double lo = 0.0;
XX 			r = ((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			    ((x*P3+P2)*(x*x)+lo);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 7:
XX 		for (i = 0; i < NITER; i++) {
XX 			double d = a;
XX 			double lo = 0.0;
XX 			double t = d*P7+P6;
XX 			double z = d*d;
XX 			double s = d*P5+P4;
XX 			double rl = d*P3+P2;
XX #if 0
XX 			r = lo + z * rl + z * z * s + z * z * (z * t);
XX #elif 0
XX 			r = lo + z * z * (z * t) + z * z * s + z * rl;
XX #elif 0
XX 			r = z * z * (z * t) + (z * z * s + lo) + z * rl;
XX #else
XX 			r = z * z * (z * t) + z * z * s + (z * rl + lo);
XX #endif
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 8:
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 9:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			r = ((t76 )*x2+(t54 ))*(x2*x2) +
XX 			    ((t32 )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 10:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double t54 = x*P5+P4;
XX 			double x2 = x*x;
XX 			r = ((t76 )*x2+(t54 ))*(x2*x2) +
XX 			    ((t32 )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 11:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double t7654 = t76*x2+t54;
XX 			double t3210 = t32*x2+(x*P1+P0);
XX 			r = t7654*(x2*x2) + t3210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf)))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			    ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			    (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			    (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			    ((x*P3+P2)*(x*x)+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 17:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double tdc = x*Pd+Pc;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			r = (((tfe )*x2+(tdc ))*(x2*x2) +
XX 			    ((tba )*x2+(t98 )))*((x2*x2)*(x2*x2)) +
XX 			    (((t76 )*x2+(t54 ))*(x2*x2) +
XX 			    ((t32 )*x2+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 18:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double tdc = x*Pd+Pc;
XX 			double x2 = x*x;
XX 			double t32 = x*P3+P2;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			double t10 = x*P1+P0;
XX 			double x4 = x2*x2;
XX 			double tfedc = x2*tfe+tdc;
XX 			double tba98 = x2*tba+t98;
XX 			double t7654 = x2*t76+t54;
XX 			double t3210 = x2*t32+t10;
XX 			double tfedcba98 = x4*tfedc+tba98;
XX 			double t76543210 = x4*t7654+t3210;
XX 			r = (x4*x4)*tfedcba98+t76543210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf+x*(Q0+x*(Q1+x*(Q2+x*(Q3+x*(Q4+x*(Q5+x*(Q6+x*(Q7+x*(Q8+x*(Q9+x*(Qa+x*(Qb+x*(Qc+x*(Qd+x*(Qe+x*(Qf)))))))))))))))))))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = ((((x*Qf+Qe)*(x*x)+(x*Qd+Qc))*((x*x)*(x*x)) +
XX 			    ((x*Qb+Qa)*(x*x)+(x*Q9+Q8)))*
XX 			    (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			    (((x*Q7+Q6)*(x*x)+(x*Q5+Q4))*((x*x)*(x*x)) +
XX 			    ((x*Q3+Q2)*(x*x)+(x*Q1+Q0))))*
XX 			    ((((x*x)*(x*x))*((x*x)*(x*x)))*
XX 			    (((x*x)*(x*x))*((x*x)*(x*x)))) +
XX 			    ((((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			    ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			    (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			    (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			    ((x*P3+P2)*(x*x)+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 33:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double Tfe = x*Qf+Qe;
XX 			double tfe = x*Pf+Pe;
XX 			double T76 = x*Q7+Q6;
XX 			double t76 = x*P7+P6;
XX 			double x2 = x*x;
XX 			r = ((((Tfe )*x2+(x*Qd+Qc))*(x2*x2) +
XX 			    ((x*Qb+Qa)*x2+(x*Q9+Q8)))*((x2*x2)*(x2*x2)) +
XX 			    (((T76 )*x2+(x*Q5+Q4))*(x2*x2) +
XX 			    ((x*Q3+Q2)*x2+(x*Q1+Q0))))*
XX 			    (((x2*x2)*(x2*x2))*((x2*x2)*(x2*x2))) +
XX 			    ((((tfe )*x2+(x*Pd+Pc))*(x2*x2) +
XX 			    ((x*Pb+Pa)*x2+(x*P9+P8)))*((x2*x2)*(x2*x2)) +
XX 			    (((t76 )*x2+(x*P5+P4))*(x2*x2) +
XX 			    ((x*P3+P2)*x2+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	}
XX 	return (0);
XX }
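For reference, a typical invocation (the file name here is invented):
edit FREQ to match the local "sysctl -n machdep.tsc_freq", then e.g.
"cc -O2 poly.c -o poly && time ./poly 17".  argv[1] selects the case
(the default and case 8 time the empty loop, for baseline overhead),
and since NITER is FREQ / 10, the cycles per iteration are roughly 10
times the real time in seconds reported by time(1), modulo the loop
and cpuid overheads.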
It only tests power-of-2 numbers of terms.  With say 6 or 7 terms, it
is harder to run much faster than with 8 terms, but easy to try all
other reasonable groupings.  With 8-15 terms, it is harder to try
enough other groupings.

It tests using temporary variables to try to get the best order.  This
makes a little difference with gcc-3.3, less than that with gcc-4.2,
and no difference with clang.  Looking at the generated code can be
instructive.  The compiler sometimes finds a good order that you might
not think of.  Also, you can see where it is constrained by
dependencies.

Modern x86 (SSE-mumble or AVX-mumble) has the basic Estrin operation
of B0 + x * B1 as a single operation (v*fma* in AVX).  Basic Estrin
would reduce to Horner if this were used.  But the C standard doesn't
really allow using it, and compilers don't use it on x86.  gcc did use
it on ia64.  It also takes unportable CFLAGS to get high SSE/AVX
support, and gives unportable binaries when used.  Don't use the libm
fma, since that tends to be 10-100 times slower than not using it
(perhaps only 10 times slower if it claims to be fast).  This libm fma
is only useful for algorithm development and for cases where efficiency
doesn't matter.  Fma is about half an ulp more accurate than separate
operations.
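As a sketch of that reduction (assuming only the standard C99 fma()
from <math.h>; the function name and coefficient-array interface are
invented for illustration), each Horner step becomes a single fused
operation with a single rounding when the compiler can emit hardware
fma:

	#include <math.h>

	/*
	 * Horner's rule with fused multiply-add: each step computes
	 * r * x + c[i] with one rounding.  With hardware FMA this is
	 * one instruction per step; with the software libm fma() it
	 * is far too slow for anything but accuracy checking.
	 */
	static double
	poly_horner_fma(double x, const double *c, int n)
	{
		double r = c[n - 1];
		int i;

		for (i = n - 2; i >= 0; i--)
			r = fma(r, x, c[i]);
		return (r);
	}

Bruce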