From owner-freebsd-hackers@freebsd.org Sat Apr 29 20:13:00 2017 Return-Path: Delivered-To: freebsd-hackers@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E8B38D55A65; Sat, 29 Apr 2017 20:13:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 952EC16C1; Sat, 29 Apr 2017 20:13:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 2DB851041D4A; Sun, 30 Apr 2017 06:12:46 +1000 (AEST) Date: Sun, 30 Apr 2017 06:12:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: Bruce Evans , freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170429184209.GB41420@troutmask.apl.washington.edu> Message-ID: <20170430051230.X1000@besplex.bde.org> References: <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429194239.P3294@besplex.bde.org> <20170429184209.GB41420@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=ZDInmu-daXXT0OxnvG4A:9 a=dM07vABZqfTYfvwQ:21 a=CjuIK1q_8ugA:10 X-Mailman-Approved-At: Sat, 29 Apr 2017 21:34:02 +0000 X-BeenThere: freebsd-hackers@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Technical Discussions relating to FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 20:13:01 -0000 On Sat, 29 Apr 2017, Steve Kargl wrote: > On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote: >. ... >> Using the standard kernels is easy and works well: > > As your code only works on the interval [0,0.25], I took > the liberty to use it as a __kernel_sinpi and __kernel_cospi. You already wrote the reduction part, and lgamma already has it to. I just checked lgamma. Its reduction part is short (especially after your improvements to it). Then it is sloppy and doesn't caclulate Pi*x in extra precision. I think this is justified since lgamma generally has errors much larger than 1 ulp. it does some micro-optimizations involving minus signs. There is some danger that with the kernels inline, the compiler will generate large inline code for multiple copies whose difference is just the placement of a single instruction to do the negation. If this is an optimization, then it is small. > >> XX double >> XX cospi(double x) >> XX { >> XX double_t hi, lo; > > If sizeof(double_t) indicates what I think it means, > This is slow on my Core2 duo (aka ia32 system). I doubt that it is what you think it is, since you neglect testing on i386 :-). It makes no difference on amd64, but is needed on i386 to optimized and give correct code. However, double_t is broken with clang on i386 with certain CFLAGS that give SSE. Then clang produces code that is incompatible with FLT_EVAL_METHOD and double_t. This makes clangs use of SSE a pessimization. It dodn't seem to break correctness (the result is the same as using long double instead of double_t). >> XX hi = (float)x; >> XX lo = x - hi; > > This is the splitting I use in my double version versions > with hi and lo as simply double. It will work without double_t for the splitting. double_t (or lots of slow STRICT_ASSIGN()s is needed for 2sum. 2sum is like the above except it casts to double instead of float, and compilers intentionally break casts. >> ... >> Efficiency is very good in some cases, but anomalous in others: all >> times in cycles, on i386, on the range [0, 0.25] >> >> athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 >> cos: 61-62 44 43 >> cospi: 69-71 (8-9 extra) 78 (anomalous...) 42 (faster to do more!) >> ... > > I is unclear how you're making your measurements. My timings > with my kernels compared to kernels based on your code: Basically by timing a for loop, but with many refinements to make sure that I am timing the function and not the loop, or at least to make sure that the loop timing doesn't change. For example, on amd64 and i386, the stack is aligned at 4K plus a smaller adjustable parameter since otherwise the timing for long doubles is too variable. Also, rebuilding all the object files used with identical CFLAGS (usually -O -march=better-than-native). This probably makes my tests run 20-50 cycles per iteration faster (the simplest non-inline function takes 8 or 9 cycles and that takes running it in parallel with itself). > | Bruce | Steve > ------+--------------+-------------- > sinpi | 0.0742 (148) | 0.0633 (126) > cospi | 0.0720 (144) | 0.0513 (102) > > First number is microseconds per call and the (xxx) is the time*cpu_freq. The difference might be loop overhead or loss of parallelism. IIRC, you use something to serialize after every function. I have 4 variations and rarely use the serialization ones. Or thie difference might be more that the kernels are not inlined. I forgot to inline them too. I always rebuild the kernels for every test, so they get compiled with non-default CFLAGS, but they don't get inlined without #include statements. > As far as over-engineering, for sinpi I find > > sinpi Bruce kernel Steve kernel > MAX ULP: 0.73021263 0.73955815 > Total tested: 33554431 33554431 > 0.7 < ULP <= 0.8: 154 280 > 0.6 < ULP <= 0.7: 27650 29197 See, the kernels are already tuned almost as much as possible for accuracy/efficiency tradeoffs. > cospi is much more interesting and as you state above more > difficult to get right. I have reworked my kernel, yet, > but I find > > cospi Bruce kernel Steve kernel > MAX ULP: 0.78223389 0.89921787 > Total tested: 33554431 33554431 > 0.8 < ULP <= 0.9: 0 3262 > 0.7 < ULP <= 0.8: 9663 68305 > 0.6 < ULP <= 0.7: 132948 346214 > > Perhaps, using double_t would reduce my max ULP at the expense > of speed. On i386, it should reduce the error at the expense of slowness. On amd64, it makes no difference, but using long double should reduct the error at the expense of speed. Usually the reduction in the maximum error is only about 0.1 ulps unless the algorithm uses extra precision intentionally. This is available on i386, but is not large enough to compensate for the slowness. My example of log10*() unintentionally gave an example of using extra precision intentionally for efficiency. Here it is for log10(): XX double XX log10(double x) XX { XX struct Double r; XX double_t hi, lo; XX XX DOPRINT_START(&x); XX k_log(x, &r); XX if (!r.lo_set) XX RETURNP(r.hi); XX if (0 && sizeof(double_t) > sizeof(double)) XX RETURNP((invln10_lo + (double_t)invln10_hi) * (r.lo + r.hi)); This avoids the emulated extra-precision calculations of possible. This is under 'if (0)' since it is too hard to determine if extra precision is available. The sizeof() calculation doesn't tell you if it is. On i386, double_t is long double, but extra precision is not available under FreeBSD unless the program used fpsetprec(FP_PE) to change the default of double precision. This is not under 'if (0)' for float precision. Then it is assume that the program doesn't use fpsetprec(FP_PS) to change the default to float precision. XX _2sumF(r.hi, r.lo); XX hi = (float)r.hi; XX lo = r.lo + (r.hi - hi); XX RETURN2P(invln10_hi * hi, XX (invln10_lo + invln10_hi) * lo + invln10_lo * hi); XX } i386/i387, when used as designed, can execute this function much faster than SSE by not executing the lines beginning with _2sumF(). These lines take 20-40 cycles (latency) on modern SSE. Throughput is better. This would probably be a good optimization for amd64 too (using long double instead of double_t for the extra precision). The Intel ia64 math library uses this method of switching to extended precision often. ia64 has a better FP design. I don't know its details, but it seems to combine the best features of i387 and SSE. The FP registers should have extra precision as on i387, but the instructions should make it efficient to not use the extra precision for languages and programmers that don't want it. Bruce