Skip site navigation (1)Skip section navigation (2)
Date:      Sat, 29 Apr 2017 11:42:09 -0700
From:      Steve Kargl <sgk@troutmask.apl.washington.edu>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject:   Re: Implementation of half-cycle trignometric functions
Message-ID:  <20170429184209.GB41420@troutmask.apl.washington.edu>
In-Reply-To: <20170429194239.P3294@besplex.bde.org>
References:  <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429194239.P3294@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
> 
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>> 
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant comparing with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already handle
> > extra precision using approximations like:
> >
> >    sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >    sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
> 
> But that is over-engineered too.
> 
> Using the standard kernels is easy and works well:

As your code only works on the interval [0,0.25], I took
the liberty to use it as a __kernel_sinpi and __kernel_cospi.

> XX double
> XX cospi(double x)
> XX {
> XX 	double_t hi, lo;

If sizeof(double_t) indicates what I think it means,
This is slow on my Core2 duo (aka ia32 system).

> XX 	hi = (float)x;
> XX 	lo = x - hi;

This is the splitting I use in my double version versions
with hi and lo as simply double.

> XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
> XX 	hi = pi_hi * hi;
> XX 	_2sumF(hi, lo);
> XX 	return __kernel_cos(hi, lo);
> XX }
> XX 
> 
> I only did a sloppy accuracy test for sinpi().  It was 0.03 ulps less
> accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
> sin().
> 
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   61-62                 44                 43
> cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
> sin:   59-60                 51                 37
> sinpi: 67-68 (8 extra)       80                 42
> tan:   136-172               93-195             67-94
> tanpi: 144-187 (8-15 extra)  145-176            61-189
> 
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   84-85                 69                 79
> cospi: 103-104 (19-21 extra) 117                94
> sin:   75-76                 89                 77
> sinpi: 105-106 (30 extra)    116                90
> tan:   168-170               167-168            147
> tanpi: 191-194 (23-24 extra) 191                154

I is unclear how you're making your measurements.   My timings
with my kernels compared to kernels based on your code:

      |   Bruce      |   Steve
------+--------------+--------------
sinpi | 0.0742 (148) | 0.0633 (126)
cospi | 0.0720 (144) | 0.0513 (102)

First number is microseconds per call and the (xxx) is the time*cpu_freq.

As far as over-engineering, for sinpi I find

sinpi            Bruce kernel    Steve kernel
         MAX ULP: 0.73021263     0.73955815
    Total tested: 33554431       33554431
0.7 < ULP <= 0.8: 154            280
0.6 < ULP <= 0.7: 27650          29197

cospi is much more interesting and as you state above more
difficult to get right.  I have reworked my kernel, yet,
but I find

cospi             Bruce kernel    Steve kernel
         MAX ULP: 0.78223389      0.89921787
    Total tested: 33554431        33554431
0.8 < ULP <= 0.9: 0               3262
0.7 < ULP <= 0.8: 9663            68305
0.6 < ULP <= 0.7: 132948          346214

Perhaps, using double_t would reduce my max ULP at the expense
of speed. 

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20170429184209.GB41420>