Date: Sat, 29 Apr 2017 11:42:09 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429184209.GB41420@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429194239.P3294@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429194239.P3294@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)

On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
> 
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>> 
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant comparing with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already handle
> > extra precision using approximations like:
> >
> >    sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >    sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
> 
> But that is over-engineered too.
> 
> Using the standard kernels is easy and works well:

Since your code only works on the interval [0,0.25], I took
the liberty of using it as a __kernel_sinpi and __kernel_cospi.

> XX double
> XX cospi(double x)
> XX {
> XX 	double_t hi, lo;

If sizeof(double_t) indicates what I think it does,
this is slow on my Core2 Duo (aka ia32 system).

> XX 	hi = (float)x;
> XX 	lo = x - hi;

This is the splitting I use in my double versions,
with hi and lo as plain double.

> XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
> XX 	hi = pi_hi * hi;
> XX 	_2sumF(hi, lo);
> XX 	return __kernel_cos(hi, lo);
> XX }
> XX 
> 
> I only did a sloppy accuracy test for sinpi().  It was 0.03 ulps less
> accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
> sin().
> 
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   61-62                 44                 43
> cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
> sin:   59-60                 51                 37
> sinpi: 67-68 (8 extra)       80                 42
> tan:   136-172               93-195             67-94
> tanpi: 144-187 (8-15 extra)  145-176            61-189
> 
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   84-85                 69                 79
> cospi: 103-104 (19-21 extra) 117                94
> sin:   75-76                 89                 77
> sinpi: 105-106 (30 extra)    116                90
> tan:   168-170               167-168            147
> tanpi: 191-194 (23-24 extra) 191                154

It is unclear how you're making your measurements.  My timings
with my kernels compared to kernels based on your code:

      |   Bruce      |   Steve
------+--------------+--------------
sinpi | 0.0742 (148) | 0.0633 (126)
cospi | 0.0720 (144) | 0.0513 (102)

The first number is microseconds per call; the number in parentheses
is time*cpu_freq, i.e. cycles per call.
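The conversion is just a multiplication.  A small sketch, where the
helper name is hypothetical and the 2000 MHz in the usage note is an
assumed clock rate that is merely consistent with the table, not a
measured value:

```c
/*
 * Convert a per-call time in microseconds to CPU cycles.
 * cpu_mhz is the clock rate in MHz (cycles per microsecond).
 */
static double
usec_to_cycles(double usec_per_call, double cpu_mhz)
{
	return usec_per_call * cpu_mhz;
}
```

With an assumed 2000 MHz clock, usec_to_cycles(0.0742, 2000.0) gives
about 148 cycles.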

As for over-engineering, for sinpi I find

sinpi             Bruce kernel    Steve kernel
         MAX ULP: 0.73021263      0.73955815
    Total tested: 33554431        33554431
0.7 < ULP <= 0.8: 154             280
0.6 < ULP <= 0.7: 27650           29197
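For reference, a sketch of how a ULP error and the bins in a table
like this can be computed.  This is not my actual test harness; the
helper names are hypothetical, the reference value is assumed to come
from something more precise than double (long double here, which only
helps where long double is wider than double), and subnormal results
are ignored.

```c
#include <float.h>
#include <math.h>

/* Error of a double result, in units of the last place of want. */
static double
ulp_error(double got, long double want)
{
	long double ulp;
	int e;

	if (want == 0)
		return got == 0 ? 0 : INFINITY;
	(void)frexpl(want, &e);		/* |want| is in [2^(e-1), 2^e) */
	ulp = ldexpl(1.0L, e - DBL_MANT_DIG);	/* double spacing there */
	return (double)(fabsl((long double)got - want) / ulp);
}

/* Bin index: 0 for (0.6, 0.7], 1 for (0.7, 0.8], -1 otherwise. */
static int
ulp_bin(double ulps)
{
	if (ulps > 0.7 && ulps <= 0.8)
		return 1;
	if (ulps > 0.6 && ulps <= 0.7)
		return 0;
	return -1;
}
```

Looping over the test arguments, tracking the maximum of ulp_error()
and incrementing a counter per ulp_bin() result reproduces the layout
of the table.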

cospi is much more interesting and, as you state above, more
difficult to get right.  I have not reworked my kernel yet,
but I find

cospi             Bruce kernel    Steve kernel
         MAX ULP: 0.78223389      0.89921787
    Total tested: 33554431        33554431
0.8 < ULP <= 0.9: 0               3262
0.7 < ULP <= 0.8: 9663            68305
0.6 < ULP <= 0.7: 132948          346214

Perhaps using double_t would reduce my max ULP at the expense
of speed.
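Whether double_t costs anything depends on the platform.  A quick
sketch to check what it is on a given system; the helper names are
hypothetical, and the only things used are C99's double_t and
FLT_EVAL_METHOD.  On x87-based i386 targets double_t is commonly
long double, while on SSE2-based targets it is commonly plain double.

```c
#include <float.h>
#include <math.h>
#include <stdio.h>

/* Nonzero if double_t has a larger storage size than double. */
static int
double_t_is_wider(void)
{
	return sizeof(double_t) > sizeof(double);
}

/* Print the evaluation method and the sizes of the two types. */
static void
report_double_t(void)
{
	printf("FLT_EVAL_METHOD  = %d\n", (int)FLT_EVAL_METHOD);
	printf("sizeof(double)   = %zu\n", sizeof(double));
	printf("sizeof(double_t) = %zu\n", sizeof(double_t));
}
```

If double_t_is_wider() returns nonzero, intermediate results held in
double_t carry extra bits, which is where both the possible accuracy
gain and the possible slowdown would come from.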

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow