Date: Sat, 13 May 2017 13:55:17 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions
Message-ID: <20170513205517.GA91911@troutmask.apl.washington.edu>
In-Reply-To: <20170429194239.P3294@besplex.bde.org>

On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>>
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant compared with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already
> > handle extra precision using approximations like:
> >
> >     sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >     sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
>
> But that is over-engineered too.
>
> Using the standard kernels is easy and works well:

Maybe works well.  See below.
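For reference, my reading of the scheme Bruce describes above is roughly
the untested sketch below.  The (float) casts standing in for a proper
head/tail split, and the explicit renormalization at the end, are only
shorthand; real code would use the bit-twiddling macros from
math_private.h, which is also where __kernel_sin() is declared.

    /*
     * Untested sketch: sin(pi*x) for x already reduced to [0, 0.25],
     * written as pi*x = hi + lo and handed to the standard kernel.
     */
    double __kernel_sin(double, double, int);   /* from math_private.h */

    static double
    sinpi_sketch(double x)
    {
        static const double pi = 3.14159265358979323846;
        double p_hi, p_lo, xhi, xlo, hi, lo, t;

        p_hi = (float)pi;               /* 24-bit head of pi */
        p_lo = pi - p_hi;               /* tail of pi */
        xhi = (float)x;                 /* 24-bit head of x */
        xlo = x - xhi;                  /* tail of x */

        hi = p_hi * xhi;                /* exact: 24x24 bits fit in 53 */
        lo = p_hi * xlo + p_lo * x;     /* remainder of pi*x */

        t = hi + lo;                    /* renormalize so lo is a true tail */
        lo = (hi - t) + lo;
        hi = t;

        return (__kernel_sin(hi, lo, 1));
    }

With x in [0, 0.25], pi*x stays inside the kernel's [-pi/4, pi/4]
domain, so no further argument reduction is needed at that point.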
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
>
>          athlon-xp, gcc-3.3      Haswell, gcc-3.3    Haswell, gcc-4.2.1
> cos:     61-62                   44                  43
> cospi:   69-71 (8-9 extra)       78 (anomalous...)   42 (faster to do more!)
> sin:     59-60                   51                  37
> sinpi:   67-68 (8 extra)         80                  42
> tan:     136-172                 93-195              67-94
> tanpi:   144-187 (8-15 extra)    145-176             61-189
>
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
>
>          athlon-xp, gcc-3.3      Haswell, gcc-3.3    Haswell, gcc-4.2.1
> cos:     84-85                   69                  79
> cospi:   103-104 (19-21 extra)   117                 94
> sin:     75-76                   89                  77
> sinpi:   105-106 (30 extra)      116                 90
> tan:     168-170                 167-168             147
> tanpi:   191-194 (23-24 extra)   191                 154
>
> This also indicates that the longest times for tan in the throughput
> test are what happens when the function doesn't run in parallel with
> itself.  The high-degree polynomial and other complications in tan()
> are too complicated for much cross-function parallelism.
>
> Anyway, it looks like the cost of using the kernel is at most 8-9
> in the parallel case and at most 30 in the serial case.  The extra-
> precision code has about 10 dependent instructions, so it is
> doing OK to take 30.

Based on other replies in this email exchange, I have gone back and
looked at improvements to my __kernel_{cos|sin|tan}pi[fl] routines.
The improvements were to both accuracy and speed.  I have tested on
i686 and x86_64 systems with libm built with
-O2 -march=native -mtune=native.

My timing loop is of the form

    float dx, f, x;
    long i, k;

    f = 0;
    k = 1 << 23;                        /* number of calls */
    dx = (xmax - xmin) / (k - 1);       /* xmin, xmax bound the interval */
    time_start();
    for (i = 0; i < k; i++) {
        x = xmin + i * dx;
        f += cospif(x);                 /* accumulate to keep the call live */
    }
    time_end();
    x = (time_cpu() / k) * 1.e6;
    printf("cospif time: %.4f usec per call\n", x);
    if (f == 0) printf("Can't happen!\n");  /* defeat dead-code elimination */

where time_start(), time_end(), and time_cpu() are my local timing
helpers.  The assumption here is that the loop overhead is the same
for all tested kernels.

Test intervals for the kernels:

    float:  [0x1p-14, 0.25]
    double: [0x1p-29, 0.25]
    ld80:   [0x1p-34, 0.25]

     Core2 Duo T7250 @ 2.00GHz    ||  AMD FX8350 Eight-Core CPU
      (1995.05-MHz 686-class)     ||  (4018.34-MHz K8-class)
----------------------------------++--------------------------
       | Horner | Estrin | Fdlibm || Horner | Estrin | Fdlibm
-------+--------+--------+--------++--------+--------+--------
cospif | 0.0223 |        | 0.0325 || 0.0112 |        | 0.0085
sinpif | 0.0233 | Note 1 | 0.0309 || 0.0125 |        | 0.0085
tanpif | 0.0340 |        | Note 2 || 0.0222 |        |
-------+--------+--------+--------++--------+--------+--------
cospi  | 0.0641 | 0.0571 | 0.0604 || 0.0157 | 0.0142 | 0.0149
sinpi  | 0.0722 | 0.0626 | 0.0712 || 0.0178 | 0.0161 | 0.0166
tanpi  | 0.1049 | 0.0801 |        || 0.0323 | 0.0238 |
-------+--------+--------+--------++--------+--------+--------
cospil | 0.0817 | 0.0716 | 0.0921 || 0.0558 | 0.0560 | 0.0755
sinpil | 0.0951 | 0.0847 | 0.0994 || 0.0627 | 0.0568 | 0.0768
tanpil | 0.1310 | 0.1004 |        || 0.1005 | 0.0827 |
-------+--------+--------+--------++--------+--------+--------

Times are in usec/call.

Note 1. Re-arranging the float polynomials for Estrin's method showed
no appreciable benefit, so those timings are not listed (a generic
sketch of the grouping I mean is appended at the end of this message).

Note 2. I have been unable to use the tan[fl] kernels to implement
satisfactory kernels for tanpi[fl].  In particular, for x in
[0.25, 0.5], using the tanf kernel leads to errors of six-digit ULPs
near x = 0.5, whereas my kernel stays near 2 ULP.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow
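P.S.  For anyone wondering what the Estrin column means in practice:
it is only a re-grouping of the same polynomial to shorten the
dependency chains.  A generic, untested illustration for a degree-7
polynomial in z follows; C0 through C7 are placeholders, not the
coefficients from my kernels.

    #define C0 1.0      /* placeholder coefficients, not the real ones */
    #define C1 1.0
    #define C2 1.0
    #define C3 1.0
    #define C4 1.0
    #define C5 1.0
    #define C6 1.0
    #define C7 1.0

    /* Horner form: one long dependent chain of multiply-adds. */
    static double
    poly_horner(double z)
    {
        return (C0 + z * (C1 + z * (C2 + z * (C3 +
            z * (C4 + z * (C5 + z * (C6 + z * C7)))))));
    }

    /*
     * Estrin-style grouping: pair the terms and combine them with z^2
     * and z^4 so independent multiply-adds can execute in parallel.
     */
    static double
    poly_estrin(double z)
    {
        double z2, z4;

        z2 = z * z;
        z4 = z2 * z2;
        return ((C0 + z * C1) + z2 * (C2 + z * C3) +
            z4 * ((C4 + z * C5) + z2 * (C6 + z * C7)));
    }

The Estrin form costs a couple of extra multiplications, so whether it
wins depends on the degree of the polynomial and on the CPU; the short
float polynomials give the scheduler little to work with, which is
consistent with Note 1.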