From owner-freebsd-numerics@freebsd.org  Tue Apr 25 06:36:45 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 37F36D4F513;
 Tue, 25 Apr 2017 06:36:45 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2199F1D18;
 Tue, 25 Apr 2017 06:36:45 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3P6ah72046094
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Mon, 24 Apr 2017 23:36:43 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3P6ahPV046093;
 Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk)
Date: Mon, 24 Apr 2017 23:36:43 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170425063643.GA45906@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170425143120.R933@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Apr 2017 06:36:45 -0000

On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
> 
> > I have what appears to be working versions of asinpi[fl].  It
> > was suggested elsewhere that using an Estrin's method to
> > sum the polynomial approximations instead of Horner's method
> > may allow modern CPUs to better schedule generated code.
> > I have implemented an Estrin's-like method for sinpi[l]
> > and cospi[l], and indeed the generated code is faster on
> > my Intel core2 duo with only a slight degradation in the
> > observed max ULP.  I'll post new versions to bugzilla in
> > the near future.
> 
> I didn't remember Estrin, but checked Knuth and see that his
> method is a subset of what I use routinely.  I might have learned
> it from Knuth.  I tried all methods in Knuth, but they didn't seem
> to be any better at first.  Later I learned more about scheduling.
> Estrin's method now seems routine as a subset of scheduling.

I think I came across this in Knuth as well, but I may have
picked of the name Estrin from Mueller's book (or google :).

> 
> You already used it in expl: for ld80:
> 

I recall that that is your handy work.

> X 		x2 = x * x;
> X 		x4 = x2 * x2;
> 
> Unlike in Knuth's description of Estrin, we don't try to order the
> operations to suit the CPU(s) in the code, but just try to keep them
> independent, as required for them to execute independently.  Compilers
> will rearrange them anyway, sometimes even correctly, but it makes
> little difference (on OOE CPUs) to throw all of the operations into
> the pipeline and see how fast something comes out).
> 
> X 		q = x4 * (x2 * (x4 *
> X 		    /*
> X 		     * XXX the number of terms is no longer good for
> X 		     * pairwise grouping of all except B3, and the
> X 		     * grouping is no longer from highest down.
> X 		     */
> X 		    (x2 *            B12  + (x * B11 + B10)) +
> X 		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
> X 			  (x * B5 +  B4.e)) + x2 * x * B3.e;
> 
> This is arranged to look a bit like Horner's method at first (operations
> not written in separate statements), but it is basically Estrin's method
> with no extensions to more complicated sub-expressions than Estrin's,
> but complications for accuracy:
> - I think accuracy for the B3 term is unimportant, but for some reason
>    (perhaps just the one in the comment) it is grouped separately.  Estrin
>    might use x3 * (B3.e + x * B4.e).

Yes, it is most likely an accuracy issue.  I found something 
similar with sinpi/cospi.  My k_sinpi.c has

#if 0
        double x2;
        /* Horner's method. */
        x2 = x * x;
        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
            + s2) + s1;
#else
        double x2, x4;
        /* Sort of Estrin's method. */
        x2 = x * x;
        x4 = x2 * x2;
        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
#endif

The addition of s1 must occur last, or I get greater 1 ULP
for a number of values.

My timing show on Intel core2 duo

/* Horner's methods. */
laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.066 usec per call
cospi time: 0.063 usec per call

/* Estrin's methods. */
laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.059 usec per call
cospi time: 0.055 usec per call

Here, -n 22 means to test 2**22 - 1 values in the interval
[0,0.25].

For the older functions that were committed a few years
ago, I should probably go back to recheck timings and accuracy
with and without Estrin's method.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow