From owner-freebsd-numerics@freebsd.org  Sun Apr 23 18:40:47 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 110DCD4DF8A
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Sun, 23 Apr 2017 18:40:47 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 00A9A1F97
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:47 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIeki6020766
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:46 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-numerics@FreeBSD.org
Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc.
Date: Sun, 23 Apr 2017 18:40:47 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: bin
X-Bugzilla-Version: 8.3-STABLE
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: jbeich@FreeBSD.org
X-Bugzilla-Status: In Progress
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: Normal
X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: attachments.mimetype
Message-ID: <bug-170206-18636-lHP4ZtbFjc@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
References: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Apr 2017 18:40:47 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206

Jan Beich <jbeich@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #126474|text/x-csrc                 |text/plain
          mime type|                            |

--=20
You are receiving this mail because:
You are the assignee for the bug.=

From owner-freebsd-numerics@freebsd.org  Sun Apr 23 18:40:52 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id EB0F6D4DFA0
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Sun, 23 Apr 2017 18:40:52 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id DAA9E1FB2
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:52 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIeqES020905
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:52 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-numerics@FreeBSD.org
Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc.
Date: Sun, 23 Apr 2017 18:40:52 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: bin
X-Bugzilla-Version: 8.3-STABLE
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: jbeich@FreeBSD.org
X-Bugzilla-Status: In Progress
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: Normal
X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: attachments.mimetype
Message-ID: <bug-170206-18636-EofnpKGtE8@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
References: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Apr 2017 18:40:53 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206

Jan Beich <jbeich@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #126475|text/x-csrc                 |text/plain
          mime type|                            |

--=20
You are receiving this mail because:
You are the assignee for the bug.=

From owner-freebsd-numerics@freebsd.org  Sun Apr 23 18:40:58 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 913D3D4DFB7
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Sun, 23 Apr 2017 18:40:58 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 80EAC1FD1
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:58 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIewn1021065
 for <freebsd-numerics@FreeBSD.org>; Sun, 23 Apr 2017 18:40:58 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-numerics@FreeBSD.org
Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc.
Date: Sun, 23 Apr 2017 18:40:58 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: bin
X-Bugzilla-Version: 8.3-STABLE
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Only Me
X-Bugzilla-Who: jbeich@FreeBSD.org
X-Bugzilla-Status: In Progress
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: Normal
X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: attachments.mimetype
Message-ID: <bug-170206-18636-PuHdv7taiY@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
References: <bug-170206-18636@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 23 Apr 2017 18:40:58 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206

Jan Beich <jbeich@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
 Attachment #126478|text/x-csrc                 |text/plain
          mime type|                            |

--=20
You are receiving this mail because:
You are the assignee for the bug.=

From owner-freebsd-numerics@freebsd.org  Mon Apr 24 23:55:39 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id D117ED4EB18;
 Mon, 24 Apr 2017 23:55:39 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9E412A4E;
 Mon, 24 Apr 2017 23:55:39 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3ONRhmP054815
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Mon, 24 Apr 2017 16:27:43 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3ONRhSI054814;
 Mon, 24 Apr 2017 16:27:43 -0700 (PDT) (envelope-from sgk)
Date: Mon, 24 Apr 2017 16:27:43 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-numerics@freebsd.org
Cc: freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170424232743.GA54621@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170413171248.GA86780@troutmask.apl.washington.edu>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 24 Apr 2017 23:55:39 -0000

On Thu, Apr 13, 2017 at 10:12:48AM -0700, Steve Kargl wrote:
> On Sun, Apr 09, 2017 at 03:08:09PM -0700, Steve Kargl wrote:
> > Both IEEE-754 2008 and ISO/IEC TS 18661-4 define the half-cycle
> > trignometric functions cospi, sinpi, and tanpi.  The attached
> > patch implements cospi[fl], sinpi[fl], and tanpi[fl].  Limited
> > testing on the cospi and sinpi reveal a max ULP less than 0.89;
> > while tanpi is more problematic with a max ULP less than 2.01 
> > in the interval [0,0.5].  The algorithms used in these functions
> > are documented in {ks}_cospi.c, {ks}_sinpi.c, and s_tanpi.c.
> > 
> > Note 1.  ISO/IEC TS 18661-4 says these funstions are guarded by
> > a predefine macro.  I have no idea or interest in what clang and
> > gcc do with regards to this macro.  I've put the functions behind
> > __BSD_VISIBLE.
> > 
> > Note 2.  I no longer have access to a system with ld128 and 
> > adequate support to compile and test the ld128 implementations
> > of these functions.  Given the almost complete lack of input from
> > others on improvements to libm, I doubt that anyone cares.  If 
> > someone does care, the ld128 files contain a number of FIXME comments,
> > and in particular, while the polynomial coefficients are given
> > I did not update the polynomial algorithms to properly use the
> > coefficients.
> > 
> > The code is attached the bug reportr.
> > 
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514
> > 
> 
> While everyone is busy reviewing and testing the patch available
> in bugzilla, I suspect some may be wondering about the inverse
> half-cycle trignometric functions.  I have worked out an algorithm
> for asinpi[fl] and have a working implemenation of asinpif(x).  
> It will take a couple of weeks (due to limited available time)
> before I can submit asinpi[fl], acospi[fl], and atanpi[fl], but
> work is in progress.
> 

I have what appears to be working versions of asinpi[fl].  It
was suggested elsewhere that using an Estrin's method to
sum the polynomial approximations instead of Horner's method
may allow modern CPUs to better schedule generated code.
I have implemented an Estrin's-like method for sinpi[l]
and cospi[l], and indeed the generated code is faster on
my Intel core2 duo with only a slight degradation in the
observed max ULP.  I'll post new versions to bugzilla in
the near future. 

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Tue Apr 25 05:30:44 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5C6A0D4F358;
 Tue, 25 Apr 2017 05:30:44 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au
 [211.29.132.246])
 by mx1.freebsd.org (Postfix) with ESMTP id BB1E81D90;
 Tue, 25 Apr 2017 05:30:43 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 40EE142B512;
 Tue, 25 Apr 2017 15:13:45 +1000 (AEST)
Date: Tue, 25 Apr 2017 15:13:44 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170424232743.GA54621@troutmask.apl.washington.edu>
Message-ID: <20170425143120.R933@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=VIhQITioF-Sxmys2TWEA:9 a=s4eJatwz83rb22bD:21
 a=v3ws3rOyY_mnV_cf:21 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Apr 2017 05:30:44 -0000

On Mon, 24 Apr 2017, Steve Kargl wrote:

> I have what appears to be working versions of asinpi[fl].  It
> was suggested elsewhere that using an Estrin's method to
> sum the polynomial approximations instead of Horner's method
> may allow modern CPUs to better schedule generated code.
> I have implemented an Estrin's-like method for sinpi[l]
> and cospi[l], and indeed the generated code is faster on
> my Intel core2 duo with only a slight degradation in the
> observed max ULP.  I'll post new versions to bugzilla in
> the near future.

I didn't remember Estrin, but checked Knuth and see that his
method is a subset of what I use routinely.  I might have learned
it from Knuth.  I tried all methods in Knuth, but they didn't seem
to be any better at first.  Later I learned more about scheduling.
Estrin's method now seems routine as a subset of scheduling.

You already used it in expl: for ld80:

X 		x2 = x * x;
X 		x4 = x2 * x2;

Unlike in Knuth's description of Estrin, we don't try to order the
operations to suit the CPU(s) in the code, but just try to keep them
independent, as required for them to execute independently.  Compilers
will rearrange them anyway, sometimes even correctly, but it makes
little difference (on OOE CPUs) to throw all of the operations into
the pipeline and see how fast something comes out).

X 		q = x4 * (x2 * (x4 *
X 		    /*
X 		     * XXX the number of terms is no longer good for
X 		     * pairwise grouping of all except B3, and the
X 		     * grouping is no longer from highest down.
X 		     */
X 		    (x2 *            B12  + (x * B11 + B10)) +
X 		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
X 			  (x * B5 +  B4.e)) + x2 * x * B3.e;

This is arranged to look a bit like Horner's method at first (operations
not written in separate statements), but it is basically Estrin's method
with no extensions to more complicated sub-expressions than Estrin's,
but complications for accuracy:
- I think accuracy for the B3 term is unimportant, but for some reason
   (perhaps just the one in the comment) it is grouped separately.  Estrin
   might use x3 * (B3.e + x * B4.e).
- accuracy is important for B0 through B2 terms, and we don't use Estrin
   for them.  Efficiency is a matter of running the above in parallel
   with the evaluation of lower terms.  The above shouldn't steal any
   resources from evaluation of the lower terms unless it is the slow
   part.  Normally you want N (2 independent streams to complete at the
   same time ready for a final addition).

The above also does some power-of-2 grouping which has been messed up
by changing the number of terms in the polynomial (see the comment).
Generally this costs time time of 1 operation for each missed optimization.

The above hasn't been rearranged since it is a lot of work to keep
rearranging it during development and development stalled after committing
it.  The ld128 cases has larger problems from this.  It has 2 sets of
polynomials, both higher order, with sub-optimal numbers of terms.

Here is my program for testing the results of quadratic grouping on x86.
It shows little qualitative difference between Athlon-XP and Haswell.
Haswell is just faster:

XX #include <stdlib.h>
XX 
XX static __inline void
XX cpuid(int ax)
XX {
XX 	__asm __volatile("cpuid" : "=a" (ax) : "0" (ax) : "cx", "dx", "bx");
XX }
XX #define	puid(ax)
XX 
XX double P0, P1, P2, P3;
XX double P4, P5, P6, P7;
XX double P8, P9, Pa, Pb;
XX double Pc, Pd, Pe, Pf;
XX double Q0, Q1, Q2, Q3;
XX double Q4, Q5, Q6, Q7;
XX double Q8, Q9, Qa, Qb;
XX double Qc, Qd, Qe, Qf;
XX volatile double a, r;
XX 
XX #define	FREQ	2010168298	/* sysctl -n machdep.tsc_freq */
XX 
XX #define	NITER	(FREQ / 10)
XX 
XX int
XX main(int argc, char **argv)
XX {
XX 	double x;
XX 	int i, n;
XX 
XX 	n = (argc == 1 ? 0 : atoi(argv[1]));
XX 	switch (n) {
XX 	default:
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 2:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = x*P1+P0;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*P3));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 4:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (x*P3+P2)*(x*x)+(x*P1+P0);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 5:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t32 = x*P3+P2;
XX 			double t10 = x*P1+P0;
XX 			double x2 = x*x;
XX 			r = t32*x2+t10;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -8:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*P7))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 6:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double lo = 0.0;
XX 			r = ((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			    ((x*P3+P2)*(x*x)+lo);
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 7:
XX 		for (i = 0; i < NITER; i++) {
XX 			double d = a;
XX 			double lo = 0.0;
XX 			double t = d*P7+P6;
XX 			double z = d*d;
XX 			double s = d*P5+P4;
XX 			double rl = d*P3+P2;
XX #if 0
XX 			r = lo + z * rl + z * z * s + z * z * (z * t);
XX #elif 0
XX 			r = lo + z * z * (z * t) + z * z * s + z * rl;
XX #elif 0
XX 			r = z * z * (z * t) + (z * z * s + lo) + z * rl;
XX #else
XX 			r = z * z * (z * t) + z * z * s + (z * rl + lo);
XX #endif
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 8:
XX 		for (i = 0; i < NITER; i++) {
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 9:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			r = ((t76    )*x2+(t54    ))*(x2*x2) +
XX 			    ((t32    )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 10:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double t54 = x*P5+P4;
XX 			double x2 = x*x;
XX 			r = ((t76    )*x2+(t54    ))*(x2*x2) +
XX 			    ((t32    )*x2+(x*P1+P0));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 11:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double t76 = x*P7+P6;
XX 			double t54 = x*P5+P4;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double t7654 = t76*x2+t54;
XX 			double t3210 = t32*x2+(x*P1+P0);
XX 			r = t7654*(x2*x2) + t3210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf)))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 16:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = (((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			     ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			     (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			    (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			     ((x*P3+P2)*(x*x)+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 17:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double t32 = x*P3+P2;
XX 			double x2 = x*x;
XX 			double tdc = x*Pd+Pc;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			r = (((tfe    )*x2+(tdc    ))*(x2*x2) +
XX 			     ((tba    )*x2+(t98    )))*((x2*x2)*(x2*x2)) +
XX 			    (((t76    )*x2+(t54    ))*(x2*x2) +
XX 			     ((t32    )*x2+(x*P1+P0)));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 18:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double tfe = x*Pf+Pe;
XX 			double tba = x*Pb+Pa;
XX 			double t76 = x*P7+P6;
XX 			double tdc = x*Pd+Pc;
XX 			double x2 = x*x;
XX 			double t32 = x*P3+P2;
XX 			double t98 = x*P9+P8;
XX 			double t54 = x*P5+P4;
XX 			double t10 = x*P1+P0;
XX 			double x4 = x2*x2;
XX 			double tfedc = x2*tfe+tdc;
XX 			double tba98 = x2*tba+t98;
XX 			double t7654 = x2*t76+t54;
XX 			double t3210 = x2*t32+t10;
XX 			double tfedcba98 = x4*tfedc+tba98;
XX 			double t76543210 = x4*t7654+t3210;
XX 			r = (x4*x4)*tfedcba98+t76543210;
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case -32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf+x*(Q0+x*(Q1+x*(Q2+x*(Q3+x*(Q4+x*(Q5+x*(Q6+x*(Q7+x*(Q8+x*(Q9+x*(Qa+x*(Qb+x*(Qc+x*(Qd+x*(Qe+x*(Qf)))))))))))))))))))))))))))))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 32:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			r = ((((x*Qf+Qe)*(x*x)+(x*Qd+Qc))*((x*x)*(x*x)) +
XX 			      ((x*Qb+Qa)*(x*x)+(x*Q9+Q8)))*
XX 			      (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			     (((x*Q7+Q6)*(x*x)+(x*Q5+Q4))*((x*x)*(x*x)) +
XX 			      ((x*Q3+Q2)*(x*x)+(x*Q1+Q0))))*
XX 			      ((((x*x)*(x*x))*((x*x)*(x*x)))*
XX 			       (((x*x)*(x*x))*((x*x)*(x*x)))) +
XX 			    ((((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) +
XX 			      ((x*Pb+Pa)*(x*x)+(x*P9+P8)))*
XX 			      (((x*x)*(x*x))*((x*x)*(x*x))) +
XX 			     (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) +
XX 			      ((x*P3+P2)*(x*x)+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	case 33:
XX 		for (i = 0; i < NITER; i++) {
XX 			x = a;
XX 			double Tfe = x*Qf+Qe;
XX 			double tfe = x*Pf+Pd;
XX 			double T76 = x*Q7+Q6;
XX 			double t76 = x*P7+P6;
XX 			double x2 = x*x;
XX 			r = ((((Tfe    )*x2+(x*Qd+Qc))*(x2*x2) +
XX 			      ((x*Qb+Qa)*x2+(x*Q9+Q8)))*((x2*x2)*(x2*x2)) +
XX 			     (((T76    )*x2+(x*Q5+Q4))*(x2*x2) +
XX 			      ((x*Q3+Q2)*x2+(x*Q1+Q0))))*(((x2*x2)*(x2*x2))*((x2*x2)*(x2*x2))) +
XX 			    ((((tfe    )*x2+(x*Pd+Pc))*(x2*x2) +
XX 			      ((x*Pb+Pa)*x2+(x*P9+P8)))*((x2*x2)*(x2*x2)) +
XX 			     (((t76    )*x2+(x*P5+P4))*(x2*x2) +
XX 			      ((x*P3+P2)*x2+(x*P1+P0))));
XX 			cpuid(0);
XX 		}
XX 		break;
XX 	}
XX 	return (0);
XX }

It only tests power-of-2 number of terms.  With say 6 or 7 terms, it is
harder to run much faster than with 8 terms, but easy to try all other
resonable groupings.  With 8-15 terms, it is harder to try enough other
groupings.

It tests using temporary variables to try to get the best order.  This
makes a little difference with gcc-3.3, less than that with gcc-4.2 and
no difference with clang.  Looking at the generated code can be instructive.
The compiler sometimes finds a good order that you might not think of.
Also, you can see where it is constrained by dependencies.

Modern x86 (SSE-mumble or AVX-mumble) have the basic Estrin operation of
B0 + x * B1 as a single operation (v*fma* in AVX).  Basic Estrin would
reduce to Horner if this were used.  But the C standard doesn't really
allow using it, and compilers don't use it on x86.  gcc did use it on
ia64.  It also takes unportable CFLAGS to get high SSE/AVX support, and
gives unportable binaries when used.  Don't use the libm fma since that
tends to be 10-100 times slower than not using it (perhaps only 10
times slower if it claims to be fast).  This libm fma is only useful
for algorithm development and cases where efficiency doesn't matter.
Fma is about half an ulp more accurate than separate operations.

Bruce

From owner-freebsd-numerics@freebsd.org  Tue Apr 25 06:36:45 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 37F36D4F513;
 Tue, 25 Apr 2017 06:36:45 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2199F1D18;
 Tue, 25 Apr 2017 06:36:45 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3P6ah72046094
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Mon, 24 Apr 2017 23:36:43 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3P6ahPV046093;
 Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk)
Date: Mon, 24 Apr 2017 23:36:43 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170425063643.GA45906@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170425143120.R933@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Apr 2017 06:36:45 -0000

On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
> 
> > I have what appears to be working versions of asinpi[fl].  It
> > was suggested elsewhere that using an Estrin's method to
> > sum the polynomial approximations instead of Horner's method
> > may allow modern CPUs to better schedule generated code.
> > I have implemented an Estrin's-like method for sinpi[l]
> > and cospi[l], and indeed the generated code is faster on
> > my Intel core2 duo with only a slight degradation in the
> > observed max ULP.  I'll post new versions to bugzilla in
> > the near future.
> 
> I didn't remember Estrin, but checked Knuth and see that his
> method is a subset of what I use routinely.  I might have learned
> it from Knuth.  I tried all methods in Knuth, but they didn't seem
> to be any better at first.  Later I learned more about scheduling.
> Estrin's method now seems routine as a subset of scheduling.

I think I came across this in Knuth as well, but I may have
picked of the name Estrin from Mueller's book (or google :).

> 
> You already used it in expl: for ld80:
> 

I recall that that is your handy work.

> X 		x2 = x * x;
> X 		x4 = x2 * x2;
> 
> Unlike in Knuth's description of Estrin, we don't try to order the
> operations to suit the CPU(s) in the code, but just try to keep them
> independent, as required for them to execute independently.  Compilers
> will rearrange them anyway, sometimes even correctly, but it makes
> little difference (on OOE CPUs) to throw all of the operations into
> the pipeline and see how fast something comes out).
> 
> X 		q = x4 * (x2 * (x4 *
> X 		    /*
> X 		     * XXX the number of terms is no longer good for
> X 		     * pairwise grouping of all except B3, and the
> X 		     * grouping is no longer from highest down.
> X 		     */
> X 		    (x2 *            B12  + (x * B11 + B10)) +
> X 		    (x2 * (x * B9 +  B8) +  (x * B7 +  B6))) +
> X 			  (x * B5 +  B4.e)) + x2 * x * B3.e;
> 
> This is arranged to look a bit like Horner's method at first (operations
> not written in separate statements), but it is basically Estrin's method
> with no extensions to more complicated sub-expressions than Estrin's,
> but complications for accuracy:
> - I think accuracy for the B3 term is unimportant, but for some reason
>    (perhaps just the one in the comment) it is grouped separately.  Estrin
>    might use x3 * (B3.e + x * B4.e).

Yes, it is most likely an accuracy issue.  I found something 
similar with sinpi/cospi.  My k_sinpi.c has

#if 0
        double x2;
        /* Horner's method. */
        x2 = x * x;
        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
            + s2) + s1;
#else
        double x2, x4;
        /* Sort of Estrin's method. */
        x2 = x * x;
        x4 = x2 * x2;
        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
#endif

The addition of s1 must occur last, or I get greater 1 ULP
for a number of values.

My timing show on Intel core2 duo

/* Horner's methods. */
laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.066 usec per call
cospi time: 0.063 usec per call

/* Estrin's methods. */
laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
sinpi time: 0.059 usec per call
cospi time: 0.055 usec per call

Here, -n 22 means to test 2**22 - 1 values in the interval
[0,0.25].

For the older functions that were committed a few years
ago, I should probably go back to recheck timings and accuracy
with and without Estrin's method.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Tue Apr 25 08:28:08 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A9E72D4EB9C;
 Tue, 25 Apr 2017 08:28:08 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 4040C1C43;
 Tue, 25 Apr 2017 08:28:07 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id B19C0104877B;
 Tue, 25 Apr 2017 18:27:57 +1000 (AEST)
Date: Tue, 25 Apr 2017 18:27:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170425063643.GA45906@troutmask.apl.washington.edu>
Message-ID: <20170425173542.S1401@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
 <20170425063643.GA45906@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=pMMk53J5QUvvM_BoBYoA:9 a=JWkNqyrBzBJkAUbv:21
 a=d3hanlBOIU_nJgXb:21 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Apr 2017 08:28:08 -0000

On Mon, 24 Apr 2017, Steve Kargl wrote:

> On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:

>> [for expl]
>> This is arranged to look a bit like Horner's method at first (operations
>> not written in separate statements), but it is basically Estrin's method
>> with no extensions to more complicated sub-expressions than Estrin's,
>> but complications for accuracy:
>> - I think accuracy for the B3 term is unimportant, but for some reason
>>    (perhaps just the one in the comment) it is grouped separately.  Estrin
>>    might use x3 * (B3.e + x * B4.e).
>
> Yes, it is most likely an accuracy issue.  I found something
> similar with sinpi/cospi.  My k_sinpi.c has
>
> #if 0
>        double x2;
>        /* Horner's method. */
>        x2 = x * x;
>        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
>            + s2) + s1;

Horner's of course has about as many dependencies as operations (13), so
on modern x86 this is limited by the latency of 3-5 cycles/instructions
or 39-65 cycles.  Multiplications take an extra cycle in more cases, so
it is better to have more additions, but unfortunately Horner gives 1
more multuplication.

With no dependencies, the limit would be throughput -- 2 instructions/cycle
of 6.5 cycles rounded up to 7; then add latency for the last operation to
get about 11.

With practical dependencies, it is hard to get more than a 2-fold speedup.

> #else
>        double x2, x4;
>        /* Sort of Estrin's method. */
>        x2 = x * x;
>        x4 = x2 * x2;
>        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
>            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> #endif

This looks OK.  The only obvious improvement is to try grouping x4*x4
so that it can be evaluated separately.  This reduces the length of
dependency chains by 1, but increases early pressure on multiplication.
Don't write this as x8 = x4*x4.  Compilers will do the equivalent.
You can also replace x4 by (x2*x2) but there are many x4's so this
might be less readable.

> The addition of s1 must occur last, or I get greater 1 ULP
> for a number of values.

Indeed.  This prevents using the best order, which is the opposite.
You used the best order for the other inner terms, but not so obviously
for the first outer addition.  However, the compiler is free to reorder
that: consider:

 	T2 + T1 + T0

T2 is more complicated and everything depends on it, so should be started
first, but possibly T1 can be started first because it is simpler.  Here
it is not significantly simpler.  However, for:

 	T3 + T2 + T1 + T0

the natural left-to-right grouping is usually pessimial.  This should be
written as:

 	T3 + (T2 + T1) + T0

Unfortunately, we can't write this naturally and usually optimally as:

 	T0 + T1 + T2 + T3

due to the accuracy problem with T0.  Maybe right this as:

 	T1 + T2 + T3 + T0	// only T0 in unnatural order

For expl, I tried all orders.

> My timing show on Intel core2 duo
>
> /* Horner's methods. */
> laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.066 usec per call
> cospi time: 0.063 usec per call
>
> /* Estrin's methods. */
> laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.059 usec per call
> cospi time: 0.055 usec per call
>
> Here, -n 22 means to test 2**22 - 1 values in the interval
> [0,0.25].

The improvement is marginal.   Only worth doing if it doesn't complicate
the code much.  Older CPUs can't exploit te parallelism so well, so I
would expect a relatively larger improvement.  Don't measure in usec,
since they are hard to scale.  0.66 usec and 1.86 GHz is more than 100
cycles.  This is quite slow.  I get similar times for cos and sin on an
older Athlon-XP (~40 in float prec, ~160 in double and ~270 in long
double).  Newer CPUs are about 4 times faster in double prec (only 2
times fast in float prec) in cycles despite having similar nominal
timing, by unclogging their pipes.

We should also try to use the expression A*x + B more often so that
things with be faster when this is done by fma, but this happens
naturally anyway.

Bruce

From owner-freebsd-numerics@freebsd.org  Wed Apr 26 03:23:58 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4FE8CD50FC3;
 Wed, 26 Apr 2017 03:23:58 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 2B09C9C;
 Wed, 26 Apr 2017 03:23:58 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3Q3No8O087663
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Tue, 25 Apr 2017 20:23:50 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3Q3NoEL087662;
 Tue, 25 Apr 2017 20:23:50 -0700 (PDT) (envelope-from sgk)
Date: Tue, 25 Apr 2017 20:23:50 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170426032350.GA87524@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
 <20170425063643.GA45906@troutmask.apl.washington.edu>
 <20170425173542.S1401@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170425173542.S1401@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 26 Apr 2017 03:23:58 -0000

On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote:
> On Mon, 24 Apr 2017, Steve Kargl wrote:
> 
> > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
> 
> >> [for expl]
> >> This is arranged to look a bit like Horner's method at first (operations
> >> not written in separate statements), but it is basically Estrin's method
> >> with no extensions to more complicated sub-expressions than Estrin's,
> >> but complications for accuracy:
> >> - I think accuracy for the B3 term is unimportant, but for some reason
> >>    (perhaps just the one in the comment) it is grouped separately.  Estrin
> >>    might use x3 * (B3.e + x * B4.e).
> >
> > Yes, it is most likely an accuracy issue.  I found something
> > similar with sinpi/cospi.  My k_sinpi.c has
> >
> > #if 0
> >        double x2;
> >        /* Horner's method. */
> >        x2 = x * x;
> >        sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
> >            + s2) + s1;
> 
> Horner's of course has about as many dependencies as operations (13), so
> on modern x86 this is limited by the latency of 3-5 cycles/instructions
> or 39-65 cycles.  Multiplications take an extra cycle in more cases, so
> it is better to have more additions, but unfortunately Horner gives 1
> more multuplication.
> 
> With no dependencies, the limit would be throughput -- 2 instructions/cycle
> of 6.5 cycles rounded up to 7; then add latency for the last operation to
> get about 11.
> 
> With practical dependencies, it is hard to get more than a 2-fold speedup.
> 
> > #else
> >        double x2, x4;
> >        /* Sort of Estrin's method. */
> >        x2 = x * x;
> >        x4 = x2 * x2;
> >        sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
> >            + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> > #endif
> 
> This looks OK.  The only obvious improvement is to try grouping x4*x4
> so that it can be evaluated separately.  This reduces the length of
> dependency chains by 1, but increases early pressure on multiplication.
> Don't write this as x8 = x4*x4.  Compilers will do the equivalent.
> You can also replace x4 by (x2*x2) but there are many x4's so this
> might be less readable.

The grouping of (x4*x4) and (x2 * x4) in other polynomials indeed
give small improvements.  I've reworked the approximations in 
sinpi[l], cospi[l], and tanpi[l].  I did not see any improves with
the float versions as the order of the polynomials are fairly small
(order of 4).


> > The addition of s1 must occur last, or I get greater 1 ULP
> > for a number of values.
> 
> Indeed.  This prevents using the best order, which is the opposite.
> You used the best order for the other inner terms, but not so obviously
> for the first outer addition.  However, the compiler is free to reorder
> that: consider:
> 
>  	T2 + T1 + T0
> 
> T2 is more complicated and everything depends on it, so should be started
> first, but possibly T1 can be started first because it is simpler.  Here
> it is not significantly simpler.  However, for:
> 
>  	T3 + T2 + T1 + T0
> 
> the natural left-to-right grouping is usually pessimial.  This should be
> written as:
> 
>  	T3 + (T2 + T1) + T0
> 
> Unfortunately, we can't write this naturally and usually optimally as:
> 
>  	T0 + T1 + T2 + T3
> 
> due to the accuracy problem with T0.  Maybe right this as:
> 
>  	T1 + T2 + T3 + T0	// only T0 in unnatural order
> 
> For expl, I tried all orders.

I tried a few different orders; particularly with the long double
functions.  I also managed to break the code a few times.  Thankfully,
I have everything in a subversion repository. 

> 
> > My timing show on Intel core2 duo
> >
> > /* Horner's methods. */
> > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.066 usec per call
> > cospi time: 0.063 usec per call
> >
> > /* Estrin's methods. */
> > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> > sinpi time: 0.059 usec per call
> > cospi time: 0.055 usec per call
> >
> > Here, -n 22 means to test 2**22 - 1 values in the interval
> > [0,0.25].
> 
> The improvement is marginal.   Only worth doing if it doesn't complicate
> the code much.  Older CPUs can't exploit te parallelism so well, so I
> would expect a relatively larger improvement.  Don't measure in usec,
> since they are hard to scale.  0.66 usec and 1.86 GHz is more than 100
> cycles.  This is quite slow.  I get similar times for cos and sin on an
> older Athlon-XP (~40 in float prec, ~160 in double and ~270 in long
> double).  Newer CPUs are about 4 times faster in double prec (only 2
> times fast in float prec) in cycles despite having similar nominal
> timing, by unclogging their pipes.

I'm assuming your 0.66 in the above is a typo as I wrote 0.066.  I also
got some improvement by changing the splitting of a few entities into
high and low parts by a simple cast.  Instead of doing

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;

I do

    xhi = (float)x;
    xlo = x - xhi;

but this seems to only help in some situation.  My new timing for
x in [0,0.25] on my 1995.05 MHz Core2 duo gives

Horner
sinpi time: 0.073 usec per call  (~146 cycles)
cospi time: 0.070 usec per call  (~140)
tanpi time: 0.112 usec per call  (~224)

Estrin
sinpi time: 0.064 usec per call  (~128)
cospi time: 0.060 usec per call  (~120)
tanpi time: 0.088 usec per call  (~176)

tanpi benefits the most as it has an order 16 polynomial, which
allows some freedom in group terms.

I don't think I can do much better.  For example, the approximation
for sinpi(x) is

sinpi(x) = x * (s0 + x2 * S(x2))

where S(x2) is the polynomial and I need to split x2 into high
and low parts and s0 is split.  The kernel with the Estrin method
and with the polynomial coefficients is

static inline double
__kernel_sinpi(double x)
{
	double hi, lo, sm, xhi, xlo;
	uint32_t hx, lx;

	double x2, x4;
	/* Sort of Estrin's method. */
	x2 = x * x;
	x4 = x2 * x2;
	sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;

	EXTRACT_WORDS(hx, lx, x);
	INSERT_WORDS(xhi, hx, 0);
	xlo = x - xhi;
	lo = xlo * (x + xhi) * sm;
	hi = xhi * xhi * sm;
	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;

	return (lo + xhi * s0lo + xhi * s0hi);
}

The order of terms in the last 5 lines matters.  Testing yields

        a =  2.4220192246482908e-01, /* 0x3fcf0078, 0xfc01e3f0 */
 sinpi(a) =  6.8957335930938313e-01, /* 0x3fe610fc, 0x264da72e */
sin(pi*a) =  6.8957335930938324e-01, /* 0x3fe610fc, 0x264da72f */
ULP: 0.79950446
         MAX ULP: 0.79950446
    Total tested: 8388606
1.0 < ULP       : 0
0.9 < ULP <= 1.0: 0
0.8 < ULP <= 0.9: 0
0.7 < ULP <= 0.8: 554
0.6 < ULP <= 0.7: 11275

> We should also try to use the expression A*x + B more often so that
> things with be faster when this is done by fma, but this happens
> naturally anyway.

Using A*x+B or B+A*x did not seem to matter.

-- 
Steve
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Thu Apr 27 23:14:17 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B33AED53837;
 Thu, 27 Apr 2017 23:14:17 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9AE9F6D4;
 Thu, 27 Apr 2017 23:14:17 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3RNEBps011407
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Thu, 27 Apr 2017 16:14:11 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3RNEBRS011406;
 Thu, 27 Apr 2017 16:14:11 -0700 (PDT) (envelope-from sgk)
Date: Thu, 27 Apr 2017 16:14:11 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-numerics@freebsd.org
Cc: freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170427231411.GA11346@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170409220809.GA25076@troutmask.apl.washington.edu>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Apr 2017 23:14:17 -0000

On Sun, Apr 09, 2017 at 03:08:09PM -0700, Steve Kargl wrote:
> Both IEEE-754 2008 and ISO/IEC TS 18661-4 define the half-cycle
> trignometric functions cospi, sinpi, and tanpi.  The attached
> patch implements cospi[fl], sinpi[fl], and tanpi[fl].  Limited
> testing on the cospi and sinpi reveal a max ULP less than 0.89;
> while tanpi is more problematic with a max ULP less than 2.01 
> in the interval [0,0.5].  The algorithms used in these functions
> are documented in {ks}_cospi.c, {ks}_sinpi.c, and s_tanpi.c.
> 
> Note 1.  ISO/IEC TS 18661-4 says these funstions are guarded by
> a predefine macro.  I have no idea or interest in what clang and
> gcc do with regards to this macro.  I've put the functions behind
> __BSD_VISIBLE.
> 
> Note 2.  I no longer have access to a system with ld128 and 
> adequate support to compile and test the ld128 implementations
> of these functions.  Given the almost complete lack of input from
> others on improvements to libm, I doubt that anyone cares.  If 
> someone does care, the ld128 files contain a number of FIXME comments,
> and in particular, while the polynomial coefficients are given
> I did not update the polynomial algorithms to properly use the
> coefficients.
> 
> The code is attached the bug reportr.
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514
> 

I have attached a new diff to the bugzilla report.  The 
diff is 3090 lines and won't be broadcast the mailing list.

This diff includes fixes for a few inconsequential bugs
and implements modified Estrin's method for sum a few
ploynomials.  If you want the previous Horner's method
then add -DHORNER to your CFLAGS.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 01:01:23 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id AF430D4DF87;
 Fri, 28 Apr 2017 01:01:23 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9A01B274;
 Fri, 28 Apr 2017 01:01:23 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3S11M2W012863
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Thu, 27 Apr 2017 18:01:22 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3S11MRW012862;
 Thu, 27 Apr 2017 18:01:22 -0700 (PDT) (envelope-from sgk)
Date: Thu, 27 Apr 2017 18:01:22 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-numerics@freebsd.org
Cc: freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428010122.GA12814@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 01:01:23 -0000

On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
> 
> I have attached a new diff to the bugzilla report.  The 
> diff is 3090 lines and won't be broadcast the mailing list.
> 
> This diff includes fixes for a few inconsequential bugs
> and implements modified Estrin's method for sum a few
> ploynomials.  If you want the previous Horner's method
> then add -DHORNER to your CFLAGS.
> 

For those curious about testing, here are some numbers
for the Estrin's method.  These were obtained on an AMD
FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
and the number in parentheses are estimated cycles.

            |    cospi     |    sinpi     |    tanpi
------------+--------------+--------------+--------------
float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
------------+--------------+--------------+--------------

The timing tests are done on the interval [0,0.25] as
this is the interval where the polynomial approximations
apply.  Limited accuracy testing gives

                 |   cospif   |   cospi    |  cospil
-----------------+------------+------------+-------------
         MAX ULP | 0.88242114 | 0.89339466 | 0.88799449
    Total tested | 16777214   | 16777214   | 16777214
0.8 < ULP <= 0.9 | 1613       | 1682       | 1601
0.7 < ULP <= 0.8 | 33925      | 33817      | 33843
0.6 < ULP <= 0.7 | 171799     | 172775     | 171899
-----------------+------------+------------+-------------

                 |   sinpif   |   sinpi    |  sinpil
-----------------+------------+------------+-------------
         MAX ULP | 0.80103672 | 0.79851085 | 0.77049407
    Total tested | 16777214   | 16777214   | 16777214
0.8 < ULP <= 0.9 | 1          | 0          | 0
0.7 < ULP <= 0.8 | 727        | 1075       | 546
0.6 < ULP <= 0.7 | 19268      | 22722      | 19019
-----------------+------------+------------+-------------

 x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
-----------------+------------+------------+-------------
         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823
    Total tested | 16777214   | 16777214   | 16777214
1.0 < ULP        | 39109      | 38712      | 38798
0.9 < ULP <= 1.0 | 130373     | 131086     | 130803
0.8 < ULP <= 0.9 | 342624     | 341350     | 341155
0.7 < ULP <= 0.8 | 687244     | 687122     | 686814
0.6 < ULP <= 0.7 | 1078964    | 1079881    | 1078555
-----------------+------------+------------+-------------

In the interval [0.25,0.5] tanpi[fl] is computed by 
cospi / sinpi.  The numbers look like 

 x in [0.25,0.5] |   tanpif   |   tanpi    |   tanpil
-----------------+------------+------------+-------------
         MAX ULP | 1.93529165 | 2.04485533 | 1.95823689
    Total tested | 16777215   | 16777215   | 16777215
1.0 < ULP        | 699310     | 701244     | 698946
0.9 < ULP <= 1.0 | 442978     | 443344     | 443599
0.8 < ULP <= 0.9 | 642991     | 642561     | 642601
0.7 < ULP <= 0.8 | 889357     | 890307     | 889233
0.6 < ULP <= 0.7 | 1176473    | 1177407    | 1177121
-----------------+------------+------------+-------------

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 04:18:51 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 542DDD53230;
 Fri, 28 Apr 2017 04:18:51 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 36C457B5;
 Fri, 28 Apr 2017 04:18:51 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3S4In0M013553
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Thu, 27 Apr 2017 21:18:49 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3S4Intr013552;
 Thu, 27 Apr 2017 21:18:49 -0700 (PDT) (envelope-from sgk)
Date: Thu, 27 Apr 2017 21:18:49 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-numerics@freebsd.org
Cc: freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428041849.GA13544@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 04:18:51 -0000

On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
> > 
> > The code is attached the bug reportr.
> > 
> > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514
> > 
> 
> I have attached a new diff to the bugzilla report.  The 
> diff is 3090 lines and won't be broadcast the mailing list.
> 
> This diff includes fixes for a few inconsequential bugs
> and implements modified Estrin's method for sum a few
> ploynomials.  If you want the previous Horner's method
> then add -DHORNER to your CFLAGS.
> 

Grrrr.  Find a sloppy theshold can be fun.

Index: src/s_cospif.c
===================================================================
--- src/s_cospif.c	(revision 1916)
+++ src/s_cospif.c	(working copy)
@@ -61,7 +61,7 @@
 	SET_FLOAT_WORD(ax, ix);
 
 	if (ix < 0x3f800000) {			/* |x| < 1 */
-		if (ix < 0x39000000) {		/* |x| < 0x1p-13 */
+		if (ix < 0x38800000) {		/* |x| < 0x1p-14 */
 			if (huge + ax > 0)	/* Raise inexact iff != 0. */
 				return (1);
 		}

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 08:00:26 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id C6440D51F54
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Fri, 28 Apr 2017 08:00:26 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au
 [211.29.132.246])
 by mx1.freebsd.org (Postfix) with ESMTP id 928321047
 for <freebsd-numerics@freebsd.org>; Fri, 28 Apr 2017 08:00:26 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 7A00D428223;
 Fri, 28 Apr 2017 18:00:16 +1000 (AEST)
Date: Fri, 28 Apr 2017 18:00:19 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu>
Message-ID: <20170428175851.A1386@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=kwgiUtXkrd9uoIphUegA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 08:00:26 -0000

[cc list trimmed]

On Thu, 27 Apr 2017, Steve Kargl wrote:

> I have attached a new diff to the bugzilla report.  The
> diff is 3090 lines and won't be broadcast the mailing list.

But no one will see it there.

Bruce

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 08:55:18 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 76190D5279F;
 Fri, 28 Apr 2017 08:55:18 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au
 [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id 15EAFEA0;
 Fri, 28 Apr 2017 08:55:17 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id ED08AD4A8E5;
 Fri, 28 Apr 2017 18:33:57 +1000 (AEST)
Date: Fri, 28 Apr 2017 18:33:54 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170428041849.GA13544@troutmask.apl.washington.edu>
Message-ID: <20170428180046.P1386@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428041849.GA13544@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=6I5d2MoRAAAA:8 a=uopPRt1wpuTJ01loTfUA:9
 a=44QarpAT_w9gG2nY:21 a=cCMfiCuOYRyRPczO:21 a=CjuIK1q_8ugA:10
 a=IjZwj45LgO3ly-622nXo:22
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 08:55:18 -0000

On Thu, 27 Apr 2017, Steve Kargl wrote:

> On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
>>>
>>> The code is attached the bug reportr.
>>>
>>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514
>>
>> I have attached a new diff to the bugzilla report.  The
>> diff is 3090 lines and won't be broadcast the mailing list.
> ...
> Grrrr.  Find a sloppy theshold can be fun.
>
> Index: src/s_cospif.c
> ===================================================================
> --- src/s_cospif.c	(revision 1916)
> +++ src/s_cospif.c	(working copy)
> @@ -61,7 +61,7 @@
> 	SET_FLOAT_WORD(ax, ix);
>
> 	if (ix < 0x3f800000) {			/* |x| < 1 */
> -		if (ix < 0x39000000) {		/* |x| < 0x1p-13 */
> +		if (ix < 0x38800000) {		/* |x| < 0x1p-14 */
> 			if (huge + ax > 0)	/* Raise inexact iff != 0. */
> 				return (1);
> 		}

This looks too different from s_cosf.c:

XX 	if(ix <= 0x3f490fda) {		/* |x| ~<= pi/4 */
XX 	    if(ix<0x39800000)		/* |x| < 2**-12 */
XX 		if(((int)x)==0) return 1.0;	/* 1 with inexact if x != 0 */
XX 	    return __kernel_cosdf(x);
XX 	}

The (int)x == 0 method is perhaps not best (especially not the excessive
parentheses used in it), but it was verified to be faster than other
methods.  This might be a matter of tricking the compiler into not
optimizing the unusual case.  (__predict ugliness doesn't work any better.)

Non-formatting style bugs include:
- arguably worse spelling of powers of 2 in comments
- less information in the comment about inexact.  I try to change all
   similar comments to be consistent whenever I change their code.
   s_cos.c still has a variant with less detail "/* generate inexact".

s_sinl.c is still broken by your not copying the fdlibm code:

XX 	/* If x = +-0 or x is a subnormal number, then sin(x) = x */
XX 	if (z.bits.exp == 0)
XX 		return (x);

This is missing not only the comment about setting inexact, but also
the code.  This is also poorly optimized.

XX 	/* Optimize the case where x is already within range. */
XX 	if (z.e < M_PI_4) {
XX 		hi = __kernel_sinl(z.e, 0, 0);
XX 		RETURNI(s ? -hi : hi);
XX 	}

This is is considerably more broken.  It underflows for small x.  Evaluation
of polynomials always depends on not doing it for tiny x, since evaluation
of higher terms always underflows for tiny x even when x is not denormal
(except when the poly is linear, the T1*x term doesn't underflow if either
T1 is 0 or T1 is larger than about 1 and x is not denormal).

Filtering out tiny x is just an optimization, but prevents this underflow.
It sometimes also gives free tests for 0 and denormals (I don't see how
to fold this test into the others), and free optimizations for tiny x.  It
does do this for sin(x), provided we don't support non-standard rounding
modes.  The correct value is just x with inexact if x != 0.

s_cosl.c has the same bug.  I sent you preliminary patches in 2008 when
the long double trig functions were committed, but left fixing these bugs
as an exercise.  Actually, I thought I fixed this in cosl() and only left
sinl() as an exercise.

XX diff -u2 s_cosl.c~ s_cosl.c
XX --- s_cosl.c~	Thu May 30 18:17:21 2013
XX +++ s_cosl.c	Tue Sep  6 03:43:57 2016
XX @@ -55,21 +55,17 @@
XX  	long double y[2];
XX  	long double hi, lo;
XX +	int ex;
XX 
XX  	z.e = x;
XX -	z.bits.sign = 0;
XX -
XX -	/* If x = +-0 or x is a subnormal number, then cos(x) = 1 */
XX -	if (z.bits.exp == 0)
XX -		return (1.0);
XX -
XX -	/* If x = NaN or Inf, then cos(x) = NaN. */
XX -	if (z.bits.exp == 32767)
XX -		return ((x - x) / (x - x));
XX +	ex = z.bits.exp;

Mostly style and minor efficiency fixes.

XX 
XX  	ENTERI();
XX 
XX -	/* Optimize the case where x is already within range. */
XX -	if (z.e < M_PI_4)
XX -		RETURNI(__kernel_cosl(z.e, 0));
XX +	if (ex < 0x3ffe || (ex == 0x3ffe && z.bits.manh < 0xc90fdaa2))
XX +		RETURNI(__kernel_cosl(x, 0));

Efficiency fix for the comparison.

But no fix for underflow, etc.

XX +
XX +	/* If x = NaN or Inf, then cos(x) = NaN. */
XX +	if (ex == 32767)
XX +		RETURNI(x - x);
XX 
XX  	e0 = __ieee754_rem_pio2l(x, y);

ISTR breaking s_sinf.c and/or s_cosf.c with the invalid optimization of
just evaluating the polynomial, to avoid the branch for the
micro-optimization.  But it is not juste for an optimization.

Bruce

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 09:39:45 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 75D16D53FF8;
 Fri, 28 Apr 2017 09:39:45 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 0CEC0ECC;
 Fri, 28 Apr 2017 09:39:44 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 5551B1048F3C;
 Fri, 28 Apr 2017 19:13:17 +1000 (AEST)
Date: Fri, 28 Apr 2017 19:13:16 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170428010122.GA12814@troutmask.apl.washington.edu>
Message-ID: <20170428183733.V1497@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=HgMA7pdRT-D6GSaveAAA:9 a=bWHRhd8LnqQueheO:21
 a=6RyRNACK7q-QfLbh:21 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 09:39:45 -0000

On Thu, 27 Apr 2017, Steve Kargl wrote:

> On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
>>
>> I have attached a new diff to the bugzilla report.  The
>> diff is 3090 lines and won't be broadcast the mailing list.
>>
>> This diff includes fixes for a few inconsequential bugs
>> and implements modified Estrin's method for sum a few
>> ploynomials.  If you want the previous Horner's method
>> then add -DHORNER to your CFLAGS.
>
> For those curious about testing, here are some numbers
> for the Estrin's method.  These were obtained on an AMD
> FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
> and the number in parentheses are estimated cycles.
>
>            |    cospi     |    sinpi     |    tanpi
> ------------+--------------+--------------+--------------
> float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
> double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
> ------------+--------------+--------------+--------------
>
> The timing tests are done on the interval [0,0.25] as
> this is the interval where the polynomial approximations
> apply.  Limited accuracy testing gives

These still seem slow.  I did a quick test of naive evaluations of
these functions as standard_function(Pi * x) and get times a bit faster
on Haswell, except 2-4 times faster for the long double case, with the
handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
is inaccurate, especially near multiples of Pi/2.

> x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
> -----------------+------------+------------+-------------
>         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823

Just use the naive evaluation to get similar errors in this
range.  It is probably faster too.  For tiny x, both reduce
to the approximation Pi*x, with an error like this expected
unless the evaluation is done in extra precision.

> In the interval [0.25,0.5] tanpi[fl] is computed by
> cospi / sinpi.  The numbers look like
>
> x in [0.25,0.5] |   tanpif   |   tanpi    |   tanpil
> -----------------+------------+------------+-------------
>         MAX ULP | 1.93529165 | 2.04485533 | 1.95823689

The errors build up only linearly in the number of operations,
which is good.

Note that on i386 with its extended precision, in float precision
the naive method is accurate to nearly 0.5 ulps provided you use
extended precision for Pi, the multiplication, and also the function,
so sinpif() is only worth having if it can do this almost as fast
as sinf() (about 15 cycles throughput and less than 100 latency
(50?) on modern x86).  The extra precision is used automatically
by sinf() (by using a double hack.  Double is not very different
from float+extended on i386).  I think accuracy is enough up to extend
float precision up to a useful multiple of Pi (suppose double precision
and not full extended, so only 53 bits for Pi, so 29 extra; lose 24
to cancelations and 5 are left, so the accuracy is enough up to about
2**5*Pi).

Bruce

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 16:56:59 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A4CE3D549E2;
 Fri, 28 Apr 2017 16:56:59 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 859451613;
 Fri, 28 Apr 2017 16:56:59 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SGuwdl017872
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 28 Apr 2017 09:56:58 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SGuwdV017871;
 Fri, 28 Apr 2017 09:56:58 -0700 (PDT) (envelope-from sgk)
Date: Fri, 28 Apr 2017 09:56:58 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428165658.GA17560@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170428183733.V1497@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 16:56:59 -0000

On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
> On Thu, 27 Apr 2017, Steve Kargl wrote:
> 
> > On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote:
> >>
> >> I have attached a new diff to the bugzilla report.  The
> >> diff is 3090 lines and won't be broadcast the mailing list.
> >>
> >> This diff includes fixes for a few inconsequential bugs
> >> and implements modified Estrin's method for sum a few
> >> ploynomials.  If you want the previous Horner's method
> >> then add -DHORNER to your CFLAGS.
> >
> > For those curious about testing, here are some numbers
> > for the Estrin's method.  These were obtained on an AMD
> > FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
> > and the number in parentheses are estimated cycles.
> >
> >            |    cospi     |    sinpi     |    tanpi
> > ------------+--------------+--------------+--------------
> > float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
> > double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
> > long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
> > ------------+--------------+--------------+--------------
> >
> > The timing tests are done on the interval [0,0.25] as
> > this is the interval where the polynomial approximations
> > apply.  Limited accuracy testing gives
> 
> These still seem slow.  I did a quick test of naive evaluations of
> these functions as standard_function(Pi * x) and get times a bit faster
> on Haswell, except 2-4 times faster for the long double case, with the
> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
> is inaccurate, especially near multiples of Pi/2.

The slowness comes from handling extra precision arithmetic.  For
sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
approximation and x2 = x * x.  One needs to take care in the ordering
on the evaluation where x and x2 are split into high and low parts.
See the source code posted in response to your other email.  

> > x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
> > -----------------+------------+------------+-------------
> >         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823
> 
> Just use the naive evaluation to get similar errors in this
> range.  It is probably faster too.

I haven't checked tanpif, but naive evaluation is faster
for sinpif(x).  But getting the worry answer fast seems
to be counter to what one may expect from a math library.
Consider the two naive implementations:

float
bde1(float x)
{
   return (sinf((float)M_PI * x));
}

float
bde2(float x)
{
   return (sinf((float)(M_PI * x)));
}

float
bde3(float x)
{
   return ((float)sin(M_PI * x));
}

x in [0,0.25]    |   sinpif    |    bde1(x)  |    bde2(x)   |   bde3(x)
-----------------+-------------+-------------+--------------+------------
Time usec (cyc)  | 0.0130 (54) | 0.0084 (35) | 0.0093 (38)  | 0.0109 (45)
         MAX ULP | 0.80103672  | 1.94017851  | 1.46830785   | 0.5
1.0 < ULP        | 0           | 736496      | 47607        | 0
0.9 < ULP <= 1.0 | 0           | 563122      | 101162       | 0
0.8 < ULP <= 0.9 | 1           | 751605      | 386128       | 0
0.7 < ULP <= 0.8 | 727         | 942349      | 753647       | 0
0.6 < ULP <= 0.7 | 19268       | 1169719     | 1123518      | 0

x in [3990,3992] |   sinpif    |   bde1(x)   |   bde2(x)   |   bde3(x)
-----------------+-------------+-------------+-------------+------------
Time usec (cyc)  | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87)
         MAX ULP | 0.65193975  | 10458803.   | 6471461.    | 0.5
1.0 < ULP        | 0           | 16773117    | 16762878    | 0
0.9 < ULP <= 1.0 | 0           | 0           | 0           | 0
0.8 < ULP <= 0.9 | 0           | 0           | 0           | 0
0.7 < ULP <= 0.8 | 0           | 0           | 0           | 0
0.6 < ULP <= 0.7 | 19268       | 2047        | 2047        | 0

So bde3(x) is best, but if you're going to use sin(M_PI*x) for
float sinpif, then use sinpi((double)x) which is faster than bde3
and just as accurate.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 17:05:29 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1EAE8D54C27
 for <freebsd-numerics@mailman.ysv.freebsd.org>;
 Fri, 28 Apr 2017 17:05:29 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id E60081A9F
 for <freebsd-numerics@freebsd.org>; Fri, 28 Apr 2017 17:05:28 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SH5S9J017958
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 28 Apr 2017 10:05:28 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SH5SDB017957;
 Fri, 28 Apr 2017 10:05:28 -0700 (PDT) (envelope-from sgk)
Date: Fri, 28 Apr 2017 10:05:28 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428170528.GD17560@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428175851.A1386@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170428175851.A1386@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 17:05:29 -0000

On Fri, Apr 28, 2017 at 06:00:19PM +1000, Bruce Evans wrote:
> 
> On Thu, 27 Apr 2017, Steve Kargl wrote:
> 
> > I have attached a new diff to the bugzilla report.  The
> > diff is 3090 lines and won't be broadcast the mailing list.
> 
> But no one will see it there.
> 

Here's the diff in its full glory (without the sloppy
threshold fix).  There no signature at the end.

Index: lib/msun/Makefile
===================================================================
--- lib/msun/Makefile	(revision 317529)
+++ lib/msun/Makefile	(working copy)
@@ -59,6 +59,7 @@
 	s_asinh.c s_asinhf.c s_atan.c s_atanf.c s_carg.c s_cargf.c s_cargl.c \
 	s_cbrt.c s_cbrtf.c s_ceil.c s_ceilf.c \
 	s_copysign.c s_copysignf.c s_cos.c s_cosf.c \
+	s_cospi.c s_cospif.c \
 	s_csqrt.c s_csqrtf.c s_erf.c s_erff.c \
 	s_exp2.c s_exp2f.c s_expm1.c s_expm1f.c s_fabsf.c s_fdim.c \
 	s_finite.c s_finitef.c \
@@ -74,7 +75,10 @@
 	s_rint.c s_rintf.c s_round.c s_roundf.c \
 	s_scalbln.c s_scalbn.c s_scalbnf.c s_signbit.c \
 	s_signgam.c s_significand.c s_significandf.c s_sin.c s_sinf.c \
-	s_tan.c s_tanf.c s_tanh.c s_tanhf.c s_tgammaf.c s_trunc.c s_truncf.c \
+	s_sinpi.c s_sinpif.c \
+	s_tan.c s_tanf.c s_tanh.c s_tanhf.c \
+	s_tanpi.c s_tanpif.c \
+	s_tgammaf.c s_trunc.c s_truncf.c \
 	w_cabs.c w_cabsf.c w_drem.c w_dremf.c
 
 # Location of fpmath.h and _fpmath.h
@@ -100,11 +104,16 @@
 	e_lgammal.c e_lgammal_r.c \
 	e_remainderl.c e_sinhl.c e_sqrtl.c \
 	invtrig.c k_cosl.c k_sinl.c k_tanl.c \
-	s_asinhl.c s_atanl.c s_cbrtl.c s_ceill.c s_cosl.c s_cprojl.c \
+	s_asinhl.c s_atanl.c s_cbrtl.c s_ceill.c s_cosl.c \
+	s_cospil.c \
+	s_cprojl.c \
 	s_csqrtl.c s_erfl.c s_exp2l.c s_expl.c s_floorl.c s_fmal.c \
 	s_fmaxl.c s_fminl.c s_frexpl.c s_logbl.c s_logl.c s_nanl.c \
 	s_nextafterl.c s_nexttoward.c s_remquol.c s_rintl.c s_roundl.c \
-	s_scalbnl.c s_sinl.c s_tanhl.c s_tanl.c s_truncl.c w_cabsl.c
+	s_scalbnl.c s_sinl.c \
+	s_sinpil.c \
+	s_tanhl.c s_tanl.c \
+	s_tanpil.c s_truncl.c w_cabsl.c
 .endif
 
 # C99 complex functions
@@ -131,13 +140,19 @@
 
 MAN=	acos.3 acosh.3 asin.3 asinh.3 atan.3 atan2.3 atanh.3 \
 	ceil.3 cacos.3 ccos.3 ccosh.3 cexp.3 \
-	cimag.3 copysign.3 cos.3 cosh.3 csqrt.3 erf.3 exp.3 fabs.3 fdim.3 \
+	cimag.3 copysign.3 cos.3 cosh.3 \
+	cospi.3 \
+	csqrt.3 erf.3 exp.3 fabs.3 fdim.3 \
 	feclearexcept.3 feenableexcept.3 fegetenv.3 \
 	fegetround.3 fenv.3 floor.3 \
 	fma.3 fmax.3 fmod.3 hypot.3 ieee.3 ieee_test.3 ilogb.3 j0.3 \
 	lgamma.3 log.3 lrint.3 lround.3 math.3 nan.3 \
 	nextafter.3 remainder.3 rint.3 \
-	round.3 scalbn.3 signbit.3 sin.3 sinh.3 sqrt.3 tan.3 tanh.3 trunc.3 \
+	round.3 scalbn.3 signbit.3 sin.3 sinh.3 \
+	sinpi.3 \
+	sqrt.3 tan.3 tanh.3 \
+	tanpi.3 \
+	trunc.3 \
 	complex.3
 
 MLINKS+=acos.3 acosf.3 acos.3 acosl.3
@@ -166,6 +181,7 @@
 MLINKS+=copysign.3 copysignf.3 copysign.3 copysignl.3
 MLINKS+=cos.3 cosf.3 cos.3 cosl.3
 MLINKS+=cosh.3 coshf.3 cosh.3 coshl.3
+MLINKS+=cospi.3 cospif.3 cospi.3 cospil.3
 MLINKS+=csqrt.3 csqrtf.3 csqrt.3 csqrtl.3
 MLINKS+=erf.3 erfc.3 erf.3 erff.3 erf.3 erfcf.3 erf.3 erfl.3 erf.3 erfcl.3
 MLINKS+=exp.3 expm1.3 exp.3 expm1f.3 exp.3 expm1l.3 exp.3 pow.3 exp.3 powf.3 \
@@ -216,10 +232,12 @@
 MLINKS+=scalbn.3 scalbnf.3 scalbn.3 scalbnl.3
 MLINKS+=sin.3 sinf.3 sin.3 sinl.3
 MLINKS+=sinh.3 sinhf.3 sinh.3 sinhl.3
+MLINKS+=sinpi.3 sinpif.3 sinpi.3 sinpil.3
 MLINKS+=sqrt.3 cbrt.3 sqrt.3 cbrtf.3 sqrt.3 cbrtl.3 sqrt.3 sqrtf.3 \
 	sqrt.3 sqrtl.3
 MLINKS+=tan.3 tanf.3 tan.3 tanl.3
 MLINKS+=tanh.3 tanhf.3 tanh.3 tanhl.3
+MLINKS+=tanpi.3 tanpif.3 tanpi.3 tanpil.3
 MLINKS+=trunc.3 truncf.3 trunc.3 truncl.3
 
 .include <src.opts.mk>
Index: lib/msun/Symbol.map
===================================================================
--- lib/msun/Symbol.map	(revision 317529)
+++ lib/msun/Symbol.map	(working copy)
@@ -294,4 +294,13 @@
 	casinl;
 	catanl;
 	catanhl;
+	cospi;
+	cospif;
+	cospil;
+	sinpi;
+	sinpif;
+	sinpil;
+	tanpi;
+	tanpif;
+	tanpil;
 };
Index: lib/msun/ld128/k_cospil.c
===================================================================
--- lib/msun/ld128/k_cospil.c	(nonexistent)
+++ lib/msun/ld128/k_cospil.c	(working copy)
@@ -0,0 +1,73 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_cospi.c for implementation details.
+ *
+ * FIXME:  This has not been compiled nor has it been tested for accuracy.
+ * FIXME:  This should use bit twiddling.
+ */
+
+static const long double
+c0   =  1,
+c1hi = -4.93480220054467930941724549993807541e+00L,
+c1lo = -1.61159244006703024615160135944400084e-34L,
+c02  =  4.05871212641676821818501386202937963e+00L,
+c03  = -1.33526276885458949587530478285057448e+00L,
+c04  =  2.35330630358893204541879352772399571e-01L,
+c05  = -2.58068913900140600125982936922503890e-02L,
+c06  =  1.92957430940392304790328456072058283e-03L,
+c07  = -1.04638104924845707113757250204498056e-04L,
+c08  =  4.30306958703294680806457740580669010e-06L,
+c09  = -1.38789524622131344829253970500107607e-07L,
+c10  =  3.60473079732244660718578413933676910e-09L,
+c11  = -7.70070692297072534641720608808474790e-11L,
+c12  =  1.37684487309360112303144478033041762e-12L,
+c13  = -2.07957304470476734518573046614539573e-14L;
+
+static inline long double
+__kernel_cospi(long double x)
+{
+	union IEEEl2bits u;
+	long double c, chi, clo, x2, x2lo;
+	double x2hi;
+
+	x2 = x * x;
+
+	c = c09 + (c10 + (c11 + (c12 + c13 * x2) * x2) * x2) * x2;
+	c = c05 + (c06 + (c07 + (c08 + c   * x2) * x2) * x2) * x2;
+	c =       (c02 + (c03 + (c04 + c   * x2) * x2) * x2) * x2 + c1hi;
+
+	if (x2 < 0x1p-6)		/* |x| < 0x1p-6 */
+		return (c0 + c * x2);
+
+	x2hi = x2;
+	x2lo = x2 - x2hi;
+
+	c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);
+
+	return (c);
+}

Property changes on: lib/msun/ld128/k_cospil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld128/k_sinpil.c
===================================================================
--- lib/msun/ld128/k_sinpil.c	(nonexistent)
+++ lib/msun/ld128/k_sinpil.c	(working copy)
@@ -0,0 +1,73 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_sinpi.c for implementation details.
+ *
+ * FIXME:  This has not been compiled nor has it been tested for accuracy.
+ * FIXME:  This should use bit twiddling.
+ *
+ * Compute sinpi(x) = sin(pi*x) in the interval [0,0.25]
+ */
+
+static const long double 
+s0lo =  3.14159265358979322702026593105983920e+00L,
+s0md =  1.14423774522196636802274144823600323e-17L,
+s0lo = -5.10459839933399352423791604618557491e-52L,
+s01  = -5.16771278004997002924605251118356578e+00L,
+s02  =  2.55016403987734544385617758369521912e+00L,
+s03  = -5.99264529320792076887739383518724498e-01L,
+s04  =  8.21458866111282287988023605285125252e-02L,
+s05  = -7.37043094571435077725854643888422319e-03L,
+s06  =  4.66302805767612564382677987452304496e-04L,
+s07  = -2.19153534478302140522509092972929589e-05L,
+s08  =  7.95205400147494466063630770553231865e-07L,
+s09  = -2.29484289960219849488354905723109270e-08L,
+s10  =  5.39266447760508581521426149354268607e-10L,
+s11  = -1.05182947998557665557194979309998767e-11L,
+s12  =  1.72036427603492093074818577722479034e-13L;
+
+static inline long double
+__kernel_sinpi(long double x)
+{
+	union IEEEl2bits u;
+	long double hi, lo, sm, x2, xlo;
+	double xhi;
+
+	/* Split x into high and low parts. */
+	xhi = x;
+	xlo = x - xhi;
+
+	x2 = x * x;
+	sm = x2 * (x2 * (x2 * (x2 * s12 + s11) + s10) + s09) + s08;
+	sm = x2 * (x2 * (x2 * (x2 * sm  + s07) + s06) + s05) + s04;
+	sm = x2 * (x2 * (x2 * sm  + s03) + s02) + s01;
+	lo = xlo * (x + xhi) * sm;
+	hi = sm * xhi * xhi;
+	lo = x * (lo + s0lo) + xlo * (s0hi + s0md) + x * hi;
+
+	return (lo + xhi * s0md + xhi * s0hi);
+}

Property changes on: lib/msun/ld128/k_sinpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld128/s_cospil.c
===================================================================
--- lib/msun/ld128/s_cospil.c	(nonexistent)
+++ lib/msun/ld128/s_cospil.c	(working copy)
@@ -0,0 +1,98 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_cospi.c for implementation details.
+ *
+ * FIXME:  This has not been compiled nor has it been tested for accuracy.
+ * FIXME:  This should use bit twiddling.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospi.c"
+#include "k_sinpi.c"
+
+static inline long double
+__compute_cospil(long double x)
+{
+
+	if (x < 0.5) {
+		if (x < 0.25)
+			return (__kernel_cospil(x));
+		return (__kernel_sinpil(0.5 - x));
+	} else {
+		if (x < 0.75)
+			return (-__kernel_sinpil(x - 0.5));
+		return (-__kernel_cospil(1 - x));
+	}
+}
+
+long double
+cospil(long double x)
+{
+	const double huge = 1e+300;
+	long double ax, xf;
+	uint32_t ix;
+
+	ax = fabsl(x);
+
+	if (ax < 1) {			/* |x| < 1 */
+		if (ax < 0x1p-60) {	/* |x| < 0x1p-60 */
+			if (huge + x > 0)
+				return (1);
+		}
+		if (ax == 0.5)
+			return (0);
+		return (__compute_cospil(ax));
+	}
+
+	if (ax < 0x1p112) {		/* 1 <= |x| < 0x1p112 */
+
+		xf = floorl(ax);
+
+		ax -= x;
+		if (ax == 0.5)
+			return (0);
+
+		ax = ax == 0 ? 1 : __compute_cospil(ax);
+
+		if (xf > 0x1p50) xf -= 0x1p50;
+		if (xf > 0x1p30) xf -= 0x1p30;
+		ix = (uint32_t)xf;
+
+		return (ix & 1 ? -ax : ax);
+	}
+
+	if (isinf(x) || isnan(x))
+		return (x - x);
+
+	/*
+	 * |x| >= 0x1p112 is always an even integer, so return 1.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	return (1);
+}

Property changes on: lib/msun/ld128/s_cospil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld128/s_sinpil.c
===================================================================
--- lib/msun/ld128/s_sinpil.c	(nonexistent)
+++ lib/msun/ld128/s_sinpil.c	(working copy)
@@ -0,0 +1,100 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_sinpi.c for implementation details.
+ *
+ * FIXME:  This has not been compiled nor has it been tested for accuracy.
+ * FIXME:  This should use bit twiddling.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospil.c"
+#include "k_sinpil.c"
+
+static const long double pihi = 3.14159265358979322702026593105983920e+00L;
+static const long double pilo = 1.14423774522196636802434264184180742e-17L;
+
+static inline long double
+__compute_sinpil(long double x)
+{
+
+	if (x < 0.5) {
+		if (x <= 0.25)
+			return (__kernel_sinpil(x));
+		return (__kernel_cospil(0.5 - x));
+	} else {
+		if (x < 0.75)
+			return (__kernel_cospil(x - 0.5));
+		return (__kernel_sinpil(1 - x));
+	}
+}
+
+long double
+sinpi(long double x)
+{
+	long double ax, xf;
+	uint32_t ix;
+
+	ax = fabsl(x);
+
+	if (ax < 1) {			/* |x| < 1 */
+		if (ax < 0x1p-60) {	/* |x| < 0x1p-60 */
+			if (x == 0)
+				return (x);
+			return (pilo * x + pihi * x);
+		}
+		ax = __compute_sinpil(ax);
+		return (copysignl(ax, x));
+	}
+
+	if (ax < 0x1p112) {		/* 1 <= |x| < 0x1p112 */
+
+		xf = floorl(ax);
+
+		ax -= xf;
+		if (ax == 0) {
+			ax = 0;
+		} else {
+			ax = __compute_sinpil(ax);
+			if (xf > 0x1p50) xf -= 0x1p50;
+			if (xf > 0x1p30) xf -= 0x1p30;
+			ix = (uint32_t)xf;
+			if (ix & 1) ax = -ax;
+		}
+		return (copysignl(ax, x));
+	}
+
+	if (isinf(x) || isnan(x))
+		return (x - x);
+
+	/*
+	 * |x| >= 0x1p112 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	return (copysignl(0, x));
+}

Property changes on: lib/msun/ld128/s_sinpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld128/s_tanpil.c
===================================================================
--- lib/msun/ld128/s_tanpil.c	(nonexistent)
+++ lib/msun/ld128/s_tanpil.c	(working copy)
@@ -0,0 +1,98 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_tanpi.c for implementation details.
+ *
+ * FIXME: This has not been compiled nor has it been tested for accuracy.
+ * FIXME: This should use bit twiddling.
+ * FIXME: This should use a polynomial approximation in [0,0.25], but the
+ * FIXME: increase in max ULP is probably small, so who cares.
+ */
+
+#include "math.h"
+#include "math_private.h"
+
+static const long double pihi = 3.14159265358979322702026593105983920e+00L;
+static const long double pilo = 1.14423774522196636802434264184180742e-17L;
+
+static inline long double
+__k_tanpi(long double x)
+{
+
+	return (sinpil(x) / cospil(x));
+}
+
+static inline long double
+__compute_tanpi(long double x)
+{
+
+	return (x < 0.5 ? __k_tanpil(x) : -__k_tanpil(1 - x));
+}
+
+long double
+tanpil(long double x)
+{
+	long double ax, xf;
+	uint32_t ix;
+
+	ax = fabsl(ax);
+
+	if (ax < 1) {			/* |x| < 1 */
+		if (ax < 0x1p-60) {	/* |x| < 0x1p-60 */
+			if (x == 0)
+				return (x);
+			return (pilo * x + pihi * x);
+		}
+		if (ax == 0.5)
+			return ((x - x) / (x - x));
+		ax = __compute_tanpil(ax);
+		return (copysignl(ax, x));
+	}
+
+	if (ix < 0x1p112) {		/* 1 <= |x| < 0x1p112 */
+
+		xf = floorl(ax);
+
+		ax -= xf;
+
+		if (ax == 0.5)
+			return ((x - x) / (x - x));
+
+		ax = ax == 0 ? 0 : __compute_tanpil(ax);
+		return (copysignl(ax, x));
+	}
+
+	/* x = +-inf or nan. */
+	if (isinf(x) || isnan(x))
+		return (x - x);
+
+	/*
+	 * |x| >= 0x1p53 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	return (copysignl(0, x));
+}

Property changes on: lib/msun/ld128/s_tanpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld80/k_cospil.c
===================================================================
--- lib/msun/ld80/k_cospil.c	(nonexistent)
+++ lib/msun/ld80/k_cospil.c	(working copy)
@@ -0,0 +1,78 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_cospi.c for implementation details.
+ */
+
+static const union IEEEl2bits
+c1hiu = LD80C(0x9de9e64df22ef2d2,   2, -4.93480220054467930927e+00L),
+c1lou = LD80C(0xadc4baf0dd791c3e, -63, -1.47187625629472133798e-19L),
+c2u   = LD80C(0x81e0f840dad61d9b,   2,  4.05871212641676821836e+00L);
+#define	c1hi	(c1hiu.e)
+#define	c1lo	(c1lou.e)
+#define	c2	(c2u.e)
+
+static const double
+c3 = -1.3352627688545895e+00, /* 0xbff55d3c, 0x7e3cbffa */
+c4 =  2.3533063035889312e-01, /* 0x3fce1f50, 0x6891bab8 */
+c5 = -2.5806891390008115e-02, /* 0xbf9a6d1f, 0x2a2043da */
+c6 =  1.9295743091289794e-03, /* 0x3f5f9d38, 0xa362e3d1 */
+c7 = -1.0463809745687055e-04, /* 0xbf1b6e24, 0xd372e145 */
+c8 =  4.3029513401606985e-06, /* 0x3ed20c42, 0x4210907c */
+c9 = -1.3777927680211844e-07; /* 0xbe827e0f, 0x55d52bbb */
+
+static const double c0 = 1;
+
+static inline long double
+__kernel_cospil(long double x)
+{
+	long double c, chi, clo, x2hi, x2lo;
+
+#if HORNER
+	/* Horner's method. */
+	long double x2;
+	x2 = x * x;
+	c = (c2 + (c3 + (c4 + (c5 + (c6 + (c7 + (c8 + c9 * x2) * x2)
+	    * x2) * x2) * x2) * x2) * x2) * x2 + c1hi;
+#else
+	/* Sort of Estrin's method. */
+	long double x2, x4;
+	x2 = x * x;
+	x4 = x2 * x2;
+	c = ((c9 * x2 + c8) * x4 + (c7 * x2 + c6)) * x2 * x4 * x4 
+	  + ((c5 * x2 + c4) * x4 + (c3 * x2 + c2)) * x2 +  c1hi;
+#endif
+	if (x2 < 0x1p-6)
+		return (c0 + c * x2);
+	x2hi =(float)x2;
+	x2lo = x2 - x2hi;
+	chi = (float)c;
+	clo = c - chi + c1lo;
+	c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);
+
+	return (c);
+}

Property changes on: lib/msun/ld80/k_cospil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld80/k_sinpil.c
===================================================================
--- lib/msun/ld80/k_sinpil.c	(nonexistent)
+++ lib/msun/ld80/k_sinpil.c	(working copy)
@@ -0,0 +1,78 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_sinpi.c for implementation details.
+ */
+
+static const union IEEEl2bits
+s0hiu = LD80C(0xc90fdaa200000000,   1,  3.14159265346825122833e+00L),
+s0lou = LD80C(0x85a308d313167b73, -33,  1.21542010130121322891e-10L),
+s1u   = LD80C(0xa55de7312df295f5,   2, -5.16771278004997002909e+00L),
+s2u   = LD80C(0xa335e33bad570e85,   1,  2.55016403987734544098e+00L);
+
+#define	s0hi	(s0hiu.e)
+#define	s0lo	(s0lou.e)
+#define	s1	(s1u.e)
+#define	s2	(s2u.e)
+
+static const double
+s3 = -5.9926452932079166e-01,	/* 0xbfe32d2c 0xce62bd82 */
+s4 =  8.2145886611090388e-02,	/* 0x3fb50783 0x487edcdb */
+s5 = -7.3704309439650085e-03,	/* 0xbf7e3074 0xfdc9c0c9 */
+s6 =  4.6630275825043123e-04,	/* 0x3f3e8f43 0x18c2a7a0 */
+s7 = -2.1914601021339888e-05,	/* 0xbef6faa7 0xea41adee */
+s8 =  7.8877623841310396e-07;	/* 0x3eaa7789 0x4aacabad */
+
+static inline long double
+__kernel_sinpil(long double x)
+{
+	long double hi, lo, sm, xhi, xlo;
+	uint64_t lx;
+	uint16_t ix;
+
+#if HORNER
+	/* Horner's method. */
+	long double x2;
+	x2 = x * x;
+	sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s8 + s7) + s6)
+	    + s5) + s4) + s3) + s2) + s1;
+#else
+	/* Sort of Estrin's method. */
+	long double a, b, x2, x4;
+	x2 = x * x;
+	x4 = x2 * x2;
+	sm = ((x2 * s8 + s7) * x4 + (x2 * s6 + s5)) * (x4 * x4)
+	   + (x2 * s4 + s3) * x4 + x2 * s2 + s1;
+#endif
+	xhi = (float)x;
+	xlo = x - xhi;
+	lo = xlo * (x + xhi) * sm;
+	hi = xhi * xhi * sm;
+	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;
+
+	return (lo + xhi * s0lo + xhi * s0hi);
+}

Property changes on: lib/msun/ld80/k_sinpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld80/s_cospil.c
===================================================================
--- lib/msun/ld80/s_cospil.c	(nonexistent)
+++ lib/msun/ld80/s_cospil.c	(working copy)
@@ -0,0 +1,118 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_cospi.c for implementation details.
+ */
+
+#ifdef __i386__
+#include <ieeefp.h>
+#endif
+#include <stdint.h>
+
+#include "fpmath.h"
+#include "math.h"
+#include "math_private.h"
+#include "k_cospil.c"
+#include "k_sinpil.c"
+
+static inline long double
+__compute_cospil(long double x)
+{
+
+	if (x < 0.5) {
+		if (x < 0.25)
+			return (__kernel_cospil(x));
+		return (__kernel_sinpil(0.5 - x));
+	} else {
+		if (x < 0.75)
+			return (-__kernel_sinpil(x - 0.5));
+		return (-__kernel_cospil(1 - x));
+	}
+}
+
+long double
+cospil(long double x)
+{
+	volatile static const double vzero = 0;
+	const double huge = 1e+300;
+	long double ax;
+	uint32_t j0;
+	uint64_t lx;
+	uint16_t hx, ix;
+
+	EXTRACT_LDBL80_WORDS(hx, lx, x);
+	ix = hx & 0x7fff;
+	INSERT_LDBL80_WORDS(ax, ix, lx);
+
+	ENTERI();
+
+	if (ix < 0x3fff) {		/* |x| < 1 */
+		if (ix < 0x3fdd) {	/* |x| < 0x1p-34 */
+			if (huge + x > 0)
+				RETURNI(1);
+		}
+		if (ix == 0x3ffe && lx == 0x8000000000000000ull)
+			RETURNI(0);
+		RETURNI(__compute_cospil(ax));
+	}
+
+	if (ix < 0x403e) {		/* 1 <= |x| < 0x1p63 */
+		/* Determine integer part of ax. */
+		j0 = ix - 0x3fff + 1;
+		if (j0 < 32) {
+			lx = (lx >> 32) << 32;
+			lx &= ~(((lx << 32)-1) >> j0);
+		} else {
+			uint64_t m = (uint64_t)-1 >> (j0 + 1);
+			if (lx & m) lx &= ~m;
+		}
+		INSERT_LDBL80_WORDS(x, ix, lx);
+
+		ax -= x;
+		if (ax == 0.5)
+			RETURNI(0);
+
+		ax = ax == 0 ? 1 : __compute_cospil(ax);
+
+		if (j0 > 40) x -= 0x1p40;
+		if (j0 > 30) x -= 0x1p30;
+		j0 = (uint32_t)x;
+
+		return (j0 & 1 ? -ax : ax);
+	}
+
+	if (ix >= 0x7fff)
+		RETURNI(vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p63 is always an even integer, so return 1.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	if (ax + 1 > 1)
+		ax = 1;
+	return (ax);
+}

Property changes on: lib/msun/ld80/s_cospil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld80/s_sinpil.c
===================================================================
--- lib/msun/ld80/s_sinpil.c	(nonexistent)
+++ lib/msun/ld80/s_sinpil.c	(working copy)
@@ -0,0 +1,123 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_sinpi.c for implementation details.
+ */
+
+#ifdef __i386__
+#include <ieeefp.h>
+#endif
+#include <stdint.h>
+
+#include "fpmath.h"
+#include "math.h"
+#include "math_private.h"
+#include "k_cospil.c"
+#include "k_sinpil.c"
+
+static inline long double
+__compute_sinpil(long double x)
+{
+
+	if (x <= 0.5) {
+		if (x <= 0.25)
+			return (__kernel_sinpil(x));
+		return (__kernel_cospil(0.5 - x));
+	} else {
+		if (x <= 0.75)
+			return (__kernel_cospil(x - 0.5));
+		return (__kernel_sinpil(1 - x));
+	}
+}
+
+static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */
+static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */
+
+long double
+sinpil(long double x)
+{
+	volatile static const double vzero = 0;
+	long double ax;
+	uint32_t j0;
+	uint64_t lx;
+	uint16_t hx, ix;
+
+	EXTRACT_LDBL80_WORDS(hx, lx, x);
+	ix = hx & 0x7fff;
+	INSERT_LDBL80_WORDS(ax, ix, lx);
+
+	ENTERI();
+
+	if (ix < 0x3fff) {		/* |x| < 1 */
+		if (ix < 0x3fdc) {	/* |x| < 0x1p-35 */
+			if (x == 0)
+				RETURNI(x);
+			INSERT_LDBL80_WORDS(ax, hx, (lx >> 32) << 32);
+			x -= ax;
+			RETURNI(pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		ax = __compute_sinpil(ax);
+		RETURNI((hx & 0x8000) ? -ax : ax);
+	}
+
+	if (ix < 0x403e) {		/* 1 <= |x| < 0x1p63 */
+		/* Determine integer part of ax. */
+		j0 = ix - 0x3fff + 1;
+		if (j0 < 32) {
+			lx = (lx >> 32) << 32;
+			lx &= ~(((lx << 32)-1) >> j0);
+		} else {
+			uint64_t m = (uint64_t)-1 >> (j0 + 1);
+			if (lx & m) lx &= ~m;
+		}
+		INSERT_LDBL80_WORDS(x, ix, lx);
+
+		ax -= x;
+		if (ax == 0) {
+			ax = 0;
+		} else {
+			ax = __compute_sinpil(ax);
+			if (j0 > 40) x -= 0x1p40;
+			if (j0 > 30) x -= 0x1p30;
+			j0 = (uint32_t)x;
+			if (j0 & 1) ax = -ax;
+		}
+		RETURNI((hx & 0x8000) ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7fff)
+		RETURNI(vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p63 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	if (ax + 1 > 1)
+		ax = copysignl(0, x);
+	RETURNI(ax);
+}

Property changes on: lib/msun/ld80/s_sinpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/ld80/s_tanpil.c
===================================================================
--- lib/msun/ld80/s_tanpil.c	(nonexistent)
+++ lib/msun/ld80/s_tanpil.c	(working copy)
@@ -0,0 +1,195 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_tanpi.c for implementation details.
+ */
+
+#ifdef __i386__
+#include <ieeefp.h>
+#endif
+#include <stdint.h>
+
+#include "fpmath.h"
+#include "math.h"
+#include "math_private.h"
+#include "k_cospil.c"
+#include "k_sinpil.c"
+
+static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */
+static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */
+
+static const union IEEEl2bits
+t00hiu = LD80C(0xc90fdaa22168c235,   1,  3.14159265358979323851e+00L),
+t00lou = LD80C(0xecdfa9e9520ede79, -65, -5.01599541779735313611e-20L),
+t01hiu = LD80C(0xa55de7312df295f5,   3,  1.03354255600999400582e+01L),
+t01lou = LD80C(0x8d51cc1ca8412de2, -62,  2.39404580332694307957e-19L),
+t02hiu = LD80C(0xa335e33bad570ec7,   5,  4.08026246380375272847e+01L),
+t02lou = LD80C(0xe72db1e59703a7ba, -60, -1.56652642070969805771e-18L),
+t03u   = LD80C(0xa2fffcda474ac1d3,   7,  1.62999951975255277944e+02L),
+t04u   = LD80C(0xa2fa3971d7988829,   9,  6.51909756145994949750e+02L),
+t05u   = LD80C(0xa2f99792aec80f33,  11,  2.60759950512193695427e+03L),
+t06u   = LD80C(0xa2f985abafb74d99,  13,  1.04303805377441052764e+04L),
+t07u   = LD80C(0xa2f983819a4712aa,  15,  4.17215136963294523085e+04L);
+#define	t00	(t00hiu.e)
+#define	t00lo	(t00lou.e)
+#define	t01	(t01hiu.e)
+#define	t01lo	(t01lou.e)
+#define	t02	(t02hiu.e)
+#define	t02lo	(t02lou.e)
+#define	t03	(t03u.e)
+#define	t04	(t04u.e)
+#define	t05	(t05u.e)
+#define	t06	(t06u.e)
+#define	t07	(t07u.e)
+
+static const double
+t08 =  1.6688612339742415e+05, /* 0x41045f30, 0xfcb7c9e9 */
+t09 =  6.6753884815733030e+05, /* 0x41245f25, 0xb241ad77 */
+t10 =  2.6704954502371075e+06, /* 0x41445fcf, 0xb9a15e9a */
+t11 =  1.0666012381827524e+07, /* 0x41645803, 0x8c37ee5b */
+t12 =  4.3253268125143178e+07, /* 0x41849ff0, 0xa1004b11 */
+t13 =  1.5587209092892912e+08, /* 0x41a294d6, 0xb5db9c99 */
+t14 =  1.0162479548145952e+09, /* 0x41ce495b, 0x496844a8 */
+t15 = -2.9808831901354914e+09, /* 0xc1e63595, 0x5ec455f2 */
+t16 =  8.5833091951381027e+10, /* 0x4233fc0d, 0x0b6f618b */
+t17 = -6.8395913998321326e+11, /* 0xc263e7e4, 0x87d1e6d3 */
+t18 =  5.1641660627562871e+12, /* 0x4292c981, 0x228a9126 */
+t19 = -2.1258706455671141e+13, /* 0xc2b355ad, 0xa58c7724 */
+t20 =  5.2041706779955102e+13; /* 0x42c7aa73, 0xb91a998d */
+
+static inline long double
+__kernel_tanpil(long double x)
+{
+	long double t, xhi, xlo;
+	uint64_t lx;
+	uint16_t ix;
+
+	xhi = (float)x;
+	xlo = x - xhi;
+	xlo *= (xlo + xhi + xhi);
+	xhi *= xhi;
+	
+	if (x < 0.25) {
+#if HORNER
+		long double a, b, c, x2;
+		x2 = xlo + xhi;
+		a = x2 * (x2 * (x2 * (x2 * t20 + t19) + t18) + t17) + t16;
+		b = x2 * (x2 * (x2 * (x2 * a   + t15) + t14) + t13) + t12;
+		c = x2 * (x2 * (x2 * (x2 * b   + t11) + t10) + t09) + t08;
+		t = x2 * (x2 * (x2 * (x2 * c   + t07) + t06) + t05);
+#else
+		long double d, x2, x4;
+		double a, b, c, d2, d4;
+		d2 = x2 = xlo + xhi;
+		d4 = x4 = x2 * x2;
+		a = ((d2 * t20 + t19) * d4 + (d2 * t18 + t17)) * d4;
+		b = ((d2 * t16 + t15) * d4 + (d2 * t14 + t13));
+		c = ((d2 * t12 + t11) * d4 + (d2 * t10 + t09));
+		d = ((x2 * t08 + t07) * x4 + (x2 * t06 + t05));
+		t = (((a * d4 + b) * (x4 * x4) + c) * (x4 * x4) + d) * x2;
+#endif
+		t = t02lo + x2 * (x2 * (t + t04) + t03) + t02;
+		t = t01lo + xlo * t + xhi * t + t01;
+		t = t00lo + xlo * t + xhi * t + t00;
+		return (t * x);
+	} else if (x > 0.25) {
+		x = 0.5 - x;
+		t = __kernel_cospil(x) / __kernel_sinpil(x);
+	} else
+		t = 1;
+	return (t);
+}
+
+static inline long double
+__compute_tanpil(long double x)
+{
+
+	return (x < 0.5 ? __kernel_tanpil(x) : -__kernel_tanpil(1 - x));
+}
+
+long double
+tanpil(long double x)
+{
+	volatile static const double vzero = 0;
+	long double ax;
+	uint32_t j0;
+	uint64_t lx;
+	uint16_t hx, ix;
+
+	EXTRACT_LDBL80_WORDS(hx, lx, x);
+	ix = hx & 0x7fff;
+	INSERT_LDBL80_WORDS(ax, ix, lx);
+
+	ENTERI();
+
+	if (ix < 0x3fff) {		/* |x| < 1 */
+		if (ix < 0x3fdd) {	/* |x| < 0x1p-34 */
+			if (x == 0)
+				RETURNI(x);
+			INSERT_LDBL80_WORDS(ax, ix, (lx >> 32) << 32);
+			x -= ax;
+			RETURNI(pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		if (ix == 0x3ffe && lx == 0x8000000000000000ull)
+			RETURNI((x - x) / (x - x));
+		ax = __compute_tanpil(ax);
+		RETURNI((hx & 0x8000) ? -ax : ax);
+	}
+
+	if (ix < 0x403e) {		/* 1 <= |x| < 0x1p63 */
+		/* Determine integer part of ax. */
+		j0 = ix - 0x3fff + 1;
+		if (j0 < 32) {
+			lx = (lx >> 32) << 32;
+			lx &= ~(((lx << 32)-1) >> j0);
+		} else {
+			uint64_t m = (uint64_t)-1 >> (j0 + 1);
+			if (lx & m) lx &= ~m;
+		}
+		INSERT_LDBL80_WORDS(x, ix, lx);
+
+		ax -= x;
+
+		if (ax == 0.5)
+			RETURNI((x - x) / (x - x));
+
+		ax = ax == 0 ? 0 : __compute_tanpil(ax);
+		RETURNI((hx & 0x8000) ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7fff)
+		RETURNI(vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p63 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	if (ax + 1 > 1)
+		ax = copysignl(0, x);
+	RETURNI(ax);
+}

Property changes on: lib/msun/ld80/s_tanpil.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/man/cospi.3
===================================================================
--- lib/msun/man/cospi.3	(nonexistent)
+++ lib/msun/man/cospi.3	(working copy)
@@ -0,0 +1,111 @@
+.\" Copyright (c) 2017 Steven G. Kargl <kargl@FreeBSD.org>
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd April 1, 2017
+.Dt COSPI 3
+.Os
+.Sh NAME
+.Nm cospi ,
+.Nm cospif ,
+.Nm cospil
+.Nd half\(encycle cosine functions
+.Sh LIBRARY
+.Lb libm
+.Sh SYNOPSIS
+.In math.h
+.Ft double
+.Fn cospi "double x"
+.Ft float
+.Fn cospif "float x"
+.Ft long double
+.Fn cospil "long double x"
+.Sh DESCRIPTION
+The
+.Fn cospi ,
+.Fn cospif ,
+and
+.Fn cospil
+functions compute the cosine of
+.Fa "\(*p \(mu x" .
+and measure angles in half-cycles.
+.Sh RETURN VALUES
+The
+.Fn cospi ,
+.Fn cospif ,
+and
+.Fn cospil
+functions returns
+.Fn cos "\(*p \(mu x" .
+If \*(Bax\*(Ba \*(Ge 2^(p - 1)
+where p is the floating\(enpoint precision of 
+.Ar x ,
+then the returned value is 1 and it has no significance.
+.Sh SPECIAL VALUES
+.Bl -tag
+.It
+.Fn cospi \*(Pm0
+returns 1.
+.It
+.Fn cospi \*(Pmn/2
+returns 0 for positive integers
+.Ar n .
+.It
+.Fn cospi n
+returns 1 for even integers
+.Ar n .
+.It
+.Fn cospi n
+returns \-1 for odd integers
+.Ar n .
+.It
+.Fn cospi \*(Pm\(if
+return an \*(Na and raises an FE_INVALID exception.
+.It
+.Fn cospi \*(Na
+return an \*(Na and raises an FE_INVALID exception.
+.El
+.Sh SEE ALSO
+.Xr cos 3 ,
+.Xr fenv 3 ,
+.Xr math 3 ,
+.Xr sin 3 ,
+.Xr sinpi 3 ,
+.Xr tan 3 ,
+.Xr tanpi 3
+.Sh AUTHORS
+The half\(encycle trignometric functions were written by
+.An Steven G. Kargl Aq Mt kargl@FreeBSD.org .
+.Sh STANDARDS
+These functions conform to
+IEEE Std 754\(tm\(en2008 ,
+\(dqIEEE Standard for Floating-Point Arithmetic\(dq
+and to
+ISO/IEC TS 18661-4 , 
+\(dqInformation technology \(em Programming languages, their environments,
+and system software interfaces \(em Floating\(enpoint extensions for
+C\(dq \(em Part 4: Supplementary functions.
+
+

Property changes on: lib/msun/man/cospi.3
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/man/sinpi.3
===================================================================
--- lib/msun/man/sinpi.3	(nonexistent)
+++ lib/msun/man/sinpi.3	(working copy)
@@ -0,0 +1,102 @@
+.\" Copyright (c) 2017 Steven G. Kargl <kargl@FreeBSD.org>
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd April 1, 2017
+.Dt SINPI 3
+.Os
+.Sh NAME
+.Nm sinpi ,
+.Nm sinpif ,
+.Nm sinpil
+.Nd half\(encycle sine functions
+.Sh LIBRARY
+.Lb libm
+.Sh SYNOPSIS
+.In math.h
+.Ft double
+.Fn sinpi "double x"
+.Ft float
+.Fn sinpif "float x"
+.Ft long double
+.Fn sinpil "long double x"
+.Sh DESCRIPTION
+The
+.Fn sinpi ,
+.Fn sinpif ,
+and
+.Fn sinpil
+functions compute the sine of
+.Fa "\(*p \(mu x" .
+and measure angles in half-cycles.
+.Sh RETURN VALUES
+The
+.Fn sinpi ,
+.Fn sinpif ,
+and
+.Fn sinpil
+functions returns
+.Fn sin "\(*p \(mu x" .
+If \*(Bax\*(Ba \*(Ge 2^(p - 1)
+where p is the floating\(enpoint precision of 
+.Ar x ,
+then the returned value is \*(Pm0 and it has no significance.
+.Sh SPECIAL VALUES
+.Bl -tag
+.It
+.Fn sinpi \*(Pm0
+returns \*(Pm0.
+.It
+.Fn sinpi \*(Pmn
+returns \*(Pm0 for positive integers
+.Ar n .
+.It
+.Fn sinpi \*(Pm\(if
+return an \*(Na and raises an FE_INVALID exception.
+.It
+.Fn sinpi \*(Na
+return an \*(Na and raises an FE_INVALID exception.
+.El
+.Sh SEE ALSO
+.Xr cos 3 ,
+.Xr cospi 3 ,
+.Xr fenv 3 ,
+.Xr math 3 ,
+.Xr sin 3 ,
+.Xr tan 3 ,
+.Xr tanpi 3
+.Sh AUTHORS
+The half\(encycle trignometric functions were written by
+.An Steven G. Kargl Aq Mt kargl@FreeBSD.org .
+.Sh STANDARDS
+These functions conform to
+IEEE Std 754\(tm\(en2008 ,
+\(dqIEEE Standard for Floating-Point Arithmetic\(dq
+and to
+ISO/IEC TS 18661-4 , 
+\(dqInformation technology \(em Programming languages, their environments,
+and system software interfaces \(em Floating\(enpoint extensions for
+C\(dq \(em Part 4: Supplementary functions.
+

Property changes on: lib/msun/man/sinpi.3
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/man/tanpi.3
===================================================================
--- lib/msun/man/tanpi.3	(nonexistent)
+++ lib/msun/man/tanpi.3	(working copy)
@@ -0,0 +1,106 @@
+.\" Copyright (c) 2017 Steven G. Kargl <kargl@FreeBSD.org>
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd April 1, 2017
+.Dt TANPI 3
+.Os
+.Sh NAME
+.Nm tanpi ,
+.Nm tanpif ,
+.Nm tanpil
+.Nd half\(encycle tangent functions
+.Sh LIBRARY
+.Lb libm
+.Sh SYNOPSIS
+.In math.h
+.Ft double
+.Fn tanpi "double x"
+.Ft float
+.Fn tanpif "float x"
+.Ft long double
+.Fn tanpil "long double x"
+.Sh DESCRIPTION
+The
+.Fn tanpi ,
+.Fn tanpif ,
+and
+.Fn tanpil
+functions compute the tangent of
+.Fa "\(*p \(mu x" 
+and measure angles in half-cycles.
+.Sh RETURN VALUES
+The
+.Fn tanpi ,
+.Fn tanpif ,
+and
+.Fn tanpil
+functions returns
+.Fn tan "\(*p \(mu x" .
+If \*(Bax\*(Ba \*(Ge 2^(p - 1)
+where p is the floating\(enpoint precision of 
+.Ar x ,
+then the returned value is \*(Pm0 and it has no significance.
+.Sh SPECIAL VALUES
+.Bl -tag
+.It
+.Fn tanpi \*(Pm0
+returns \*(Pm0.
+.It
+.Fn tanpi \*(Pmn
+returns \*(Pm0 for positive integers
+.Ar n .
+.It
+.Fn tanpi \*(Pmn/2
+returns \*(Na for n > 0 and raises an FE_INVALID exception.
+.It
+.Fn tanpi \*(Pm\(if
+return an \*(Na and raises an FE_INVALID exception.
+.It
+.Fn tanpi \*(Na
+return an \*(Na and raises an FE_INVALID exception.
+.El
+.Sh SEE ALSO
+.Xr cos 3 ,
+.Xr cospi 3 ,
+.Xr fenv 3 ,
+.Xr math 3 ,
+.Xr sin 3 ,
+.Xr sinpi 3 ,
+.Xr tan 3 ,
+.Sh AUTHORS
+The half\(encycle trignometric functions were written by
+.An Steven G. Kargl Aq Mt kargl@FreeBSD.org .
+.Sh STANDARDS
+These functions conform to
+IEEE Std 754\(tm\(en2008 ,
+\(dqIEEE Standard for Floating-Point Arithmetic\(dq
+and to
+ISO/IEC TS 18661-4 , 
+\(dqInformation technology \(em Programming languages, their environments,
+and system software interfaces \(em Floating\(enpoint extensions for
+C\(dq \(em Part 4: Supplementary functions.
+
+

Property changes on: lib/msun/man/tanpi.3
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/k_cospi.c
===================================================================
--- lib/msun/src/k_cospi.c	(nonexistent)
+++ lib/msun/src/k_cospi.c	(working copy)
@@ -0,0 +1,83 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Compute cospi(x) = cos(pi*x) via a polynomial approximation.  Coefficients 
+ * were determined for the following function:
+ *
+ * f(x) = ((cos(pi*x) - 1) / (x*x)) = c1 + c2*x**2 + c3*x**4 + ...
+ *
+ * with an Remes algorithm.  If |x*x| < 0x1p-6, the polynomial can be summed
+ * using a Horner's method.  If |x*x| exceeds 0x1p-6, then to accumulate the
+ * result with sufficient accuracy the c1 coefficient and final
+ * multiplications needed to be handled specially.
+ *
+ * x2 = x * x
+ * C(x2) = c1hi + c2*x2 + c3*x2**2 + ...
+ *       = c1hi + (c2 + (c3 ...) * x2) * x2
+ * cospi(x) = c0 + x2 * C(x2)
+ */
+
+static const double
+c0   =  1.0000000000000000e+00,	/* 0x3ff00000 0x00000000 */
+c1hi = -4.9348022005446790e+00,	/* 0xc013bd3c 0xc9be45de */
+c1lo = -3.1132368291465357e-16,	/* 0xbcb66ee8 0x86887579 */
+c2   =  4.0587121264167649e+00,	/* 0x40103c1f 0x081b5ac0 */
+c3   = -1.3352627688538099e+00,	/* 0xbff55d3c 0x7e3cb243 */
+c4   =  2.3533063028402096e-01,	/* 0x3fce1f50 0x68689166 */
+c5   = -2.5806887965588832e-02,	/* 0xbf9a6d1e 0xef4b8280 */
+c6   =  1.9294938875092037e-03,	/* 0x3f5f9ce2 0x49428066 */
+c7   = -1.0370089699114971e-04;	/* 0xbf1b2f3f 0xd835d4fe */
+
+static inline double
+__kernel_cospi(double x)
+{
+	double c, chi, clo, x2hi, x2lo;
+#if HORNER
+	/* Horner's method. */
+	double x2;
+	x2 = x * x;
+	c = (c2 + (c3 + (c4 + (c5 + (c6 + c7 * x2) * x2) * x2) * x2)
+	    * x2) * x2 + c1hi;
+#else
+	/* Sort of Estrin's method. */
+	double x2, x4;
+	x2 = x * x;
+	x4 = x2 * x2;
+	c = (c5 + c6 * x2 + c7 * x4) * (x4 * x4)
+	    + (c2 + c3 * x2 + c4 * x4) * x2 + c1hi;
+#endif
+	if (x2 < 0x1p-6)
+		return (c0 + c * x2);
+
+	x2hi = (float)x2;
+	x2lo = x2 - x2hi;
+	chi = (float)c;
+	clo = c - chi + c1lo;
+	c = clo * x2lo + (chi * x2lo + clo * x2hi) + (chi * x2hi + c0);
+
+	return (c);
+}

Property changes on: lib/msun/src/k_cospi.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/k_cospif.c
===================================================================
--- lib/msun/src/k_cospif.c	(nonexistent)
+++ lib/msun/src/k_cospif.c	(working copy)
@@ -0,0 +1,60 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_cospi.c for implementation details.
+ */
+
+static const float
+c0   =  1.00000000e+00f,	/* 0x3fc00000 */
+c1hi = -4.93480206e+00f,	/* 0xc09de9e6 0xb418a08d */
+c1lo = -1.42145112e-07f,	/* 0xc09de9e6 0xb418a08d */
+c2   =  4.05871058e+00f,	/* 0x4081e0f5 0xb1fc37ea */
+c3   = -1.33513784e+00f,	/* 0xbfaae5cc 0xb375e792 */
+c4   =  2.32126340e-01f;	/* 0x3e6db287 0x31daa25f */
+
+static inline float
+__kernel_cospif(float x)
+{
+	float c, chi, clo, x2, x2hi, x2lo;
+	uint32_t ix;
+
+	x2 = x * x;
+	c = (c2 + (c3 + c4 * x2) * x2) * x2 + c1hi;
+
+	GET_FLOAT_WORD(ix, x2);
+	if (ix < 0x3c800000)		/* |x2| < 0x1p-6 */
+		return (c0 + c * x2);
+
+	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
+	x2lo = x2 - x2hi;
+	GET_FLOAT_WORD(ix, c);
+	SET_FLOAT_WORD(chi, (ix >> 14) << 14);
+	clo = c - chi + c1lo;
+	c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);
+
+	return (c);
+}

Property changes on: lib/msun/src/k_cospif.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/k_sinpi.c
===================================================================
--- lib/msun/src/k_sinpi.c	(nonexistent)
+++ lib/msun/src/k_sinpi.c	(working copy)
@@ -0,0 +1,84 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * Compute sinpi(x) = sin(pi*x) via a polynomial approximation.  Coefficients 
+ * were determined for the following function:
+ *
+ * f(x) = sin(pi*x) / x = s0 + s1*x**2 + s2*x**4 + ...
+ *
+ * with an Remes algorithm.  To accumulate the result with sufficient
+ * accuracy the s0 coefficient and final multiplications needed to be
+ * handled specially.
+ *
+ * x2 = x * x
+ * S(x2) = s1 + s2*x2 + s3*x2**2 + ...
+ *       = s1 + (s2 + (s3 ...) * x2) * x2
+ * sinpi(x) = x * (s0 + x2 * S(x2))
+ *
+ * The final multiplications and additions in the above expression for
+ * sinpi(x) are done in an extra precision arithmetic.
+ */
+
+static const double 
+s0hi =  3.1415926218032837e+00,	/* 0x400921fb 0x50000000 */
+s0lo =  3.1786509547050787e-08,	/* 0x3e6110b4 0x611a5f14 */
+s1   = -5.1677127800499703e+00,	/* 0xc014abbc 0xe625be53 */
+s2   =  2.5501640398773415e+00,	/* 0x400466bc 0x6775aad9 */
+s3   = -5.9926452932029806e-01,	/* 0xbfe32d2c 0xce62ac24 */
+s4   =  8.2145886580065136e-02,	/* 0x3fb50783 0x485cc006 */
+s5   = -7.3704298849218506e-03,	/* 0xbf7e3074 0xb502de6a */
+s6   =  4.6628273194622205e-04,	/* 0x3f3e8eed 0x159b24d7 */
+s7   = -2.1717412527405332e-05;	/* 0xbef6c5b9 0x3995dfbe */
+
+static inline double
+__kernel_sinpi(double x)
+{
+	double hi, lo, sm, xhi, xlo;
+	uint32_t hx, lx;
+#if HORNER
+	double x2;
+	/* Horner's method. */
+	x2 = x * x;
+	sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
+	    + s2) + s1;
+#else
+	double x2, x4;
+	/* Sort of Estrin's method. */
+	x2 = x * x;
+	x4 = x2 * x2;
+	sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4)
+	    + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
+#endif
+	EXTRACT_WORDS(hx, lx, x);
+	INSERT_WORDS(xhi, hx, 0);
+	xlo = x - xhi;
+	lo = xlo * (x + xhi) * sm;
+	hi = xhi * xhi * sm;
+	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;
+
+	return (lo + xhi * s0lo + xhi * s0hi);
+}

Property changes on: lib/msun/src/k_sinpi.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/k_sinpif.c
===================================================================
--- lib/msun/src/k_sinpif.c	(nonexistent)
+++ lib/msun/src/k_sinpif.c	(working copy)
@@ -0,0 +1,56 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/k_sinpi.c for implementation details.
+ */
+
+static const float
+s0hi =  3.14062500e+00f,	/* 0x40490000 */
+s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
+s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
+s2   =  2.55016255e+00f,	/* 0x402335dd */
+s3   = -5.99202096e-01f,	/* 0xbf19654f */
+s4   =  8.10018554e-02f;	/* 0x3da5e44d */
+
+static inline float
+__kernel_sinpif(float x)
+{
+	float hi, lo, sm, x2, xhi, xlo;
+	uint32_t ix;
+
+	GET_FLOAT_WORD(ix, x);
+	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
+	xlo = x - xhi;
+
+	x2 = x * x;
+	sm = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
+	lo = xlo * (x + xhi) * sm;
+	hi = xhi * xhi * sm;
+	lo = x * lo + xlo * (s0hi + s0lo) + x * hi;
+
+	return (lo + xhi * s0lo + xhi * s0hi);
+}

Property changes on: lib/msun/src/k_sinpif.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/math.h
===================================================================
--- lib/msun/src/math.h	(revision 317529)
+++ lib/msun/src/math.h	(working copy)
@@ -500,6 +500,15 @@
 
 #if __BSD_VISIBLE
 long double	lgammal_r(long double, int *);
+double		cospi(double);
+float		cospif(float);
+long double	cospil(long double);
+double		sinpi(double);
+float		sinpif(float);
+long double	sinpil(long double);
+double		tanpi(double);
+float		tanpif(float);
+long double	tanpil(long double);
 #endif
 
 __END_DECLS
Index: lib/msun/src/s_cospi.c
===================================================================
--- lib/msun/src/s_cospi.c	(nonexistent)
+++ lib/msun/src/s_cospi.c	(working copy)
@@ -0,0 +1,147 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/**
+ * cospi(x) computes cos(pi*x) without multiplication by pi (almost).  First,
+ * note that cospi(-x) = cospi(x), so the algorithm considers only |x|.  The
+ * method used depends on the magnitude of x.
+ *
+ * 1. For small |x|, cospi(x) = 1 with FE_INEXACT raised where a sloppy
+ *    threshold is used.  The threshold is |x| < 0x1pN with N = -(P/2+M).
+ *    P is the precision of the floating-point type and M = 3 or 4.
+ *
+ * 2. For |x| < 1, argument reduction is not required and cospi(x) is 
+ *    computed by polynomial approximations in the the interval [0,0.25].
+ *    See k_sinpi.c and k_cospi.c for the details for these polynomial
+ *    approximations.
+ *
+ * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where
+ *    |x| = j0 + r with j0 an integer and the remainder r statisfies
+ *    0 <= r < 1.  With the given domain, a simplified inline floor(x)
+ *    is used.
+ *
+ *    cospi(x) = cos(pi*(j0+r))
+ *             = cos(pi*j0) * cos(pi*r) - sin(pi*j0) * sin(pi*r)
+ *             = cos(pi*j0) * cos(pi*r)
+ *             = +-cospi(r)
+ *
+ *    If j0 is even, then cos(pi*j0) = 1. If j0 is odd, then cos(pi*j0) = -1.
+ *    cospi(r) is then computed via polynomial approximation.
+ *
+ * 4. For |x| >= 0x1p(P-1), |x| is integral and cospi(x) = 1.
+ *
+ * 5. Special cases:
+ *
+ *    cospi(+-0) = 1.
+ *    cospi(n.5) = 0 for n an integer.
+ *    cospi(+-inf) = nan.  Raises the "invalid" floating-point exception.
+ *    cospi(nan) = nan.  Raises the "invalid" floating-point exception.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospi.c"
+#include "k_sinpi.c"
+
+/*
+ * To compute cospi(x) for 0 < x < 1, use trignometric identities to map
+ * cospi(x) into the [0,0.25] quadrant of sinpi(r) or cospi(r).
+ */
+static inline double
+__compute_cospi(double x)
+{
+
+	if (x < 0.5) {
+		if (x < 0.25)
+			return (__kernel_cospi(x));
+		return (__kernel_sinpi(0.5 - x));
+	} else {
+		if (x < 0.75)
+			return (-__kernel_sinpi(x - 0.5));
+		return (-__kernel_cospi(1 - x));
+	}
+}
+
+double
+cospi(double x)
+{
+	volatile static const double vzero = 0;
+	static const double huge = 1e+300;
+	double ax;
+	uint32_t hx, ix, j0, lx;
+
+	EXTRACT_WORDS(hx, lx, x);
+	ix = hx & 0x7fffffff;
+	INSERT_WORDS(ax, ix, lx);
+
+	if (ix < 0x3ff00000) {		/* |x| < 1 */
+		if (ix < 0x3e200000) {	/* |x| < 0x1p-29 */
+			if (huge + x > 0)
+				return (1);
+		}
+		if ((ix | lx) == 0x3fe00000)
+			return (0);
+		return (__compute_cospi(ax));
+	}
+
+	if (ix < 0x43300000) {		/* 1 <= |x| < 0x1p52 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
+		if (j0 < 20) {
+			ix &= ~(0x000fffff >> j0);
+			lx = 0;
+		} else {
+			lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
+		}
+		INSERT_WORDS(x, ix, lx);
+
+		ax -= x;
+		if (ax == 0.5)
+			return (0);
+
+		ax = ax == 0 ? 1 : __compute_cospi(ax);
+
+		if (j0 > 30) x -= 0x1p30;
+		j0 = (uint32_t)x;
+
+		return (j0 & 1 ? -ax : ax);
+	}
+
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p52 is always an even integer, so return 1.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID?
+	 */
+	if (ax + 1 > 1)
+		ax = 1;
+	return (ax);
+}
+
+#if LDBL_MANT_DIG == 53
+__weak_reference(cospi, cospil);
+#endif

Property changes on: lib/msun/src/s_cospi.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/s_cospif.c
===================================================================
--- lib/msun/src/s_cospif.c	(nonexistent)
+++ lib/msun/src/s_cospif.c	(working copy)
@@ -0,0 +1,99 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_cospi.c for implementation details.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospif.c"
+#include "k_sinpif.c"
+
+static inline float
+__compute_cospif(float x)
+{
+
+	if (x <= 0.5f) {
+		if (x <= 0.25f)
+			return (__kernel_cospif(x));
+		return (__kernel_sinpif(0.5f - x));
+	} else {
+		if (x <= 0.75f)
+			return (-__kernel_sinpif(x - 0.5f));
+		return (-__kernel_cospif(1 - x));
+	}
+}
+
+float
+cospif(float x)
+{
+	volatile static const float vzero = 0;
+	static const float huge = 1e30; 
+	float ax;
+	uint32_t ix, j0;
+
+	GET_FLOAT_WORD(ix, x);
+	ix = ix & 0x7fffffff;
+	SET_FLOAT_WORD(ax, ix);
+
+	if (ix < 0x3f800000) {			/* |x| < 1 */
+		if (ix < 0x39000000) {		/* |x| < 0x1p-13 */
+			if (huge + ax > 0)	/* Raise inexact iff != 0. */
+				return (1);
+		}
+		return (__compute_cospif(ax));
+	}
+
+	if (ix < 0x4b000000) {			/* 1 <= |x| < 0x1p23 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 23) & 0xff) - 0x7f;
+		ix &= ~(0x007fffff >> j0);
+		SET_FLOAT_WORD(x, ix);
+
+		ax -= x;
+		if (ax == 0.5f)
+			return (0);
+
+		ax = ax == 0 ? 1 : __compute_cospif(ax);
+
+		ix = (uint32_t)x;
+
+		return (ix & 1 ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p23 is always an even integer, so return 1.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	if (ax + 1 > 1)
+		ax = 1;
+	return (ax);
+}

Property changes on: lib/msun/src/s_cospif.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/s_sinpi.c
===================================================================
--- lib/msun/src/s_sinpi.c	(nonexistent)
+++ lib/msun/src/s_sinpi.c	(working copy)
@@ -0,0 +1,154 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/**
+ * sinpi(x) computes sin(pi*x) without multiplication by pi (almost).  First,
+ * note that sinpi(-x) = -sinpi(x), so the algorithm considers only |x| and
+ * includes reflection symmetry by considering the sign of x on output.  The
+ * method used depends on the magnitude of x.
+ *
+ * 1. For small |x|, sinpi(x) = pi * x where a sloppy threshold is used.  The
+ *    threshold is |x| < 0x1pN with N = -(P/2+M).  P is the precision of the
+ *    floating-point type and M = 3 or 4.  To achieve high accuracy, pi is 
+ *    decomposed into high and low parts with the high part containing a
+ *    number of trailing zero bits.  x is also split into high and low parts.
+ *
+ * 2. For |x| < 1, argument reduction is not required and sinpi(x) is 
+ *    computed by polynomial approximations in the the interval [0,0.25].
+ *    See k_sinpi.c and k_cospi.c for the details for these polynomial
+ *    approximations.
+ *
+ * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where
+ *    |x| = j0 + r with j0 an integer and the remainder r satisifies
+ *    0 <= r < 1.  With the given domain, a simplified inline floor(x)
+ *    is used.
+ *
+ *    sinpi(x) = sin(pi*(j0+r))
+ *             = sin(pi*j0) * cos(pi*r) + cos(pi*j0) * sin(pi*r)
+ *             = cos(pi*j0) * sin(pi*r)
+ *             = +-sinpi(r)
+ *
+ *    If j0 is even, then cos(pi*j0) = 1. If j0 is odd, then cos(pi*j0) = -1.
+ *    sinpi(r) is then computed via polynomial approximation.
+ *
+ * 4. For |x| >= 0x1p(P-1), |x| is integral and sinpi(x) = copysign(0,x).
+ *
+ * 5. Special cases:
+ *
+ *    sinpi(+-0) = +-0
+ *    sinpi(+-n) = +-0, for positive integers n.
+ *    sinpi(+-inf) = nan.  Raises the "invalid" floating-point exception.
+ *    sinpi(nan) = nan.  Raises the "invalid" floating-point exception.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospi.c"
+#include "k_sinpi.c"
+
+static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */
+static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */
+
+/*
+ * To compute sinpi(x) for 0 < x < 1, use trignometric identities to map
+ * sinpi(x) into the [0,0.25] quadrant of sinpi(r) or cospi(r).
+ */
+static inline double
+__compute_sinpi(double x)
+{
+
+	if (x < 0.5) {
+		if (x < 0.25)
+			return (__kernel_sinpi(x));
+		return (__kernel_cospi(0.5 - x));
+	} else {
+		if (x < 0.75)
+			return (__kernel_cospi(x - 0.5));
+		return (__kernel_sinpi(1 - x));
+	}
+}
+
+double
+sinpi(double x)
+{
+	volatile static const double vzero = 0;
+	double ax;
+	uint32_t hx, ix, j0, lx;
+
+	EXTRACT_WORDS(hx, lx, x);
+	ix = hx & 0x7fffffff;
+	INSERT_WORDS(ax, ix, lx);
+
+	if (ix < 0x3ff00000) {		/* |x| < 1 */
+		if (ix < 0x3e300000) {	/* |x| < 0x1p-28 */
+			if (x == 0)
+				return (x);
+			INSERT_WORDS(ax, hx, 0);
+			x -= ax;
+			return (pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		ax = __compute_sinpi(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	if (ix < 0x43300000) {		/* 1 <= |x| < 0x1p52 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
+		if (j0 < 20) {
+			ix &= ~(0x000fffff >> j0);
+			lx = 0;
+		} else {
+			lx &= ~((uint32_t)0xffffffff >> (j0 - 20));
+		}
+		INSERT_WORDS(x, ix, lx);
+
+		ax -= x;
+		if (ax == 0) {
+			ax = 0;
+		} else {
+			ax = __compute_sinpi(ax);
+			if (j0 > 30) x -= 0x1p30;
+			lx = (uint32_t)x;
+			if (lx & 1) ax = -ax;
+		}
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p52 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID?
+	 */
+	if (ax + 1 > 1)
+		ax = copysign(0, x);
+	return (ax);
+}
+
+#if LDBL_MANT_DIG == 53
+__weak_reference(sinpi, sinpil);
+#endif

Property changes on: lib/msun/src/s_sinpi.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/s_sinpif.c
===================================================================
--- lib/msun/src/s_sinpif.c	(nonexistent)
+++ lib/msun/src/s_sinpif.c	(working copy)
@@ -0,0 +1,105 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_sinpi.c for implementation details.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospif.c"
+#include "k_sinpif.c"
+
+static inline float
+__compute_sinpif(float x)
+{
+
+	if (x <= 0.5f) {
+		if (x <= 0.25f)
+			return (__kernel_sinpif(x));
+		return (__kernel_cospif(0.5f - x));
+	} else {
+		if (x <= 0.75f)
+			return (__kernel_cospif(x - 0.5f));
+		return (__kernel_sinpif(1 - x));
+	}
+}
+
+static const float pihi =  3.14160156e+00f;	/* 0x40491000 */
+static const float pilo = -8.90890988e-06f;	/* 0xb715777a */
+
+float
+sinpif(float x)
+{
+	volatile static const float vzero = 0;
+	float ax;
+	uint32_t hx, ix, j0;
+
+	GET_FLOAT_WORD(hx, x);
+	ix = hx & 0x7fffffff;
+	SET_FLOAT_WORD(ax, ix);
+
+	if (ix < 0x3f800000) {		/* |x| < 1 */
+		if (ix < 0x38000000) {	/* |x| < 0x1p-15 */
+			if (x == 0)
+				return (x);
+			SET_FLOAT_WORD(ax, (hx >> 13) << 13);
+			x -= ax;
+			return (pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		ax = __compute_sinpif(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	if (ix < 0x4b000000) {		/* 1 <= |x| < 0x1p23 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 23) & 0xff) - 0x7f;
+		ix &= ~(0x007fffff >> j0);
+		SET_FLOAT_WORD(x, ix);
+
+		ax -= x;
+		if (ax == 0) {
+			ax = 0;
+		} else {
+			ix = (uint32_t)x;
+			ax =  __compute_sinpif(ax);
+			if (ix & 1) ax = -ax;
+		}
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p23 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID?
+	 */
+	if (ax + 1 > 1)
+		ax = copysignf(0, x);
+	return (ax);
+}

Property changes on: lib/msun/src/s_sinpif.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/s_tanpi.c
===================================================================
--- lib/msun/src/s_tanpi.c	(nonexistent)
+++ lib/msun/src/s_tanpi.c	(working copy)
@@ -0,0 +1,224 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/**
+ * tanpi(x) computes tan(pi*x) without multiplication by pi (almost).  First,
+ * note that tanpi(-x) = -tanpi(x), so the algorithm considers only |x| and
+ * includes reflection symmetry by considering the sign of x on output.  The
+ * method used depends on the magnitude of x.
+ *
+ * 1. For small |x|, tanpi(x) = pi * x where a sloppy threshold is used.  The
+ *    threshold is |x| < 0x1pN with N = -(P/2+M).  P is the precision of the
+ *    floating-point type and M = 3 or 4.  To achieve high accuracy, pi is
+ *    decomposed into high and low parts with the high part containing a
+ *    number of trailing zero bits.  x is also split into high and low parts.
+ *
+ * 2. For |x| < 1, argument reduction is not required and tanpi(x) is 
+ *    computed by either a polynomial approximation or the definition of
+ *    tanpi in terms of sinpi and cospi.  For |x| in [0,0.25), a polynomial
+ *    approximation is used.  For |x| in (0.25, 0.5], tanpi(x) is computed
+ *    by tanpi(x) = cospi(0.5 - x) / sinpi(0.5 - x).
+ *
+ * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where
+ *    |x| = j0 + r with j0 an integer and the remainder r satisfies
+ *    0 <= r < 1.  With the given domain, a simplified inline floor(x)
+ *    is used.
+ *                                   tan(pi*j0) + tan(pi*r)
+ *    tanpi(x) = tan(pi*(j0+r)) = ---------------------------- = tanpi(r)
+ *                                 1 - tan(pi*j0) * tan(pi*r)
+ *
+ * 4. For |x| >= 0x1p(P-1), |x| is integral and tanpi(x) = copysign(0,x).
+ *
+ * 5. Special cases:
+ *
+ *    tanpi(+-0) = +-0
+ *    tanpi(+-n) = +-0, for positive integers n.
+ *    tanpi(+-n+1/4) = +-1, for positive integers n.
+ *    tanpi(+-n+1/2) = NaN, for positive integers n.
+ *    tanpi(+-inf) = nan.  Raises the "invalid" floating-point exception.
+ *    tanpi(nan) = nan.  Raises the "invalid" floating-point exception.
+ *
+ * Compute tanpi(x) = tan(pi*x) via a polynomial approximation.  Coefficients 
+ * were determined for the following function:
+ *
+ * f(x) = tan(pi*x) / x = t0 + t1*x**2 + t2*x**4 + ...
+ *
+ * with an Remes algorithm.  To accumulate the result with sufficient accuracy,
+ * the t0, t1, and t2 coefficients and final multiplications and additions
+ * needed to be handled specially.
+ *
+ * x2 = x * x
+ * T(x2) = t3 + t4*x2 + t5*x2**2 + ...
+ *       = t3 + (t4 + (t5 ...) * x2) * x2
+ * tanpi(x) = x * (t0 + x2*(t1 + x2*(t2 + x2*T(x2))))
+ *
+ * The final multiplications and additions in the above expression for
+ * tanpi(x) are done in an extra precision arithmetic.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospi.c"
+#include "k_sinpi.c"
+
+static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */
+static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */
+
+static const double 
+t00   =  3.1415926535897931e+00,	/* 0x400921fb 0x54442d18 */
+t00lo =  1.2267625636978282e-16,	/* 0x3ca1adf4 0x7b8ee56b */
+t01   =  1.0335425560099939e+01,	/* 0x4024abbc 0xe625be52 */
+t01lo = -6.0289590484334045e-16,	/* 0xbcc5b8bb 0xb4f3d850 */
+t02   =  4.0802624638040442e+01,	/* 0x404466bc 0x6775ac7c */
+t02lo = -3.1181465177803440e-15,	/* 0xbcec15f4 0xd355f491 */
+t03   =  1.6299995197351416e+02,	/* 0x40645fff 0x9b47fa0c */
+t04   =  6.5190975669363422e+02,	/* 0x40845f47 0x2e8473cf */
+t05   =  2.6075994007466215e+03,	/* 0x40a45f32 0xe4a797e0 */
+t06   =  1.0430393632420044e+04,	/* 0x40c45f32 0x628c115e */
+t07   =  4.1720374420664804e+04,	/* 0x40e45f0b 0xfb410bc9 */
+t08   =  1.6695720713150888e+05,	/* 0x41046169 0xa8349085 */
+t09   =  6.6429163117898861e+05,	/* 0x412445c7 0x4329e474 */
+t10   =  2.7802499382503941e+06,	/* 0x4145362c 0xf81896c3 */
+t11   =  7.9185101944778729e+06,	/* 0x415e34eb 0x8c725352 */
+t12   =  9.3691291294289947e+07,	/* 0x41965676 0x6d2d5a58 */
+t13   = -5.0585651903137898e+08,	/* 0xc1be26c2 0x07080874 */
+t14   =  6.8772398519624548e+09,	/* 0x41f99ea5 0xa2bf6637 */
+t15   = -3.3094464588263161e+10,	/* 0xc21ed255 0xd1310d7a */
+t16   =  1.1673794006564838e+11;	/* 0x423b2e1f 0x9a61a5fc */
+
+static inline double
+__kernel_tanpi(double x)
+{
+	double hi, lo, t, xhi, xlo;
+	uint32_t hx, lx;
+
+	if (x < 0.25) {
+#if HORNER
+		double x2;
+		/* Horner's method. */
+		x2 = x * x;
+		t = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2
+		    * (x2 * (x2 * t16 + t15) + t14) + t13) + t12) + t11)
+		    + t10) + t09) + t08) + t07) + t06) + t05) + t04;
+#else
+		double a, b, c, d, x2, x4, x6;
+		/* Sort of Estrin's method. */
+		x2 = x * x;
+		x4 = x2 * x2;
+		x6 = x4 * x2;
+		a = (t16 * x4 + t15 * x2 + t14) * x6;
+		b = (t13 * x4 + t12 * x2 + t11);
+		c = (t10 * x4 + t09 * x2 + t08) * x6;
+		d = (t07 * x4 + t06 * x2 + t05);
+		t = (a + b) * x6 * (x4 * x4) + (c + d) * x2 + t04;
+#endif
+
+		EXTRACT_WORDS(hx, lx, x);
+		INSERT_WORDS(xhi, hx, 0);
+		xlo = x - xhi;
+		xlo *= (xlo + xhi + xhi);
+		xhi *= xhi;
+
+		t = t02lo + x2 * (x2 * t + t03) + t02;
+		t = t01lo + xlo * t + xhi * t + t01;
+		t = t00lo + xlo * t + xhi * t + t00;
+		return (t * x);
+	} else if (x > 0.25) {
+		x = 0.5 - x;
+		t = __kernel_cospi(x) / __kernel_sinpi(x);
+	} else
+		t = 1;
+	return (t);
+}
+
+static inline double
+__compute_tanpi(double x)
+{
+
+	return (x < 0.5 ? __kernel_tanpi(x) : -__kernel_tanpi(1 - x));
+}
+
+double
+tanpi(double x)
+{
+	volatile static const double vzero = 0;
+	double ax;
+	uint32_t hx, ix, j0, lx;
+
+	EXTRACT_WORDS(hx, lx, x);
+	ix = hx & 0x7fffffff;
+	INSERT_WORDS(ax, ix, lx);
+
+	if (ix < 0x3ff00000) {		/* |x| < 1 */
+		if (ix < 0x3e200000) {	/* |x| < 0x1p-29 */
+			if (x == 0)
+				return (x);
+			INSERT_WORDS(ax, hx, 0);
+			x -= ax;
+			return (pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		if (ix == 0x3fe00000 && !lx)
+			return ((x - x) / (x - x));
+		ax = __compute_tanpi(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	if (ix < 0x43300000) {		/* 1 <= |x| < 0x1p52 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 20) & 0x7ff) - 0x3ff;
+		if (j0 < 20) {
+			ix &= ~(0x000fffff >> j0);
+			lx = 0;
+		} else {
+			lx &= ~(((uint32_t)(0xffffffff)) >> (j0 - 20));
+		}
+		INSERT_WORDS(x,ix,lx);
+
+		ax -= x;
+
+		if (ax == 0.5)
+			return ((x - x) / (x - x));
+
+		ax = ax == 0 ? 0 : __compute_tanpi(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p52 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID?
+	 */
+	if (ax + 1 > 1)
+		ax = copysign(0, x);
+	return (ax);
+}
+
+#if LDBL_MANT_DIG == 53
+__weak_reference(tanpi, tanpil);
+#endif

Property changes on: lib/msun/src/s_tanpi.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
Index: lib/msun/src/s_tanpif.c
===================================================================
--- lib/msun/src/s_tanpif.c	(nonexistent)
+++ lib/msun/src/s_tanpif.c	(working copy)
@@ -0,0 +1,141 @@
+/*-
+ * Copyright (c) 2017 Steven G. Kargl
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice unmodified, this list of conditions, and the following
+ *    disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
+ * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
+ * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT,
+ * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+ * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/*
+ * See ../src/s_tanpi.c for implementation details.
+ */
+
+#include "math.h"
+#include "math_private.h"
+#include "k_cospif.c"
+#include "k_sinpif.c"
+
+static const float pihi =  3.14160156e+00f;	/* 0x40491000 */
+static const float pilo = -8.90890988e-06f;	/* 0xb715777a */
+
+static const float
+t0   =  3.14159274e+00f,	/* 0x40490fdb */
+t0lo = -8.71231194e-08f,	/* 0xb3bb1871 */
+t1   =  1.03354244e+01f,	/* 0x41255de6 */
+t1lo =  3.83117083e-07f,	/* 0x34cdaf36 */
+t2   =  4.08029366e+01f,	/* 0x42233635 */
+t2lo = -1.28256147e-07f,	/* 0xb409b6c8 */
+t3   =  1.62950455e+02f,	/* 0x4322f351 */
+t3lo =  4.45962769e-06f,	/* 0x3695a3e9 */
+t4   =  6.55820984e+02f,	/* 0x4423f48b */
+t5   =  2.43586987e+03f,	/* 0x45183deb */
+t6   =  1.47717539e+04f,	/* 0x4666cf04 */
+t7   = -1.96741250e+04f,	/* 0xc699b440 */
+t8   =  5.87398688e+05f;	/* 0x490f686b */
+
+static inline float
+__kernel_tanpif(float x)
+{
+	float hi, lo, t, x2, xhi, xlo;
+	uint32_t ix;
+
+	if (x < 0.25f) {
+		x2 = x * x;
+		t = t3lo + x2 * (x2 * (x2 * (x2 * (x2 * t8 + t7) + t6) + t5)
+		    + t4) + t3;
+
+		GET_FLOAT_WORD(ix, x);
+		SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
+		xlo = x - xhi;
+		xlo *= (xlo + xhi + xhi);
+		xhi *= xhi;
+
+		t = t2lo + xlo * t + xhi * t + t2;
+		t = t1lo + xlo * t + xhi * t + t1;
+		t = t0lo + xlo * t + xhi * t + t0;
+		t *= x;
+	} else if (x > 0.25f) {
+		x = 0.5f - x;
+		t = __kernel_cospif(x) / __kernel_sinpif(x);
+	} else
+		t = 1;
+	return (t);
+}
+
+static inline float
+__compute_tanpif(float x)
+{
+
+	return (x < 0.5f ? __kernel_tanpif(x) : -__kernel_tanpif(1 - x));
+}
+
+float
+tanpif(float x)
+{
+	volatile static const float vzero = 0;
+	float ax;
+	uint32_t hx, ix, j0;
+
+	GET_FLOAT_WORD(hx, x);
+	ix = hx & 0x7fffffff;
+	SET_FLOAT_WORD(ax, ix);
+
+	if (ix < 0x3f800000) {		/* |x| < 1 */
+		if (ix < 0x38800000) {	/* |x| < 0x1p-14 */
+			if (x == 0)
+				return (x);
+			SET_FLOAT_WORD(ax, (hx >> 13) << 13);
+			x -= ax;
+			return (pilo * x + pihi * x + pilo * ax + pihi * ax);
+		}
+		if (ix == 0x3f000000)
+			return ((x - x) / (x - x));
+		ax = __compute_tanpif(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	if (ix < 0x4b000000) {		/* 1 <= |x| < 0x1p23 */
+		/* Determine integer part of ax. */
+		j0 = ((ix >> 23) & 0xff) - 0x7f;
+		ix &= ~(0x007fffff >> j0);
+		SET_FLOAT_WORD(x, ix);
+
+		ax -= x;
+
+		if (ax == 0.5f)
+			return ((x - x) / (x - x));
+
+		ax = ax == 0 ? 0 : __compute_tanpif(ax);
+		return ((hx & 0x80000000) ? -ax : ax);
+	}
+
+	/* x = +-inf or nan. */
+	if (ix >= 0x7f800000)
+		return (vzero / vzero);
+
+	/*
+	 * |x| >= 0x1p23 is always an integer, so return +-0.
+	 * FIXME: should this raise FE_INEXACT or FE_INVALID.
+	 */
+	if (ax + 1 > 1)
+		ax = copysignf(0, x);
+	return (ax);
+}

Property changes on: lib/msun/src/s_tanpif.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 18:40:26 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id A38CED53A4E;
 Fri, 28 Apr 2017 18:40:26 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au
 [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 523BBBBF;
 Fri, 28 Apr 2017 18:40:25 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 82D483C5E0A;
 Sat, 29 Apr 2017 04:40:18 +1000 (AEST)
Date: Sat, 29 Apr 2017 04:40:14 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: Bruce Evans <brde@optusnet.com.au>, freebsd-numerics@freebsd.org, 
 freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170428165658.GA17560@troutmask.apl.washington.edu>
Message-ID: <20170429035131.E3406@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=baPn34At2Hashz5YGC0A:9 a=-D1BXJ0ZKV1LJ0pY:21
 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 18:40:26 -0000

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
>> On Thu, 27 Apr 2017, Steve Kargl wrote:

>>> For those curious about testing, here are some numbers
>>> for the Estrin's method.  These were obtained on an AMD
>>> FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
>>> and the number in parentheses are estimated cycles.
>>>
>>>            |    cospi     |    sinpi     |    tanpi
>>> ------------+--------------+--------------+--------------
>>> float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
>>> double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
>>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
>>> ------------+--------------+--------------+--------------
>>>
>>> The timing tests are done on the interval [0,0.25] as
>>> this is the interval where the polynomial approximations
>>> apply.  Limited accuracy testing gives
>>
>> These still seem slow.  I did a quick test of naive evaluations of
>> these functions as standard_function(Pi * x) and get times a bit faster
>> on Haswell, except 2-4 times faster for the long double case, with the
>> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
>> is inaccurate, especially near multiples of Pi/2.
>
> The slowness comes from handling extra precision arithmetic.  For
> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> approximation and x2 = x * x.  One needs to take care in the ordering
> on the evaluation where x and x2 are split into high and low parts.
> See the source code posted in response to your other email.

See logl() for how to do this almost as well as possible using the 2sum
primitives.  Don't forget to test on i386 with i387, where you will need
lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t
as in my uncomitted logf() and logl()i or more relevantly clog() (cplex.c)
to avoid problems with extra precision (2sum requires this).

Calculating the polynomial as x * p(x) seems wrong.  That is too inaccurate
even for sin(x), and probably slower too.  __kernel_sin(x) uses
x + x2*S(x2), with complications except in the first quadrant needed to
improve accuracy.  The arrangement of the second term is dubious too,
even for sin(x).  It unimproves acuracy but is apparently used because
it is faster.  I would try S(x2) * x4 + x2 * (x * S3) + x, where
x2 * (x * S3) + x must all be evaluated in extra precision.

>>> x in [0,0.25]   |   tanpif   |   tanpi    |   tanpil
>>> -----------------+------------+------------+-------------
>>>         MAX ULP | 1.37954760 | 1.37300168 | 1.38800823
>>
>> Just use the naive evaluation to get similar errors in this
>> range.  It is probably faster too.
>
> I haven't checked tanpif, but naive evaluation is faster
> for sinpif(x).  But getting the worry answer fast seems
> to be counter to what one may expect from a math library.
> Consider the two naive implementations:

But the above is inaccurate too.

> float
> bde1(float x)
> {
>   return (sinf((float)M_PI * x));
> }

sinpif() with range reduction in float precision and multiplication
by M_PI and __kernel_sin() to finish off would be much faster and
more accurate than doing everything in float precision.  On i386/i387,
it could be arranged that naive users get M_PI and __kernel_sin()
almost automatically.  This is essentially what is already done for
sinf().  Similarly for sinpi() on arches with extra precision for double.

> float
> bde2(float x)
> {
>   return (sinf((float)(M_PI * x)));
> }
>
> float
> bde3(float x)
> {
>   return ((float)sin(M_PI * x));
> }
>
> x in [0,0.25]    |   sinpif    |    bde1(x)  |    bde2(x)   |   bde3(x)
> -----------------+-------------+-------------+--------------+------------
> Time usec (cyc)  | 0.0130 (54) | 0.0084 (35) | 0.0093 (38)  | 0.0109 (45)
>         MAX ULP | 0.80103672  | 1.94017851  | 1.46830785   | 0.5
> 1.0 < ULP        | 0           | 736496      | 47607        | 0
> 0.9 < ULP <= 1.0 | 0           | 563122      | 101162       | 0
> 0.8 < ULP <= 0.9 | 1           | 751605      | 386128       | 0
> 0.7 < ULP <= 0.8 | 727         | 942349      | 753647       | 0
> 0.6 < ULP <= 0.7 | 19268       | 1169719     | 1123518      | 0
>
> x in [3990,3992] |   sinpif    |   bde1(x)   |   bde2(x)   |   bde3(x)
> -----------------+-------------+-------------+-------------+------------
> Time usec (cyc)  | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87)
>         MAX ULP | 0.65193975  | 10458803.   | 6471461.    | 0.5
> 1.0 < ULP        | 0           | 16773117    | 16762878    | 0
> 0.9 < ULP <= 1.0 | 0           | 0           | 0           | 0
> 0.8 < ULP <= 0.9 | 0           | 0           | 0           | 0
> 0.7 < ULP <= 0.8 | 0           | 0           | 0           | 0
> 0.6 < ULP <= 0.7 | 19268       | 2047        | 2047        | 0
>
> So bde3(x) is best, but if you're going to use sin(M_PI*x) for
> float sinpif, then use sinpi((double)x) which is faster than bde3
> and just as accurate.

That is a naive version, with explicit pesssimation and destruction
of extra precision by casting to float.

A fast version would do return (__kernel_sindf(M_PI * x)) after
reducing the range of x (switch to +- and cosf for other quadrants).
Perhaps optimize the range reduction for small quadrant numbers for
sinf().  It is easier to subtract 1 from x than to subtract Pi/2
to adjust the quadrant, so this optimization might not be so useful.

Bruce

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 20:15:24 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B16CD549B6;
 Fri, 28 Apr 2017 20:15:24 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 170831269;
 Fri, 28 Apr 2017 20:15:24 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SKFMnx033229
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 28 Apr 2017 13:15:22 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SKFMZ8033228;
 Fri, 28 Apr 2017 13:15:22 -0700 (PDT) (envelope-from sgk)
Date: Fri, 28 Apr 2017 13:15:22 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428201522.GA32785@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429035131.E3406@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 20:15:24 -0000

On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
> 
> > On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote:
> >> On Thu, 27 Apr 2017, Steve Kargl wrote:
> >>> For those curious about testing, here are some numbers
> >>> for the Estrin's method.  These were obtained on an AMD
> >>> FX(tm)-8350 @ 4018.34-MHz.  The times are in microseconds
> >>> and the number in parentheses are estimated cycles.
> >>>
> >>>            |    cospi     |    sinpi     |    tanpi
> >>> ------------+--------------+--------------+--------------
> >>> float       | 0.0089 (37)  | 0.0130 (54)  | 0.0194 (80)
> >>> double      | 0.0134 (55)  | 0.0160 (66)  | 0.0249 (102)
> >>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393)
> >>> ------------+--------------+--------------+--------------
> >>>
> >>> The timing tests are done on the interval [0,0.25] as
> >>> this is the interval where the polynomial approximations
> >>> apply.  Limited accuracy testing gives
> >>
> >> These still seem slow.  I did a quick test of naive evaluations of
> >> these functions as standard_function(Pi * x) and get times a bit faster
> >> on Haswell, except 2-4 times faster for the long double case, with the
> >> handicaps of using gcc-3.3.3 and i386.  Of course, the naive evaluation
> >> is inaccurate, especially near multiples of Pi/2.
> >
> > The slowness comes from handling extra precision arithmetic.  For
> > sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> > approximation and x2 = x * x.  One needs to take care in the ordering
> > on the evaluation where x and x2 are split into high and low parts.
> > See the source code posted in response to your other email.
> 
> See logl() for how to do this almost as well as possible using the 2sum
> primitives.  Don't forget to test on i386 with i387, where you will need
> lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t
> as in my uncomitted logf() and logl()i or more relevantly clog() (cplex.c)
> to avoid problems with extra precision (2sum requires this).
> 
> Calculating the polynomial as x * p(x) seems wrong.

It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
point, one must do sinpi(x) = sin(pi*x) = x * p(x).

One could do, as is done with sin(x), 

f(x) = (sin(pi*x) - pi*x)/(pi*x)**3

then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
the complication that pi*x must be computed in extra precision.
For float, with the algorithm I have developed, one can simply
do everything in double and then cast the result to float.
In fact, my __kernel_sinpif was initially written in double
with a max ULP less than 0.55.  I then looked at what portions
of the polynomial could be computed in float without reducing
the 0.55.  This then reveals what needs to be handle specially.

But, what is one to do with double and long double functions?

> That is too inaccurate even for sin(x), and probably slower too.

I'm not sure why you're claiming that it is too inaccurate.
Let's look at sinpi(x) and sin(x) on (roughly) the same
interval to remove the different precisions involved in 
sinpif(x) and sinf(x).

This is my sinpi(x)
troutmask:kargl[221] ./testd -S -m 0.00 -M 0.25 -n 24
         MAX ULP: 0.79851085
    Total tested: 16777214
0.7 < ULP <= 0.8: 1075
0.6 < ULP <= 0.7: 22722

This is libm sin(x)
troutmask:kargl[224] ./testd -S -m 0.00 -M 0.7853981634 -n 24
         MAX ULP: 0.72922199
    Total tested: 16777214
0.7 < ULP <= 0.8: 36
0.6 < ULP <= 0.7: 12193

The numbers above really aren't too different, and someone
that actually understands numerical analysis might be able
to take my code and (marginally) improve the numbers.

> > float
> > bde1(float x)
> > {
> >   return (sinf((float)M_PI * x));
> > }
> 
> sinpif() with range reduction in float precision and multiplication
> by M_PI and __kernel_sin() to finish off would be much faster and
> more accurate than doing everything in float precision.  On i386/i387,

Multiplication by M_PI and __kernel_sin() are double precision
intermediates.  That works for float.  What are you going to
do for double and long double?  One of the benefits of writing
sinpif(x) in float precision is that the exacts same algorithm is
used in sinpi(x) and sinpil(x).  One of these can be exhaustively
tested.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 22:09:46 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 14850D547CD;
 Fri, 28 Apr 2017 22:09:46 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id B7D39913;
 Fri, 28 Apr 2017 22:09:45 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 437AA104C43F;
 Sat, 29 Apr 2017 08:09:42 +1000 (AEST)
Date: Sat, 29 Apr 2017 08:09:39 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: Bruce Evans <brde@optusnet.com.au>, freebsd-numerics@freebsd.org, 
 freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170428201522.GA32785@troutmask.apl.washington.edu>
Message-ID: <20170429070036.A4005@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=2JjXyQGd_aUmJn-FHg8A:9 a=SzcEAFoTqt-hgAuy:21
 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 22:09:46 -0000

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
>> On Fri, 28 Apr 2017, Steve Kargl wrote:
>>> ...
>>> The slowness comes from handling extra precision arithmetic.  For
>>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
>>> approximation and x2 = x * x.  One needs to take care in the ordering
>>> on the evaluation where x and x2 are split into high and low parts.
>>> See the source code posted in response to your other email.
>> ...
>> Calculating the polynomial as x * p(x) seems wrong.
>
> It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
> point, one must do sinpi(x) = sin(pi*x) = x * p(x).

I forgot that the linear term needs to be multiplied by Pi.

Actually, the problem is only in the notation and the above description.
s0 looks like a constant, but it is actually Pi*x in extra precision,
and no extra precision is needed or done for x2.

__kernel_cospi() seems to be buggy.  It splits x2, but doesn't use
extra precision for calculating x2.  Is extra precision really needed
for the x2 term?  It is like the x3 term for sin*(), with the critical
difference that the coeffient is much larger (-1/2 instead of -1/6
before multiplication by pi).  This requires extra accuracy for cos()
too, but not for the calculating the term.  I optimized this by changing
a slow fdlibm method to essentially 2sum.  The x2/2 term is inaccurate
but no significant further inaccuracies occur for adding terms.  You
have Pi*Pi*x2/2 which is more inaccurate.  I guess the extra half an
ulp accuracy here is too much.  You avoid this half an ulp using extra
precision, and seem to use lots of extra precision instead of the final
2sum step.  Probably much slower.

Here is my modified method:

 	x2 = x*x;	/* hopefully really don't need extra prec */
 	(hi, lo) = split(x2);
 	resid = S(x2);
 	hi = C2_hi * hi;
 	lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
 	lo = lo + resid;
#ifdef hopefully_not_needed
 	Re-split hi+lo if lo is too large. 
#endif
 	return 3sum(1, hi, lo)

3sum is in math_private.h and is supposed to be used instead of magic-
looking open code.  Here it would reduce to the same expressions as in
__kernel_sin().

 	1+hi = hi_1 + lo_1;	/* by hi-lo decomp. */
 	return (lo + lo_1 + hi_1);	/* safely add the lo's and "carry" */

> One could do, as is done with sin(x),
>
> f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
>
> then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
> the complication that pi*x must be computed in extra precision.

Isn't that what is done now?  Of course pi*x must be evaluated in
extra precision, but hopefully no more is needed.  The x3 term
would be much harder to do all in extra precision than the x2
term for cospi().

>> That is too inaccurate even for sin(x), and probably slower too.
>
> I'm not sure why you're claiming that it is too inaccurate.

Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision.
This gives at best an error of half an ulp for the part in parentheses
and another half an ulp for the multiplication.  The final step
must be an addition of a large term with a term at least 2 times
smaller so that the error can be controlled (this depends on 1
of the terms being exact).

> ...
>>> float
>>> bde1(float x)
>>> {
>>>   return (sinf((float)M_PI * x));
>>> }
>>
>> sinpif() with range reduction in float precision and multiplication
>> by M_PI and __kernel_sin() to finish off would be much faster and
>> more accurate than doing everything in float precision.  On i386/i387,
>
> Multiplication by M_PI and __kernel_sin() are double precision
> intermediates.  That works for float.

That's what I mean by naive code having a chance of working.  It
takes foot-shooting to reduce M_PI to float.  Unfortunately, the
API + ABI does reduce (M_PI * x) to float in sinf(M_PI * x) if
sinf() is extern.  But if sinf() is in hardware, the compiler
might not destroy the extra precision before or after using the
hardware.

> What are you going to
> do for double and long double?  One of the benefits of writing
> sinpif(x) in float precision is that the exacts same algorithm is
> used in sinpi(x) and sinpil(x).  One of these can be exhaustively
> tested.

That is about the only benefit of the float functions, so I resist
optimizing them using doubles.  It would be better to use extra
precision automatically, but not force it.

Bruce

From owner-freebsd-numerics@freebsd.org  Fri Apr 28 23:36:00 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2CA5DD55E45;
 Fri, 28 Apr 2017 23:36:00 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 146D41D75;
 Fri, 28 Apr 2017 23:36:00 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SNZqXN034900
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 28 Apr 2017 16:35:52 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SNZqtZ034899;
 Fri, 28 Apr 2017 16:35:52 -0700 (PDT) (envelope-from sgk)
Date: Fri, 28 Apr 2017 16:35:52 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170428233552.GA34580@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429070036.A4005@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 28 Apr 2017 23:36:00 -0000

On Sat, Apr 29, 2017 at 08:09:39AM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
> 
> > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> >>> ...
> >>> The slowness comes from handling extra precision arithmetic.  For
> >>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial
> >>> approximation and x2 = x * x.  One needs to take care in the ordering
> >>> on the evaluation where x and x2 are split into high and low parts.
> >>> See the source code posted in response to your other email.
> >> ...
> >> Calculating the polynomial as x * p(x) seems wrong.
> >
> > It starts life as the Remes approximation for p(x) = sin(pi*x) / x.
> > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x).  At some
> > point, one must do sinpi(x) = sin(pi*x) = x * p(x).
> 
> I forgot that the linear term needs to be multiplied by Pi.
> 
> Actually, the problem is only in the notation and the above description.
> s0 looks like a constant, but it is actually Pi*x in extra precision,
> and no extra precision is needed or done for x2.

I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.

static const float
s0hi =  3.14062500e+00f,	/* 0x40490000 */
s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
s2   =  2.55016255e+00f,	/* 0x402335dd */
s3   = -5.99202096e-01f,	/* 0xbf19654f */
s4   =  8.10018554e-02f;	/* 0x3da5e44d */

static inline float
__kernel_sinpif(float x)
{
	double s;
	float p, x2;
	x2 = x * x;
	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
	s = x * (x2 * (double)p + s0hi + s0lo);
	return ((float)s);
}

x2 and p are computed in float but is accumulates the final result
in double.  Using a higher precision works for float, but double
and long double becomes problematic. 

> __kernel_cospi() seems to be buggy.  It splits x2, but doesn't use
> extra precision for calculating x2.  Is extra precision really needed
> for the x2 term?

Yes.  The last part of __kernel_cospif() is a fused-multiple and
add.  For small |x2|, we can do a normal summation.  That's this
block of code

	GET_FLOAT_WORD(ix, x2);
	if (ix < 0x3c800000)		/* |x2| < 0x1p-6 */
		return (c0 + c * x2);

If |x2| > 0x1p-6, then c0 + c * x2 is computed with a FMA, but
its a little trickier.  This splits c in clo and chi, and adds
in a correction for c1

	GET_FLOAT_WORD(ix, c);
	SET_FLOAT_WORD(chi, (ix >> 14) << 14);
	clo = c - chi + c1lo;

In the final expression,

	c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0);

chi * c2hi is exact, and the parenthesis are required to achieve
max ULP < ~0.88.  The grouping of the other terms does not seem
to matter.

> It is like the x3 term for sin*(), with the critical
> difference that the coeffient is much larger (-1/2 instead of -1/6
> before multiplication by pi).  This requires extra accuracy for cos()
> too, but not for the calculating the term.  I optimized this by changing
> a slow fdlibm method to essentially 2sum.  The x2/2 term is inaccurate
> but no significant further inaccuracies occur for adding terms.  You
> have Pi*Pi*x2/2 which is more inaccurate.  I guess the extra half an
> ulp accuracy here is too much.  You avoid this half an ulp using extra
> precision, and seem to use lots of extra precision instead of the final
> 2sum step.  Probably much slower.
> 
> Here is my modified method:
> 
>  	x2 = x*x;	/* hopefully really don't need extra prec */
>  	(hi, lo) = split(x2);
>  	resid = S(x2);
>  	hi = C2_hi * hi;
>  	lo = C2_lo * hi + C1_hi * lo + C1_lo * lo;
>  	lo = lo + resid;
> #ifdef hopefully_not_needed
>  	Re-split hi+lo if lo is too large. 
> #endif
>  	return 3sum(1, hi, lo)
> 
> 3sum is in math_private.h and is supposed to be used instead of magic-
> looking open code.

I completely missed 3sum in math_private.h.  I did try using your
2sum and 2sumF macros, but did not recall seeing any difference
from my hand-rolled code.  I probably didn't spend enough time
looking into if I was using 2sum[F] properly.  I'll see if I can
adapt my code to use these macros.

>  Here it would reduce to the same expressions as in
> __kernel_sin().
> 
>  	1+hi = hi_1 + lo_1;	/* by hi-lo decomp. */
>  	return (lo + lo_1 + hi_1);	/* safely add the lo's and "carry" */
> 
> > One could do, as is done with sin(x),
> >
> > f(x) = (sin(pi*x) - pi*x)/(pi*x)**3
> >
> > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds
> > the complication that pi*x must be computed in extra precision.
> 
> Isn't that what is done now?  Of course pi*x must be evaluated in
> extra precision, but hopefully no more is needed.  The x3 term
> would be much harder to do all in extra precision than the x2
> term for cospi().

Well, yes, I supposed that's one way to look at it.  In the approximation
of sinpif(x) = x * (s0 + x2 * S(x2)), the s0 term is nearly pi.  The
s1 coefficient, as determined by Remes, is s1 = -5.16771269, which
happens to be close to -pi**3/6 = -5.16771278 (i.e, next term in
McClauren series of sin(pi*x) = pi*x - (pi**3/6)*x**3 + ....  So, the
Remes coefficients have absorbed the factors of pi.

> >> That is too inaccurate even for sin(x), and probably slower too.
> >
> > I'm not sure why you're claiming that it is too inaccurate.
> 
> Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision.
> This gives at best an error of half an ulp for the part in parentheses
> and another half an ulp for the multiplication.  The final step
> must be an addition of a large term with a term at least 2 times
> smaller so that the error can be controlled (this depends on 1
> of the terms being exact).

Ah, okay, I understand what you're getting at.  I played a
number of games with grouping intermediate results and trying
to accumulate high and low parts.  You're seeing the best
that I can do.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 00:59:26 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B511FD52424;
 Sat, 29 Apr 2017 00:59:26 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 7F5581ABD;
 Sat, 29 Apr 2017 00:59:26 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3T0xPk4037996
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Fri, 28 Apr 2017 17:59:25 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3T0xOBe037995;
 Fri, 28 Apr 2017 17:59:24 -0700 (PDT) (envelope-from sgk)
Date: Fri, 28 Apr 2017 17:59:24 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429005924.GA37947@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170428233552.GA34580@troutmask.apl.washington.edu>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 00:59:26 -0000

On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> 
> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> 
> static const float
> s0hi =  3.14062500e+00f,	/* 0x40490000 */
> s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
> s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
> s2   =  2.55016255e+00f,	/* 0x402335dd */
> s3   = -5.99202096e-01f,	/* 0xbf19654f */
> s4   =  8.10018554e-02f;	/* 0x3da5e44d */
> 
> static inline float
> __kernel_sinpif(float x)
> {
> 	double s;
> 	float p, x2;
> 	x2 = x * x;
> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> 	s = x * (x2 * (double)p + s0hi + s0lo);
> 	return ((float)s);
> }
> 

Well, starting with above and playing the splitting game
with x, x2, and p.  I've manage to reduce the max ULP in
the region testd.  Testing against MPFR with sin(pi*x)
computed in 5*24-bit precision gives

         MAX ULP: 0.73345101
    Total tested: 33554427
0.7 < ULP <= 0.8: 90
0.6 < ULP <= 0.7: 23948

Exhaustive testing with my older sinpi(x) as the reference
gives

./testf -S -m 0x1p-14 -M 0.25 -d -e
         MAX ULP: 0.73345101
    Total tested: 100663296
0.7 < ULP <= 0.8: 45
0.6 < ULP <= 0.7: 11977

The code is slightly slower than my current best kernel.
sinpif time is 0.0147 usec per call (60 cycles).

static inline float
__kernel_sinpif(float x)
{
	float p, phi, x2, x2hi, x2lo, xhi, xlo;
	uint32_t ix;

	x2 = x * x;
	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;

	GET_FLOAT_WORD(ix, p);
	SET_FLOAT_WORD(phi, (ix >> 14) << 14);

	GET_FLOAT_WORD(ix, x2);
	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);

	x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi;
	x2hi *= phi;

	GET_FLOAT_WORD(ix, x);
	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
	xlo = x - xhi;
	xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo);

	return (xlo + xhi * x2hi + xhi * s0hi);
}


-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 07:54:36 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4555BD55D2D;
 Sat, 29 Apr 2017 07:54:36 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au
 [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 9E8DBA21;
 Sat, 29 Apr 2017 07:54:35 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id C542AD69213;
 Sat, 29 Apr 2017 17:54:25 +1000 (AEST)
Date: Sat, 29 Apr 2017 17:54:21 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170429005924.GA37947@troutmask.apl.washington.edu>
Message-ID: <20170429151457.F809@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=nNk0bZpkwH5g55qFUBQA:9 a=tF56D53I6aLQB65K:21
 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 07:54:36 -0000

On Fri, 28 Apr 2017, Steve Kargl wrote:

> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>
>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.

Comments on this below.

This is all rather over-engineered.  Optimizing these functions is
unimportant comparing with finishing cosl() and sinl() and optimizing
all of the standard trig functions better, but we need correctness.
But I now see many simplifications and improvements:

(1) There is no need for new kernels.  The standard kernels already handle
extra precision using approximations like:

     sin(x+y) ~= sin(x) + (1-x*x/2)*y.

Simply reduce x and write Pi*x = hi+lo.  Then

     sin(Pi*x) = __kernel_sin(hi, lo, 1).

I now see how to do the extra-precision calculations without any
multiplications.  For Pi*x, suppose that x is near 1/4.  Choose a
nice constant value x0 as far below 1/4 as works and reduce x
further using this:

     x = x0 + (x - x0)		// plain '=' means exact
     y0 ~= Pi*x0 = hi0 + lo0	// in extra prec at compile time
     y ~= Pi*(x-x0)		// small, so doesn't need extra prec
     Pi*x ~= hi0 + lo0 + y = hi1 + lo1	// use 2sum
     sin(Pi*x) ~= __kernel_sin(hi1, lo1, 1).

but it is better to use a special polynomial approximation about y0 on y.
We have reduced x to the significantly smaller y, so the polynominal can
have fewer terms and not need any extra precision except for the additive
term.  In infinite precision, sin(y+y0) = S0 + S1*y + ...  Write
S0 = hi2+lo2 and choose x0 so that lo2 can be replaced by 0 without
significant loss of accuracy (see some exp2's for this method).  At
worst, lo2 is nonzero and we have to do just 1 extra-precision operation.
Otherwise, we have a normal polyomial evaluation.  Unlike for sin() near
0, there are even terms and and not nice fractional coefficients, but
since y is small there aren't so many terms.  It is important to avoid
needing extra precision for many of the coefficients.  Higher ones
don't need it since y is small, and S0 might not need it.

The Intel ia64 math library uses the reduction x = y+y0 at several
interval endpoints below Pi/4 for ordinary sin().  Previously, I
couldn't get this method to work as well as using the single large
interval [-Pi/4, Pi/4], since the even terms take a while to calculate
and I didn't understand the hi+lo decomposition methods so well.

Reminder on why cos(x) needs extra precison but sin(x) doesn't when
we use the brute force expansion up to Pi/4: we use cos(x) = 1 - x*x/2
+ ... and the only problem is to subtract these terms accurately enough.
There is a problem since Pi/4 is slightly larger than sqrt(1/2), so
x*x/2 can be slightly larger than 1/4.  If it were larger than 1/2,
then the subtraction would lose a full bit to cancelation, so extra
inaccuracy could be more than 1 ulp, but we want it to be less 0.5
ulps so that the combined inaccuracy is less than 1 ulp.  Similarly,
when x*x/2 is larger than 1/4 but below 1/2, the extra inaccuracy
can be larger than 0.5 ulps though smaller than 1 ulp.  Above 0.5
is still to high, so we need some extra precision.  This problem
doesn't affect sin(x) since the ratio of the terms is -x*x/6 so the
extra inaccuracy is 3 times smaller so below 1/3 ulps.

For cos(y+y0) ~= C0 + C1*y + C2*y*y, y must be small enough for the
extra inaccuracy from the first addition to be a little smaller tha
0.5 ulps.  This is not as easy as I was thinking, since the lineat
term is relatively large.  But there is no problem for cos(y+1/2):
cos(y+1/2) ~= 0.88 - 0.48*y, and the ratio of terms with y = Pi/4-1/2
is ~0.16.  However, sin(y+1/2) ~= 0.48 + 0.88*y, so the maximum
ratio of terms is ~0.53 -- too much.

>> static const float
>> s0hi =  3.14062500e+00f,	/* 0x40490000 */
>> s0lo =  9.67653585e-04f,	/* 0x3a7daa22 */
>> s1   = -5.16771269e+00f,	/* 0xc0a55de7 */
>> s2   =  2.55016255e+00f,	/* 0x402335dd */
>> s3   = -5.99202096e-01f,	/* 0xbf19654f */
>> s4   =  8.10018554e-02f;	/* 0x3da5e44d */
>>
>> static inline float
>> __kernel_sinpif(float x)
>> {
>> 	double s;
>> 	float p, x2;
>> 	x2 = x * x;
>> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>> 	s = x * (x2 * (double)p + s0hi + s0lo);
>> 	return ((float)s);
>> }

I don't like the style being completely different.  Why lower-case constants?
...

> Well, starting with above and playing the splitting game
> with x, x2, and p.  I've manage to reduce the max ULP in
> the region testd.  Testing against MPFR with sin(pi*x)
> computed in 5*24-bit precision gives

It already cheats by using (double)p, but is limited by using float to
caclulate p.  p should have an inaccuracy of not much more than 1 ulp,
and we don't care if this is a little over 1.  Then when we do the final
caclulations in either extra precision or using the double hack, we
don't get any more inaccuracy.  Extra precision is not very useful for
x2 * p -- it only increases the error by another half an ulp.  So errors
for that part are 0.5 ulps for x2, ~1 ulp for p and 0.5 for a float
multiplication -- 2.0 althogther -- and the the double multiplication
reduces this to 1.5.  This gets scaled down as usual by the term ratio
x*x/6 ~= 0.2.  0.2 of 1.5 = 0.3 is below the magic 0.5, but so is 0.2 of
2.0 = 0.4.  But I would prefer below the magic 0.25.

>         MAX ULP: 0.73345101
>    Total tested: 33554427
> 0.7 < ULP <= 0.8: 90
> 0.6 < ULP <= 0.7: 23948

You can even reverse this to calculate the maximum of the term ratio
fairly accuractely.  The rounding error is 0.233.  We had an error of
about 1.5 before the final addition.  The ratio 0.233/1.5 = 0.16 gives
the reduction ratio for the final addition.  It approximates the
term ratio.  (I'm surprised that it is lower than the term ratio
instead of higher.  In fact, I don't believe the 0.733.  According
to the term ratio, the worst case must be at least 0.800.)

> Exhaustive testing with my older sinpi(x) as the reference
> gives
>
> ./testf -S -m 0x1p-14 -M 0.25 -d -e
>         MAX ULP: 0.73345101
>    Total tested: 100663296
> 0.7 < ULP <= 0.8: 45
> 0.6 < ULP <= 0.7: 11977

Perhaps it really is 0.733.  I estimated an error of 1 ulp for p, but
its term ratio is much lower than for the first 2 terms -- 20 times
lower from the factorial in the denominator -- so 0.55 ulps would
be a better estimate.  This gives the estimate term_ratio * 1.05
= 0.21 extra ulps which matches 0.233 almost perfectly.  Now it seems
too perfect to be true -- I usually forget to add up all the half-
ulps in calculations like this.

> The code is slightly slower than my current best kernel.
> sinpif time is 0.0147 usec per call (60 cycles).

It still seems slow :-).  Full sin(x) takes 43 cycles on amd64
(freefall xeon) for x where its extra precision is used (iy != 0),
and calculating only Pi*x in extra precision shouldn't add much to
that, and the simpler range reduction should subtract a lot (sin(x)
takes 23 cycles below 2Pi where its range reduction is simple).

> static inline float
> __kernel_sinpif(float x)
> {
> 	float p, phi, x2, x2hi, x2lo, xhi, xlo;
> 	uint32_t ix;
>
> 	x2 = x * x;
> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
>
> 	GET_FLOAT_WORD(ix, p);
> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
>
> 	GET_FLOAT_WORD(ix, x2);
> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);

I expect that these GET/SET's are the slowest part.  They are quite fast
in float prec, but in double prec on old i386 CPUs compilers generate bad
code which can have penalties of 20 cycles per GET/SET.

Why the strange reduction?  The double shift is just a manual optimization
or pssimization (usually the latter) for clearing low bits.  Here it is
used to clear 14 low bits instead of the usual 12.  This is normally
written using just a mask of 0xffff0000, unless you want a different
number of bits in the hi terms for technical reasons.  Double precision
can benefit more from asymmetric splitting of terms since 53 is not
divisible by 2; 1 hi term must have less than 26.5 bits and the other term
can hold an extra bit.

Here is an example of fairly refined splitting and use of 2sum withit:

X float
X log10f(float x)
X {
X 	struct Float r;
X 	float_t hi, lo;
X 	float fhi;

hi+lo decompositions need float_t in general, for both correctness and
efficiency.

X 	int32_t hx;
X 
X 	DOPRINT_START(&x);
X 	k_logf(x, &r);
X 	if (!r.lo_set)
X 		RETURNP(r.hi);
X 	if (sizeof(float_t) > sizeof(float))
X 		RETURNP((invln10_lo + (float_t)invln10_hi) * (r.lo + r.hi));
X 	_2sumF(r.hi, r.lo);
X 	fhi = r.hi;
X 	GET_FLOAT_WORD(hx, fhi);
X 	SET_FLOAT_WORD(fhi, hx & 0xfffff000);

It is hard to do the splitting better than this.  This method subtly
masks bugs if float_t is not used.  On i386/i387 it forces much the
same slow loads and stores that are needed to discard any extra
precision when float_t is not used.

X 	hi = fhi;
X 	lo = r.lo + (r.hi - hi);

The code tries to optimize things with careful assignments between float_t
and float.  This has further subtleties from float_t being used.  When there
is extra precision, float_t should be different from float, and compilers
should not be broken enough to elide conversions between float and float_t.
The above 2 lines are basic splitting, and the subtleties are what is needed
for this to work.  struct Float is just hi and lo in a struct, each with
type float, but a further subtlety is that you want the compiler to keep
all variables in registers, with floats in extended precision in the
registers; then it must be arranged that hi+lo decompositions in registers
have no extra precision...

X 	RETURN2P(invln10_hi * hi,
X 	    (invln10_lo + invln10_hi) * lo + invln10_lo * hi);
X }

... but when the hi+lo values are used in expressions like this, we
want the extra precision back.  Here the hi*hi term is exact (that is
the point of the decomposition), and extra precision gives some free
extra accuracy for the rest of the calculation.

>
> 	x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi;
> 	x2hi *= phi;
>
> 	GET_FLOAT_WORD(ix, x);
> 	SET_FLOAT_WORD(xhi, (ix >> 14) << 14);
> 	xlo = x - xhi;
> 	xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo);
>
> 	return (xlo + xhi * x2hi + xhi * s0hi);
> }

Yet another splitting is going to be slow.  This one is necessary.  Are
the first 2 really necessary?  The above analysis might show that they
really aren't -- didn't we get an error of 0.733 ulps (which seems too
low) using both methods?

Otherwise, this code is very similar to my log10f().  k_logf() does
most of the work.  It evaluates logf(x) in extra precision.  log10f(x)
is just C*logf(x) in extra precision.  k_logf() arranges to return
the extra precision.  For sinpi*(), you need to scale the arg instead
of the result.  This is actually easier, since the arg is not in extra
precision and the extra precision is very easy to handle in the kernel
(and even already handled in standard kernels).

The above logf() runs acceptably fast using a kernel not especially
optimized for it.  Signficantly slower than my logf().  Only slightly
faster than fdlibm+cygnus log10f(), but it is much more accurate.

Bruce

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 10:49:25 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id DCCB4D5597F;
 Sat, 29 Apr 2017 10:49:25 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail110.syd.optusnet.com.au (mail110.syd.optusnet.com.au
 [211.29.132.97])
 by mx1.freebsd.org (Postfix) with ESMTP id 720191A07;
 Sat, 29 Apr 2017 10:49:24 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id B764110390F;
 Sat, 29 Apr 2017 20:19:27 +1000 (AEST)
Date: Sat, 29 Apr 2017 20:19:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
cc: Steve Kargl <sgk@troutmask.apl.washington.edu>, 
 freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170429151457.F809@besplex.bde.org>
Message-ID: <20170429194239.P3294@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=OsaYX3Fg_SImqZmlGUkA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 10:49:26 -0000

On Sat, 29 Apr 2017, Bruce Evans wrote:

> On Fri, 28 Apr 2017, Steve Kargl wrote:
>
>> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
>>> 
>>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
>
> Comments on this below.
>
> This is all rather over-engineered.  Optimizing these functions is
> unimportant comparing with finishing cosl() and sinl() and optimizing
> all of the standard trig functions better, but we need correctness.
> But I now see many simplifications and improvements:
>
> (1) There is no need for new kernels.  The standard kernels already handle
> extra precision using approximations like:
>
>    sin(x+y) ~= sin(x) + (1-x*x/2)*y.
>
> Simply reduce x and write Pi*x = hi+lo.  Then
>
>    sin(Pi*x) = __kernel_sin(hi, lo, 1).
>
> I now see how to do the extra-precision calculations without any
> multiplications.

But that is over-engineered too.

Using the standard kernels is easy and works well:

XX #include <float.h>
XX #include <math.h>
XX 
XX #include "math_private.h"
XX 
XX static const double
XX pi_hi = 3.1415926218032837e+00,	/* 0x400921fb 0x50000000 */
XX pi_lo = 3.1786509547050787e-08;	/* 0x3e6110b4 0x611a5f14 */
XX 
XX /* Only for |x| <= ~0.25 (missing range reduction. */
XX 
XX double
XX cospi(double x)
XX {
XX 	double_t hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_cos(hi, lo);
XX }
XX 
XX double
XX sinpi(double x)
XX {
XX 	double_t hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_sin(hi, lo, 1);
XX }
XX 
XX double
XX tanpi(double x)
XX {
XX 	double_t hi, lo;
XX 
XX 	hi = (float)x;
XX 	lo = x - hi;
XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
XX 	hi = pi_hi * hi;
XX 	_2sumF(hi, lo);
XX 	return __kernel_tan(hi, lo, 1);
XX }

I only did a sloppy accuracy test for sinpi().  It was 0.03 ulps less
accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
sin().

Efficiency is very good in some cases, but anomalous in others: all
times in cycles, on i386, on the range [0, 0.25]

athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
cos:   61-62                 44                 43
cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
sin:   59-60                 51                 37
sinpi: 67-68 (8 extra)       80                 42
tan:   136-172               93-195             67-94
tanpi: 144-187 (8-15 extra)  145-176            61-189

That was a throughput test.  Latency is not so good.  My latency test
doesn't use serializing instructions, but uses random args and the
partial serialization of making each result depend on the previous
one.

athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
cos:   84-85                 69                 79
cospi: 103-104 (19-21 extra) 117                94
sin:   75-76                 89                 77
sinpi: 105-106 (30 extra)    116                90
tan:   168-170               167-168            147
tanpi: 191-194 (23-24 extra) 191                154

This also indicates that the longest times for tan in the throughput
test are what happens when the function doesn't run in parallel with
itself.  The high-degree polynomial and other complications in tan()
are too complicated for much cross-function parallelism.

Anywyay, it looks like the cost of using the kernel is at most 8-9
in the parallel case and at most 30 in the serial case.  The extra-
precision code has about 10 dependent instructions, so it s is
doing OK to take 30.

Bruce

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 18:10:24 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8E065D55BE9;
 Sat, 29 Apr 2017 18:10:24 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 76320D7A;
 Sat, 29 Apr 2017 18:10:24 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TIAN63041484
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Sat, 29 Apr 2017 11:10:23 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TIAMGF041483;
 Sat, 29 Apr 2017 11:10:22 -0700 (PDT) (envelope-from sgk)
Date: Sat, 29 Apr 2017 11:10:22 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429181022.GA41420@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429151457.F809@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 18:10:24 -0000

On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
> On Fri, 28 Apr 2017, Steve Kargl wrote:
> 
> > On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>
> >> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> 
> Comments on this below.

Way too much information to digest in one reading.  I'll
answer the things that are easy ...

> >> 	x2 = x * x;
> >> 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> >> 	s = x * (x2 * (double)p + s0hi + s0lo);
> >> 	return ((float)s);
> >> }
> 
> I don't like the style being completely different.  Why lower-case constants?
> ...

Having teethed on musty FORTRAN code, I have an adversion to uppercase.
I also do not like camel case, which seems to plague modern languages.

> > Well, starting with above and playing the splitting game
> > with x, x2, and p.  I've manage to reduce the max ULP in
> > the region testd.  Testing against MPFR with sin(pi*x)
> > computed in 5*24-bit precision gives
> 
> It already cheats by using (double)p, but is limited by using float to
> caclulate p.

Yes, the above is cheating to the extent that the above code
started out as completely written in double.  This gives 
max ULP < 0.51.  I then started reducing the double to float
with the intent of finding when max ULP exceeds some chosen
threshold.  The above gave something like max ULP < 0.55.
I then moved the (double) cast to different terms/factors,
observed how the max ULP changed.  This then gives information
about what requires extra precision.

> > static inline float
> > __kernel_sinpif(float x)
> > {
> > 	float p, phi, x2, x2hi, x2lo, xhi, xlo;
> > 	uint32_t ix;
> >
> > 	x2 = x * x;
> > 	p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1;
> >
> > 	GET_FLOAT_WORD(ix, p);
> > 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
> >
> > 	GET_FLOAT_WORD(ix, x2);
> > 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
> 
> I expect that these GET/SET's are the slowest part.  They are quite fast
> in float prec, but in double prec on old i386 CPUs compilers generate bad
> code which can have penalties of 20 cycles per GET/SET.
> 
> Why the strange reduction?  The double shift is just a manual optimization
> or pssimization (usually the latter) for clearing low bits.  Here it is
> used to clear 14 low bits instead of the usual 12.  This is normally
> written using just a mask of 0xffff0000, unless you want a different
> number of bits in the hi terms for technical reasons.  Double precision
> can benefit more from asymmetric splitting of terms since 53 is not
> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
> can hold an extra bit.

Because I didn't think about using a mask.  :-)

It's easy to change 14 to 13 or 11 or ..., while I would 
need to write out zeros and one to come up with 0xffff8000,
etc.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 18:42:11 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B8BCDD567C2;
 Sat, 29 Apr 2017 18:42:11 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id A043F1D4B;
 Sat, 29 Apr 2017 18:42:11 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TIg9KM041637
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Sat, 29 Apr 2017 11:42:09 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TIg95N041636;
 Sat, 29 Apr 2017 11:42:09 -0700 (PDT) (envelope-from sgk)
Date: Sat, 29 Apr 2017 11:42:09 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429184209.GB41420@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429194239.P3294@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429194239.P3294@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 18:42:11 -0000

On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
> 
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>> 
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant comparing with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already handle
> > extra precision using approximations like:
> >
> >    sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >    sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
> 
> But that is over-engineered too.
> 
> Using the standard kernels is easy and works well:

As your code only works on the interval [0,0.25], I took
the liberty to use it as a __kernel_sinpi and __kernel_cospi.

> XX double
> XX cospi(double x)
> XX {
> XX 	double_t hi, lo;

If sizeof(double_t) indicates what I think it means,
This is slow on my Core2 duo (aka ia32 system).

> XX 	hi = (float)x;
> XX 	lo = x - hi;

This is the splitting I use in my double version versions
with hi and lo as simply double.

> XX 	lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
> XX 	hi = pi_hi * hi;
> XX 	_2sumF(hi, lo);
> XX 	return __kernel_cos(hi, lo);
> XX }
> XX 
> 
> I only did a sloppy accuracy test for sinpi().  It was 0.03 ulps less
> accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
> sin().
> 
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   61-62                 44                 43
> cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
> sin:   59-60                 51                 37
> sinpi: 67-68 (8 extra)       80                 42
> tan:   136-172               93-195             67-94
> tanpi: 144-187 (8-15 extra)  145-176            61-189
> 
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
> 
> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:   84-85                 69                 79
> cospi: 103-104 (19-21 extra) 117                94
> sin:   75-76                 89                 77
> sinpi: 105-106 (30 extra)    116                90
> tan:   168-170               167-168            147
> tanpi: 191-194 (23-24 extra) 191                154

I is unclear how you're making your measurements.   My timings
with my kernels compared to kernels based on your code:

      |   Bruce      |   Steve
------+--------------+--------------
sinpi | 0.0742 (148) | 0.0633 (126)
cospi | 0.0720 (144) | 0.0513 (102)

First number is microseconds per call and the (xxx) is the time*cpu_freq.

As far as over-engineering, for sinpi I find

sinpi            Bruce kernel    Steve kernel
         MAX ULP: 0.73021263     0.73955815
    Total tested: 33554431       33554431
0.7 < ULP <= 0.8: 154            280
0.6 < ULP <= 0.7: 27650          29197

cospi is much more interesting and as you state above more
difficult to get right.  I have reworked my kernel, yet,
but I find

cospi             Bruce kernel    Steve kernel
         MAX ULP: 0.78223389      0.89921787
    Total tested: 33554431        33554431
0.8 < ULP <= 0.9: 0               3262
0.7 < ULP <= 0.8: 9663            68305
0.6 < ULP <= 0.7: 132948          346214

Perhaps, using double_t would reduce my max ULP at the expense
of speed. 

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 19:09:39 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 19F17D56FC5;
 Sat, 29 Apr 2017 19:09:39 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au
 [211.29.132.246])
 by mx1.freebsd.org (Postfix) with ESMTP id A0FADD8A;
 Sat, 29 Apr 2017 19:09:38 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id A7D7342B17F;
 Sun, 30 Apr 2017 05:09:27 +1000 (AEST)
Date: Sun, 30 Apr 2017 05:09:26 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170429181022.GA41420@troutmask.apl.washington.edu>
Message-ID: <20170430042756.A862@besplex.bde.org>
References: <20170427231411.GA11346@troutmask.apl.washington.edu>
 <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429181022.GA41420@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=4e2EsQlSr8nDh46oqDgA:9 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 19:09:39 -0000

On Sat, 29 Apr 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
>> On Fri, 28 Apr 2017, Steve Kargl wrote:
> ...
>>> 	GET_FLOAT_WORD(ix, p);
>>> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
>>>
>>> 	GET_FLOAT_WORD(ix, x2);
>>> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
>>
>> I expect that these GET/SET's are the slowest part.  They are quite fast
>> in float prec, but in double prec on old i386 CPUs compilers generate bad
>> code which can have penalties of 20 cycles per GET/SET.
>>
>> Why the strange reduction?  The double shift is just a manual optimization
>> or pssimization (usually the latter) for clearing low bits.  Here it is
>> used to clear 14 low bits instead of the usual 12.  This is normally
>> written using just a mask of 0xffff0000, unless you want a different
>> number of bits in the hi terms for technical reasons.  Double precision
>> can benefit more from asymmetric splitting of terms since 53 is not
>> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
>> can hold an extra bit.
>
> Because I didn't think about using a mask.  :-)
>
> It's easy to change 14 to 13 or 11 or ..., while I would
> need to write out zeros and one to come up with 0xffff8000,
> etc.

Here are some examples of more delicate splittings from the uncommitted
clog*().  They are usually faster than GET/SET, but slower than converting
to lower precision as is often possible for double precision and ld128
only.  clog*() can't use the casting method since it needs to split in the
middle, and doesn't use GET/SET since it is slow.  It uses methods that
only work on args that are not too large or too small, and uses a GET
earlier to classify the arg size.

XX double prec:
XX 	double_t ax, ax2h, ax2l, t;
XX 
XX 	t = (double)(ax * (0x1p27 + 1));
XX 	axh = (double)(ax - t) + t;
XX 	axl = ax - axh;

Splitting in the middle is required to calculate ax*ax exactly.  Since the
middle bit is 26.5, splitting in the middle is impossible, but there are
tricks to make spitting at bit 26 or 27 work just as well.

Multipling by 0x1p27 shifts to a place where the lower bits are canceled
by the subtraction.  Adding 1 is a trick to create a extra (virtual) bit
needed for the exact multiplication.

At least ax needs to have the extra-precision type double_t so that casting
to double works.  Otherwise, this code would need STRICT_ASSIGN() to work
around compiler bugfreatures.  We assign back to double_t for efficiency.

XX float prec:
XX 	float_t ax, axh, axl, t;
XX 
XX 	t = (float)(ax * (0x1p12F + 1));
XX 	axh = (float)(ax - t) + t;
XX 	axl = ax - axh;

This is simpler because there is a middle.  I think adding 1 is not needed.

XX long double double prec:
XX #if LDBL_MANT_DIG == 64
XX #define	MULT_REDUX	0x1p32		/* exponent MANT_DIG / 2 rounded up */
XX #elif LDBL_MANT_DIG == 113
XX #define	MULT_REDUX	0x1p57
XX #else
XX #error "Unsupported long double format"
XX #endif
XX 
XX 	long double ax, axh, axl, t;
XX 
XX 	/* Calculate ax*ax and ay*ay exactly using Dekker's algorithm. */
XX 	t = (long double)(ax * (MULT_REDUX + 1));
XX 	axh = (long double)(ax - t) + t;
XX 	axl = ax - axh;

This is more unsimpler because the middle depends on a parameter and it
is fractional in some cases.  If adding 1 is not needed for the non-
fractional case, then  it should be added in MULT_REDUX in the fractional
case only.

Several functions including expl and e_rem_pio* use this REDUX method
with the parameter hard-coded in a different way, usually with a
slower STRICT_ASSIGN() and a special conversion to an integer or integer
divided by a power of 2.  The above is supposed to be more careful with
extra precision so that STRICT_ASSIGN() is not needed.

The calculation of axh may have problems with double rounding, but I think
this doesn't matter here -- whatever axh is, the rest ends up in axl and
we just need to ensure that both have 27 lower bits zero.

Bruce

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 19:38:30 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B3AD6D54D63;
 Sat, 29 Apr 2017 19:38:30 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
 [128.95.76.21])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client CN "troutmask", Issuer "troutmask" (not verified))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9C012AE;
 Sat, 29 Apr 2017 19:38:30 +0000 (UTC)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost [127.0.0.1])
 by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TJcTRx041989
 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO);
 Sat, 29 Apr 2017 12:38:29 -0700 (PDT)
 (envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
 by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TJcTNF041988;
 Sat, 29 Apr 2017 12:38:29 -0700 (PDT) (envelope-from sgk)
Date: Sat, 29 Apr 2017 12:38:29 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429193829.GA41964@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429181022.GA41420@troutmask.apl.washington.edu>
 <20170430042756.A862@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170430042756.A862@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 19:38:30 -0000

On Sun, Apr 30, 2017 at 05:09:26AM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Steve Kargl wrote:
> 
> > On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote:
> >> On Fri, 28 Apr 2017, Steve Kargl wrote:
> > ...
> >>> 	GET_FLOAT_WORD(ix, p);
> >>> 	SET_FLOAT_WORD(phi, (ix >> 14) << 14);
> >>>
> >>> 	GET_FLOAT_WORD(ix, x2);
> >>> 	SET_FLOAT_WORD(x2hi, (ix >> 14) << 14);
> >>
> >> I expect that these GET/SET's are the slowest part.  They are quite fast
> >> in float prec, but in double prec on old i386 CPUs compilers generate bad
> >> code which can have penalties of 20 cycles per GET/SET.
> >>
> >> Why the strange reduction?  The double shift is just a manual optimization
> >> or pssimization (usually the latter) for clearing low bits.  Here it is
> >> used to clear 14 low bits instead of the usual 12.  This is normally
> >> written using just a mask of 0xffff0000, unless you want a different
> >> number of bits in the hi terms for technical reasons.  Double precision
> >> can benefit more from asymmetric splitting of terms since 53 is not
> >> divisible by 2; 1 hi term must have less than 26.5 bits and the other term
> >> can hold an extra bit.
> >
> > Because I didn't think about using a mask.  :-)
> >
> > It's easy to change 14 to 13 or 11 or ..., while I would
> > need to write out zeros and one to come up with 0xffff8000,
> > etc.
> 
> Here are some examples of more delicate splittings from the uncommitted
> clog*().  They are usually faster than GET/SET, but slower than converting
> to lower precision as is often possible for double precision and ld128
> only.  clog*() can't use the casting method since it needs to split in the
> middle, and doesn't use GET/SET since it is slow.  It uses methods that
> only work on args that are not too large or too small, and uses a GET
> earlier to classify the arg size.

I didn't know about these other splitting methods.  Thanks for
pointing them out to me.  

I updated by k_sinpif.c to use the standard masking with 0xffff0000.
It has no effect on the timing on Core2 dou.  It did however effect
the max ULP.  With exhaustive testing in [0x1p-14,0.25] I now have 

         MAX ULP: 0.68287528
    Total tested: 100663296
0.6 < ULP <= 0.7: 5607

the older version with the shifts by 14 bits gives

         MAX ULP: 0.73345101
    Total tested: 100663296
0.7 < ULP <= 0.8: 45
0.6 < ULP <= 0.7: 11977

The value of 14 is a holdover from an earlier version.

Getting back to the use of float_t and double_t.  If one
wants the performance penalty, these then work well.  Changing
types to float_t in k_cospif.c, I find a slowdown of for cospif,
but I also find

         MAX ULP: 0.64679509
    Total tested: 1048576000
0.6 < ULP <= 0.7: 31598

with exhaustive testing in [0,0.25].

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow

From owner-freebsd-numerics@freebsd.org  Sat Apr 29 20:13:00 2017
Return-Path: <owner-freebsd-numerics@freebsd.org>
Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id E8B38D55A65;
 Sat, 29 Apr 2017 20:13:00 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au
 [211.29.132.249])
 by mx1.freebsd.org (Postfix) with ESMTP id 952EC16C1;
 Sat, 29 Apr 2017 20:13:00 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au
 [122.106.153.191])
 by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 2DB851041D4A;
 Sun, 30 Apr 2017 06:12:46 +1000 (AEST)
Date: Sun, 30 Apr 2017 06:12:43 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
cc: Bruce Evans <brde@optusnet.com.au>, freebsd-hackers@freebsd.org, 
 freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
In-Reply-To: <20170429184209.GB41420@troutmask.apl.washington.edu>
Message-ID: <20170430051230.X1000@besplex.bde.org>
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429194239.P3294@besplex.bde.org>
 <20170429184209.GB41420@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-CM-Score: 0
X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0
 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17
 a=kj9zAlcOel0A:10 a=ZDInmu-daXXT0OxnvG4A:9 a=dM07vABZqfTYfvwQ:21
 a=CjuIK1q_8ugA:10
X-BeenThere: freebsd-numerics@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "Discussions of high quality implementation of libm functions."
 <freebsd-numerics.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-numerics/>
List-Post: <mailto:freebsd-numerics@freebsd.org>
List-Help: <mailto:freebsd-numerics-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-numerics>, 
 <mailto:freebsd-numerics-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 29 Apr 2017 20:13:01 -0000

On Sat, 29 Apr 2017, Steve Kargl wrote:

> On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
>. ...
>> Using the standard kernels is easy and works well:
>
> As your code only works on the interval [0,0.25], I took
> the liberty to use it as a __kernel_sinpi and __kernel_cospi.

You already wrote the reduction part, and lgamma already has it to.

I just checked lgamma.  Its reduction part is short (especially
after your improvements to it).  Then it is sloppy and doesn't
caclulate Pi*x in extra precision.  I think this is justified since
lgamma generally has errors much larger than 1 ulp.  it does some
micro-optimizations involving minus signs.  There is some danger that
with the kernels inline, the compiler will generate large inline
code for multiple copies whose difference is just the placement of
a single instruction to do the negation.  If this is an optimization,
then it is small.

>
>> XX double
>> XX cospi(double x)
>> XX {
>> XX 	double_t hi, lo;
>
> If sizeof(double_t) indicates what I think it means,
> This is slow on my Core2 duo (aka ia32 system).

I doubt that it is what you think it is, since you neglect testing
on i386 :-).  It makes no difference on amd64, but is needed on i386
to optimized and give correct code.  However, double_t is broken with
clang on i386 with certain CFLAGS that give SSE.  Then clang produces
code that is incompatible with FLT_EVAL_METHOD and double_t.  This
makes clangs use of SSE a pessimization.  It dodn't seem to break
correctness (the result is the same as using long double instead of
double_t).

>> XX 	hi = (float)x;
>> XX 	lo = x - hi;
>
> This is the splitting I use in my double version versions
> with hi and lo as simply double.

It will work without double_t for the splitting.  double_t (or
lots of slow STRICT_ASSIGN()s is needed for 2sum.  2sum is like
the above except it casts to double instead of float, and compilers
intentionally break casts.

>> ...
>> Efficiency is very good in some cases, but anomalous in others: all
>> times in cycles, on i386, on the range [0, 0.25]
>>
>> athlon-xp, gcc-3.3           Haswell, gcc-3.3   Haswell, gcc-4.2.1
>> cos:   61-62                 44                 43
>> cospi: 69-71 (8-9 extra)     78 (anomalous...)  42 (faster to do more!)
>> ...
>
> I is unclear how you're making your measurements.   My timings
> with my kernels compared to kernels based on your code:

Basically by timing a for loop, but with many refinements to make sure
that I am timing the function and not the loop, or at least to make sure
that the loop timing doesn't change.  For example, on amd64 and i386,
the stack is aligned at 4K plus a smaller adjustable parameter since
otherwise the timing for long doubles is too variable.  Also, rebuilding
all the object files used with identical CFLAGS (usually -O
-march=better-than-native).  This probably makes my tests run 20-50
cycles per iteration faster (the simplest non-inline function takes 8
or 9 cycles and that takes running it in parallel with itself).

>      |   Bruce      |   Steve
> ------+--------------+--------------
> sinpi | 0.0742 (148) | 0.0633 (126)
> cospi | 0.0720 (144) | 0.0513 (102)
>
> First number is microseconds per call and the (xxx) is the time*cpu_freq.

The difference might be loop overhead or loss of parallelism.  IIRC, you
use something to serialize after every function.  I have 4 variations and
rarely use the serialization ones.

Or thie difference might be more that the kernels are not inlined.  I
forgot to inline them too.  I always rebuild the kernels for every test,
so they get compiled with non-default CFLAGS, but they don't get inlined
without #include statements.

> As far as over-engineering, for sinpi I find
>
> sinpi            Bruce kernel    Steve kernel
>         MAX ULP: 0.73021263     0.73955815
>    Total tested: 33554431       33554431
> 0.7 < ULP <= 0.8: 154            280
> 0.6 < ULP <= 0.7: 27650          29197

See, the kernels are already tuned almost as much as possible for
accuracy/efficiency tradeoffs.

> cospi is much more interesting and as you state above more
> difficult to get right.  I have reworked my kernel, yet,
> but I find
>
> cospi             Bruce kernel    Steve kernel
>         MAX ULP: 0.78223389      0.89921787
>    Total tested: 33554431        33554431
> 0.8 < ULP <= 0.9: 0               3262
> 0.7 < ULP <= 0.8: 9663            68305
> 0.6 < ULP <= 0.7: 132948          346214
>
> Perhaps, using double_t would reduce my max ULP at the expense
> of speed.

On i386, it should reduce the error at the expense of slowness.
On amd64, it makes no difference, but using long double should
reduct the error at the expense of speed.

Usually the reduction in the maximum error is only about 0.1 ulps
unless the algorithm uses extra precision intentionally.  This
is available on i386, but is not large enough to compensate for
the slowness.

My example of log10*() unintentionally gave an example of using
extra precision intentionally for efficiency.  Here it is for log10():

XX double
XX log10(double x)
XX {
XX 	struct Double r;
XX 	double_t hi, lo;
XX 
XX 	DOPRINT_START(&x);
XX 	k_log(x, &r);
XX 	if (!r.lo_set)
XX 		RETURNP(r.hi);
XX 	if (0 && sizeof(double_t) > sizeof(double))
XX 		RETURNP((invln10_lo + (double_t)invln10_hi) * (r.lo + r.hi));

This avoids the emulated extra-precision calculations of possible.  This
is under 'if (0)' since it is too hard to determine if extra precision is
available.  The sizeof() calculation doesn't tell you if it is.  On i386,
double_t is long double, but extra precision is not available under FreeBSD
unless the program used fpsetprec(FP_PE) to change the default of double
precision.

This is not under 'if (0)' for float precision.  Then it is assume that
the program doesn't use fpsetprec(FP_PS) to change the default to float
precision.

XX 	_2sumF(r.hi, r.lo);
XX 	hi = (float)r.hi;
XX 	lo = r.lo + (r.hi - hi);
XX 	RETURN2P(invln10_hi * hi,
XX 	    (invln10_lo + invln10_hi) * lo + invln10_lo * hi);
XX }

i386/i387, when used as designed, can execute this function much faster
than SSE by not executing the lines beginning with _2sumF().  These lines
take 20-40 cycles (latency) on modern SSE.  Throughput is better.

This would probably be a good optimization for amd64 too (using long
double instead of double_t for the extra precision).  The Intel ia64
math library uses this method of switching to extended precision often.
ia64 has a better FP design.  I don't know its details, but it seems
to combine the best features of i387 and SSE.  The FP registers should
have extra precision as on i387, but the instructions should make it
efficient to not use the extra precision for languages and programmers
that don't want it.

Bruce