From owner-freebsd-numerics@freebsd.org Sun Apr 23 18:40:47 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 110DCD4DF8A for ; Sun, 23 Apr 2017 18:40:47 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 00A9A1F97 for ; Sun, 23 Apr 2017 18:40:47 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIeki6020766 for ; Sun, 23 Apr 2017 18:40:46 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-numerics@FreeBSD.org Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc. Date: Sun, 23 Apr 2017 18:40:47 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: bin X-Bugzilla-Version: 8.3-STABLE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: jbeich@FreeBSD.org X-Bugzilla-Status: In Progress X-Bugzilla-Resolution: X-Bugzilla-Priority: Normal X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: attachments.mimetype Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Apr 2017 18:40:47 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206 Jan Beich changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #126474|text/x-csrc |text/plain mime type| | --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-numerics@freebsd.org Sun Apr 23 18:40:52 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EB0F6D4DFA0 for ; Sun, 23 Apr 2017 18:40:52 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DAA9E1FB2 for ; Sun, 23 Apr 2017 18:40:52 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIeqES020905 for ; Sun, 23 Apr 2017 18:40:52 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-numerics@FreeBSD.org Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc. Date: Sun, 23 Apr 2017 18:40:52 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: bin X-Bugzilla-Version: 8.3-STABLE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: jbeich@FreeBSD.org X-Bugzilla-Status: In Progress X-Bugzilla-Resolution: X-Bugzilla-Priority: Normal X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: attachments.mimetype Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Apr 2017 18:40:53 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206 Jan Beich changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #126475|text/x-csrc |text/plain mime type| | --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-numerics@freebsd.org Sun Apr 23 18:40:58 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 913D3D4DFB7 for ; Sun, 23 Apr 2017 18:40:58 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 80EAC1FD1 for ; Sun, 23 Apr 2017 18:40:58 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id v3NIewn1021065 for ; Sun, 23 Apr 2017 18:40:58 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-numerics@FreeBSD.org Subject: [Bug 170206] [msun] [patch] complex arcsinh, log, etc. Date: Sun, 23 Apr 2017 18:40:58 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: bin X-Bugzilla-Version: 8.3-STABLE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: jbeich@FreeBSD.org X-Bugzilla-Status: In Progress X-Bugzilla-Resolution: X-Bugzilla-Priority: Normal X-Bugzilla-Assigned-To: freebsd-numerics@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: attachments.mimetype Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 23 Apr 2017 18:40:58 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D170206 Jan Beich changed: What |Removed |Added ---------------------------------------------------------------------------- Attachment #126478|text/x-csrc |text/plain mime type| | --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-numerics@freebsd.org Mon Apr 24 23:55:39 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D117ED4EB18; Mon, 24 Apr 2017 23:55:39 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 9E412A4E; Mon, 24 Apr 2017 23:55:39 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3ONRhmP054815 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Mon, 24 Apr 2017 16:27:43 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3ONRhSI054814; Mon, 24 Apr 2017 16:27:43 -0700 (PDT) (envelope-from sgk) Date: Mon, 24 Apr 2017 16:27:43 -0700 From: Steve Kargl To: freebsd-numerics@freebsd.org Cc: freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170424232743.GA54621@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170413171248.GA86780@troutmask.apl.washington.edu> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 24 Apr 2017 23:55:39 -0000 On Thu, Apr 13, 2017 at 10:12:48AM -0700, Steve Kargl wrote: > On Sun, Apr 09, 2017 at 03:08:09PM -0700, Steve Kargl wrote: > > Both IEEE-754 2008 and ISO/IEC TS 18661-4 define the half-cycle > > trignometric functions cospi, sinpi, and tanpi. The attached > > patch implements cospi[fl], sinpi[fl], and tanpi[fl]. Limited > > testing on the cospi and sinpi reveal a max ULP less than 0.89; > > while tanpi is more problematic with a max ULP less than 2.01 > > in the interval [0,0.5]. The algorithms used in these functions > > are documented in {ks}_cospi.c, {ks}_sinpi.c, and s_tanpi.c. > > > > Note 1. ISO/IEC TS 18661-4 says these funstions are guarded by > > a predefine macro. I have no idea or interest in what clang and > > gcc do with regards to this macro. I've put the functions behind > > __BSD_VISIBLE. > > > > Note 2. I no longer have access to a system with ld128 and > > adequate support to compile and test the ld128 implementations > > of these functions. Given the almost complete lack of input from > > others on improvements to libm, I doubt that anyone cares. If > > someone does care, the ld128 files contain a number of FIXME comments, > > and in particular, while the polynomial coefficients are given > > I did not update the polynomial algorithms to properly use the > > coefficients. > > > > The code is attached the bug reportr. > > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514 > > > > While everyone is busy reviewing and testing the patch available > in bugzilla, I suspect some may be wondering about the inverse > half-cycle trignometric functions. I have worked out an algorithm > for asinpi[fl] and have a working implemenation of asinpif(x). > It will take a couple of weeks (due to limited available time) > before I can submit asinpi[fl], acospi[fl], and atanpi[fl], but > work is in progress. > I have what appears to be working versions of asinpi[fl]. It was suggested elsewhere that using an Estrin's method to sum the polynomial approximations instead of Horner's method may allow modern CPUs to better schedule generated code. I have implemented an Estrin's-like method for sinpi[l] and cospi[l], and indeed the generated code is faster on my Intel core2 duo with only a slight degradation in the observed max ULP. I'll post new versions to bugzilla in the near future. -- Steve 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Tue Apr 25 05:30:44 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5C6A0D4F358; Tue, 25 Apr 2017 05:30:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id BB1E81D90; Tue, 25 Apr 2017 05:30:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 40EE142B512; Tue, 25 Apr 2017 15:13:45 +1000 (AEST) Date: Tue, 25 Apr 2017 15:13:44 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170424232743.GA54621@troutmask.apl.washington.edu> Message-ID: <20170425143120.R933@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=VIhQITioF-Sxmys2TWEA:9 a=s4eJatwz83rb22bD:21 a=v3ws3rOyY_mnV_cf:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Apr 2017 05:30:44 -0000 On Mon, 24 Apr 2017, Steve Kargl wrote: > I have what appears to be working versions of asinpi[fl]. It > was suggested elsewhere that using an Estrin's method to > sum the polynomial approximations instead of Horner's method > may allow modern CPUs to better schedule generated code. > I have implemented an Estrin's-like method for sinpi[l] > and cospi[l], and indeed the generated code is faster on > my Intel core2 duo with only a slight degradation in the > observed max ULP. I'll post new versions to bugzilla in > the near future. I didn't remember Estrin, but checked Knuth and see that his method is a subset of what I use routinely. I might have learned it from Knuth. I tried all methods in Knuth, but they didn't seem to be any better at first. Later I learned more about scheduling. Estrin's method now seems routine as a subset of scheduling. You already used it in expl: for ld80: X x2 = x * x; X x4 = x2 * x2; Unlike in Knuth's description of Estrin, we don't try to order the operations to suit the CPU(s) in the code, but just try to keep them independent, as required for them to execute independently. Compilers will rearrange them anyway, sometimes even correctly, but it makes little difference (on OOE CPUs) to throw all of the operations into the pipeline and see how fast something comes out). X q = x4 * (x2 * (x4 * X /* X * XXX the number of terms is no longer good for X * pairwise grouping of all except B3, and the X * grouping is no longer from highest down. X */ X (x2 * B12 + (x * B11 + B10)) + X (x2 * (x * B9 + B8) + (x * B7 + B6))) + X (x * B5 + B4.e)) + x2 * x * B3.e; This is arranged to look a bit like Horner's method at first (operations not written in separate statements), but it is basically Estrin's method with no extensions to more complicated sub-expressions than Estrin's, but complications for accuracy: - I think accuracy for the B3 term is unimportant, but for some reason (perhaps just the one in the comment) it is grouped separately. Estrin might use x3 * (B3.e + x * B4.e). - accuracy is important for B0 through B2 terms, and we don't use Estrin for them. Efficiency is a matter of running the above in parallel with the evaluation of lower terms. The above shouldn't steal any resources from evaluation of the lower terms unless it is the slow part. Normally you want N (2 independent streams to complete at the same time ready for a final addition). The above also does some power-of-2 grouping which has been messed up by changing the number of terms in the polynomial (see the comment). Generally this costs time time of 1 operation for each missed optimization. The above hasn't been rearranged since it is a lot of work to keep rearranging it during development and development stalled after committing it. The ld128 cases has larger problems from this. It has 2 sets of polynomials, both higher order, with sub-optimal numbers of terms. Here is my program for testing the results of quadratic grouping on x86. It shows little qualitative difference between Athlon-XP and Haswell. Haswell is just faster: XX #include XX XX static __inline void XX cpuid(int ax) XX { XX __asm __volatile("cpuid" : "=a" (ax) : "0" (ax) : "cx", "dx", "bx"); XX } XX #define puid(ax) XX XX double P0, P1, P2, P3; XX double P4, P5, P6, P7; XX double P8, P9, Pa, Pb; XX double Pc, Pd, Pe, Pf; XX double Q0, Q1, Q2, Q3; XX double Q4, Q5, Q6, Q7; XX double Q8, Q9, Qa, Qb; XX double Qc, Qd, Qe, Qf; XX volatile double a, r; XX XX #define FREQ 2010168298 /* sysctl -n machdep.tsc_freq */ XX XX #define NITER (FREQ / 10) XX XX int XX main(int argc, char **argv) XX { XX double x; XX int i, n; XX XX n = (argc == 1 ? 0 : atoi(argv[1])); XX switch (n) { XX default: XX for (i = 0; i < NITER; i++) { XX cpuid(0); XX } XX break; XX case 2: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = x*P1+P0; XX cpuid(0); XX } XX break; XX case -4: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = P0+x*(P1+x*(P2+x*P3)); XX cpuid(0); XX } XX break; XX case 4: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = (x*P3+P2)*(x*x)+(x*P1+P0); XX cpuid(0); XX } XX break; XX case 5: XX for (i = 0; i < NITER; i++) { XX x = a; XX double t32 = x*P3+P2; XX double t10 = x*P1+P0; XX double x2 = x*x; XX r = t32*x2+t10; XX cpuid(0); XX } XX break; XX case -8: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*P7)))))); XX cpuid(0); XX } XX break; XX case 6: XX for (i = 0; i < NITER; i++) { XX x = a; XX double lo = 0.0; XX r = ((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) + XX ((x*P3+P2)*(x*x)+lo); XX cpuid(0); XX } XX break; XX case 7: XX for (i = 0; i < NITER; i++) { XX double d = a; XX double lo = 0.0; XX double t = d*P7+P6; XX double z = d*d; XX double s = d*P5+P4; XX double rl = d*P3+P2; XX #if 0 XX r = lo + z * rl + z * z * s + z * z * (z * t); XX #elif 0 XX r = lo + z * z * (z * t) + z * z * s + z * rl; XX #elif 0 XX r = z * z * (z * t) + (z * z * s + lo) + z * rl; XX #else XX r = z * z * (z * t) + z * z * s + (z * rl + lo); XX #endif XX cpuid(0); XX } XX break; XX case 8: XX for (i = 0; i < NITER; i++) { XX cpuid(0); XX } XX break; XX case 9: XX for (i = 0; i < NITER; i++) { XX x = a; XX double t76 = x*P7+P6; XX double t54 = x*P5+P4; XX double t32 = x*P3+P2; XX double x2 = x*x; XX r = ((t76 )*x2+(t54 ))*(x2*x2) + XX ((t32 )*x2+(x*P1+P0)); XX cpuid(0); XX } XX break; XX case 10: XX for (i = 0; i < NITER; i++) { XX x = a; XX double t76 = x*P7+P6; XX double t32 = x*P3+P2; XX double t54 = x*P5+P4; XX double x2 = x*x; XX r = ((t76 )*x2+(t54 ))*(x2*x2) + XX ((t32 )*x2+(x*P1+P0)); XX cpuid(0); XX } XX break; XX case 11: XX for (i = 0; i < NITER; i++) { XX x = a; XX double t76 = x*P7+P6; XX double t54 = x*P5+P4; XX double t32 = x*P3+P2; XX double x2 = x*x; XX double t7654 = t76*x2+t54; XX double t3210 = t32*x2+(x*P1+P0); XX r = t7654*(x2*x2) + t3210; XX cpuid(0); XX } XX break; XX case -16: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf))))))))))))))); XX cpuid(0); XX } XX break; XX case 16: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = (((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) + XX ((x*Pb+Pa)*(x*x)+(x*P9+P8)))* XX (((x*x)*(x*x))*((x*x)*(x*x))) + XX (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) + XX ((x*P3+P2)*(x*x)+(x*P1+P0))); XX cpuid(0); XX } XX break; XX case 17: XX for (i = 0; i < NITER; i++) { XX x = a; XX double tfe = x*Pf+Pe; XX double tba = x*Pb+Pa; XX double t76 = x*P7+P6; XX double t32 = x*P3+P2; XX double x2 = x*x; XX double tdc = x*Pd+Pc; XX double t98 = x*P9+P8; XX double t54 = x*P5+P4; XX r = (((tfe )*x2+(tdc ))*(x2*x2) + XX ((tba )*x2+(t98 )))*((x2*x2)*(x2*x2)) + XX (((t76 )*x2+(t54 ))*(x2*x2) + XX ((t32 )*x2+(x*P1+P0))); XX cpuid(0); XX } XX break; XX case 18: XX for (i = 0; i < NITER; i++) { XX x = a; XX double tfe = x*Pf+Pe; XX double tba = x*Pb+Pa; XX double t76 = x*P7+P6; XX double tdc = x*Pd+Pc; XX double x2 = x*x; XX double t32 = x*P3+P2; XX double t98 = x*P9+P8; XX double t54 = x*P5+P4; XX double t10 = x*P1+P0; XX double x4 = x2*x2; XX double tfedc = x2*tfe+tdc; XX double tba98 = x2*tba+t98; XX double t7654 = x2*t76+t54; XX double t3210 = x2*t32+t10; XX double tfedcba98 = x4*tfedc+tba98; XX double t76543210 = x4*t7654+t3210; XX r = (x4*x4)*tfedcba98+t76543210; XX cpuid(0); XX } XX break; XX case -32: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = P0+x*(P1+x*(P2+x*(P3+x*(P4+x*(P5+x*(P6+x*(P7+x*(P8+x*(P9+x*(Pa+x*(Pb+x*(Pc+x*(Pd+x*(Pe+x*(Pf+x*(Q0+x*(Q1+x*(Q2+x*(Q3+x*(Q4+x*(Q5+x*(Q6+x*(Q7+x*(Q8+x*(Q9+x*(Qa+x*(Qb+x*(Qc+x*(Qd+x*(Qe+x*(Qf))))))))))))))))))))))))))))))); XX cpuid(0); XX } XX break; XX case 32: XX for (i = 0; i < NITER; i++) { XX x = a; XX r = ((((x*Qf+Qe)*(x*x)+(x*Qd+Qc))*((x*x)*(x*x)) + XX ((x*Qb+Qa)*(x*x)+(x*Q9+Q8)))* XX (((x*x)*(x*x))*((x*x)*(x*x))) + XX (((x*Q7+Q6)*(x*x)+(x*Q5+Q4))*((x*x)*(x*x)) + XX ((x*Q3+Q2)*(x*x)+(x*Q1+Q0))))* XX ((((x*x)*(x*x))*((x*x)*(x*x)))* XX (((x*x)*(x*x))*((x*x)*(x*x)))) + XX ((((x*Pf+Pe)*(x*x)+(x*Pd+Pc))*((x*x)*(x*x)) + XX ((x*Pb+Pa)*(x*x)+(x*P9+P8)))* XX (((x*x)*(x*x))*((x*x)*(x*x))) + XX (((x*P7+P6)*(x*x)+(x*P5+P4))*((x*x)*(x*x)) + XX ((x*P3+P2)*(x*x)+(x*P1+P0)))); XX cpuid(0); XX } XX break; XX case 33: XX for (i = 0; i < NITER; i++) { XX x = a; XX double Tfe = x*Qf+Qe; XX double tfe = x*Pf+Pd; XX double T76 = x*Q7+Q6; XX double t76 = x*P7+P6; XX double x2 = x*x; XX r = ((((Tfe )*x2+(x*Qd+Qc))*(x2*x2) + XX ((x*Qb+Qa)*x2+(x*Q9+Q8)))*((x2*x2)*(x2*x2)) + XX (((T76 )*x2+(x*Q5+Q4))*(x2*x2) + XX ((x*Q3+Q2)*x2+(x*Q1+Q0))))*(((x2*x2)*(x2*x2))*((x2*x2)*(x2*x2))) + XX ((((tfe )*x2+(x*Pd+Pc))*(x2*x2) + XX ((x*Pb+Pa)*x2+(x*P9+P8)))*((x2*x2)*(x2*x2)) + XX (((t76 )*x2+(x*P5+P4))*(x2*x2) + XX ((x*P3+P2)*x2+(x*P1+P0)))); XX cpuid(0); XX } XX break; XX } XX return (0); XX } It only tests power-of-2 number of terms. With say 6 or 7 terms, it is harder to run much faster than with 8 terms, but easy to try all other resonable groupings. With 8-15 terms, it is harder to try enough other groupings. It tests using temporary variables to try to get the best order. This makes a little difference with gcc-3.3, less than that with gcc-4.2 and no difference with clang. Looking at the generated code can be instructive. The compiler sometimes finds a good order that you might not think of. Also, you can see where it is constrained by dependencies. Modern x86 (SSE-mumble or AVX-mumble) have the basic Estrin operation of B0 + x * B1 as a single operation (v*fma* in AVX). Basic Estrin would reduce to Horner if this were used. But the C standard doesn't really allow using it, and compilers don't use it on x86. gcc did use it on ia64. It also takes unportable CFLAGS to get high SSE/AVX support, and gives unportable binaries when used. Don't use the libm fma since that tends to be 10-100 times slower than not using it (perhaps only 10 times slower if it claims to be fast). This libm fma is only useful for algorithm development and cases where efficiency doesn't matter. Fma is about half an ulp more accurate than separate operations. Bruce From owner-freebsd-numerics@freebsd.org Tue Apr 25 06:36:45 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 37F36D4F513; Tue, 25 Apr 2017 06:36:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2199F1D18; Tue, 25 Apr 2017 06:36:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3P6ah72046094 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3P6ahPV046093; Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk) Date: Mon, 24 Apr 2017 23:36:43 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170425063643.GA45906@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu> <20170425143120.R933@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170425143120.R933@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Apr 2017 06:36:45 -0000 On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote: > On Mon, 24 Apr 2017, Steve Kargl wrote: > > > I have what appears to be working versions of asinpi[fl]. It > > was suggested elsewhere that using an Estrin's method to > > sum the polynomial approximations instead of Horner's method > > may allow modern CPUs to better schedule generated code. > > I have implemented an Estrin's-like method for sinpi[l] > > and cospi[l], and indeed the generated code is faster on > > my Intel core2 duo with only a slight degradation in the > > observed max ULP. I'll post new versions to bugzilla in > > the near future. > > I didn't remember Estrin, but checked Knuth and see that his > method is a subset of what I use routinely. I might have learned > it from Knuth. I tried all methods in Knuth, but they didn't seem > to be any better at first. Later I learned more about scheduling. > Estrin's method now seems routine as a subset of scheduling. I think I came across this in Knuth as well, but I may have picked of the name Estrin from Mueller's book (or google :). > > You already used it in expl: for ld80: > I recall that that is your handy work. > X x2 = x * x; > X x4 = x2 * x2; > > Unlike in Knuth's description of Estrin, we don't try to order the > operations to suit the CPU(s) in the code, but just try to keep them > independent, as required for them to execute independently. Compilers > will rearrange them anyway, sometimes even correctly, but it makes > little difference (on OOE CPUs) to throw all of the operations into > the pipeline and see how fast something comes out). > > X q = x4 * (x2 * (x4 * > X /* > X * XXX the number of terms is no longer good for > X * pairwise grouping of all except B3, and the > X * grouping is no longer from highest down. > X */ > X (x2 * B12 + (x * B11 + B10)) + > X (x2 * (x * B9 + B8) + (x * B7 + B6))) + > X (x * B5 + B4.e)) + x2 * x * B3.e; > > This is arranged to look a bit like Horner's method at first (operations > not written in separate statements), but it is basically Estrin's method > with no extensions to more complicated sub-expressions than Estrin's, > but complications for accuracy: > - I think accuracy for the B3 term is unimportant, but for some reason > (perhaps just the one in the comment) it is grouped separately. Estrin > might use x3 * (B3.e + x * B4.e). Yes, it is most likely an accuracy issue. I found something similar with sinpi/cospi. My k_sinpi.c has #if 0 double x2; /* Horner's method. */ x2 = x * x; sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3) + s2) + s1; #else double x2, x4; /* Sort of Estrin's method. */ x2 = x * x; x4 = x2 * x2; sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4 + (s2 + s3 * x2 + s4 * x4) * x2 + s1; #endif The addition of s1 must occur last, or I get greater 1 ULP for a number of values. My timing show on Intel core2 duo /* Horner's methods. */ laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22 sinpi time: 0.066 usec per call cospi time: 0.063 usec per call /* Estrin's methods. */ laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22 sinpi time: 0.059 usec per call cospi time: 0.055 usec per call Here, -n 22 means to test 2**22 - 1 values in the interval [0,0.25]. For the older functions that were committed a few years ago, I should probably go back to recheck timings and accuracy with and without Estrin's method. -- Steve 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Tue Apr 25 08:28:08 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A9E72D4EB9C; Tue, 25 Apr 2017 08:28:08 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 4040C1C43; Tue, 25 Apr 2017 08:28:07 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id B19C0104877B; Tue, 25 Apr 2017 18:27:57 +1000 (AEST) Date: Tue, 25 Apr 2017 18:27:52 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170425063643.GA45906@troutmask.apl.washington.edu> Message-ID: <20170425173542.S1401@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu> <20170425143120.R933@besplex.bde.org> <20170425063643.GA45906@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=pMMk53J5QUvvM_BoBYoA:9 a=JWkNqyrBzBJkAUbv:21 a=d3hanlBOIU_nJgXb:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Apr 2017 08:28:08 -0000 On Mon, 24 Apr 2017, Steve Kargl wrote: > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote: >> [for expl] >> This is arranged to look a bit like Horner's method at first (operations >> not written in separate statements), but it is basically Estrin's method >> with no extensions to more complicated sub-expressions than Estrin's, >> but complications for accuracy: >> - I think accuracy for the B3 term is unimportant, but for some reason >> (perhaps just the one in the comment) it is grouped separately. Estrin >> might use x3 * (B3.e + x * B4.e). > > Yes, it is most likely an accuracy issue. I found something > similar with sinpi/cospi. My k_sinpi.c has > > #if 0 > double x2; > /* Horner's method. */ > x2 = x * x; > sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3) > + s2) + s1; Horner's of course has about as many dependencies as operations (13), so on modern x86 this is limited by the latency of 3-5 cycles/instructions or 39-65 cycles. Multiplications take an extra cycle in more cases, so it is better to have more additions, but unfortunately Horner gives 1 more multuplication. With no dependencies, the limit would be throughput -- 2 instructions/cycle of 6.5 cycles rounded up to 7; then add latency for the last operation to get about 11. With practical dependencies, it is hard to get more than a 2-fold speedup. > #else > double x2, x4; > /* Sort of Estrin's method. */ > x2 = x * x; > x4 = x2 * x2; > sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4 > + (s2 + s3 * x2 + s4 * x4) * x2 + s1; > #endif This looks OK. The only obvious improvement is to try grouping x4*x4 so that it can be evaluated separately. This reduces the length of dependency chains by 1, but increases early pressure on multiplication. Don't write this as x8 = x4*x4. Compilers will do the equivalent. You can also replace x4 by (x2*x2) but there are many x4's so this might be less readable. > The addition of s1 must occur last, or I get greater 1 ULP > for a number of values. Indeed. This prevents using the best order, which is the opposite. You used the best order for the other inner terms, but not so obviously for the first outer addition. However, the compiler is free to reorder that: consider: T2 + T1 + T0 T2 is more complicated and everything depends on it, so should be started first, but possibly T1 can be started first because it is simpler. Here it is not significantly simpler. However, for: T3 + T2 + T1 + T0 the natural left-to-right grouping is usually pessimial. This should be written as: T3 + (T2 + T1) + T0 Unfortunately, we can't write this naturally and usually optimally as: T0 + T1 + T2 + T3 due to the accuracy problem with T0. Maybe right this as: T1 + T2 + T3 + T0 // only T0 in unnatural order For expl, I tried all orders. > My timing show on Intel core2 duo > > /* Horner's methods. */ > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22 > sinpi time: 0.066 usec per call > cospi time: 0.063 usec per call > > /* Estrin's methods. */ > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22 > sinpi time: 0.059 usec per call > cospi time: 0.055 usec per call > > Here, -n 22 means to test 2**22 - 1 values in the interval > [0,0.25]. The improvement is marginal. Only worth doing if it doesn't complicate the code much. Older CPUs can't exploit te parallelism so well, so I would expect a relatively larger improvement. Don't measure in usec, since they are hard to scale. 0.66 usec and 1.86 GHz is more than 100 cycles. This is quite slow. I get similar times for cos and sin on an older Athlon-XP (~40 in float prec, ~160 in double and ~270 in long double). Newer CPUs are about 4 times faster in double prec (only 2 times fast in float prec) in cycles despite having similar nominal timing, by unclogging their pipes. We should also try to use the expression A*x + B more often so that things with be faster when this is done by fma, but this happens naturally anyway. Bruce From owner-freebsd-numerics@freebsd.org Wed Apr 26 03:23:58 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4FE8CD50FC3; Wed, 26 Apr 2017 03:23:58 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2B09C9C; Wed, 26 Apr 2017 03:23:58 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3Q3No8O087663 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 25 Apr 2017 20:23:50 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3Q3NoEL087662; Tue, 25 Apr 2017 20:23:50 -0700 (PDT) (envelope-from sgk) Date: Tue, 25 Apr 2017 20:23:50 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170426032350.GA87524@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu> <20170425143120.R933@besplex.bde.org> <20170425063643.GA45906@troutmask.apl.washington.edu> <20170425173542.S1401@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170425173542.S1401@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 26 Apr 2017 03:23:58 -0000 On Tue, Apr 25, 2017 at 06:27:52PM +1000, Bruce Evans wrote: > On Mon, 24 Apr 2017, Steve Kargl wrote: > > > On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote: > > >> [for expl] > >> This is arranged to look a bit like Horner's method at first (operations > >> not written in separate statements), but it is basically Estrin's method > >> with no extensions to more complicated sub-expressions than Estrin's, > >> but complications for accuracy: > >> - I think accuracy for the B3 term is unimportant, but for some reason > >> (perhaps just the one in the comment) it is grouped separately. Estrin > >> might use x3 * (B3.e + x * B4.e). > > > > Yes, it is most likely an accuracy issue. I found something > > similar with sinpi/cospi. My k_sinpi.c has > > > > #if 0 > > double x2; > > /* Horner's method. */ > > x2 = x * x; > > sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3) > > + s2) + s1; > > Horner's of course has about as many dependencies as operations (13), so > on modern x86 this is limited by the latency of 3-5 cycles/instructions > or 39-65 cycles. Multiplications take an extra cycle in more cases, so > it is better to have more additions, but unfortunately Horner gives 1 > more multuplication. > > With no dependencies, the limit would be throughput -- 2 instructions/cycle > of 6.5 cycles rounded up to 7; then add latency for the last operation to > get about 11. > > With practical dependencies, it is hard to get more than a 2-fold speedup. > > > #else > > double x2, x4; > > /* Sort of Estrin's method. */ > > x2 = x * x; > > x4 = x2 * x2; > > sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4 > > + (s2 + s3 * x2 + s4 * x4) * x2 + s1; > > #endif > > This looks OK. The only obvious improvement is to try grouping x4*x4 > so that it can be evaluated separately. This reduces the length of > dependency chains by 1, but increases early pressure on multiplication. > Don't write this as x8 = x4*x4. Compilers will do the equivalent. > You can also replace x4 by (x2*x2) but there are many x4's so this > might be less readable. The grouping of (x4*x4) and (x2 * x4) in other polynomials indeed give small improvements. I've reworked the approximations in sinpi[l], cospi[l], and tanpi[l]. I did not see any improves with the float versions as the order of the polynomials are fairly small (order of 4). > > The addition of s1 must occur last, or I get greater 1 ULP > > for a number of values. > > Indeed. This prevents using the best order, which is the opposite. > You used the best order for the other inner terms, but not so obviously > for the first outer addition. However, the compiler is free to reorder > that: consider: > > T2 + T1 + T0 > > T2 is more complicated and everything depends on it, so should be started > first, but possibly T1 can be started first because it is simpler. Here > it is not significantly simpler. However, for: > > T3 + T2 + T1 + T0 > > the natural left-to-right grouping is usually pessimial. This should be > written as: > > T3 + (T2 + T1) + T0 > > Unfortunately, we can't write this naturally and usually optimally as: > > T0 + T1 + T2 + T3 > > due to the accuracy problem with T0. Maybe right this as: > > T1 + T2 + T3 + T0 // only T0 in unnatural order > > For expl, I tried all orders. I tried a few different orders; particularly with the long double functions. I also managed to break the code a few times. Thankfully, I have everything in a subversion repository. > > > My timing show on Intel core2 duo > > > > /* Horner's methods. */ > > laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22 > > sinpi time: 0.066 usec per call > > cospi time: 0.063 usec per call > > > > /* Estrin's methods. */ > > laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22 > > sinpi time: 0.059 usec per call > > cospi time: 0.055 usec per call > > > > Here, -n 22 means to test 2**22 - 1 values in the interval > > [0,0.25]. > > The improvement is marginal. Only worth doing if it doesn't complicate > the code much. Older CPUs can't exploit te parallelism so well, so I > would expect a relatively larger improvement. Don't measure in usec, > since they are hard to scale. 0.66 usec and 1.86 GHz is more than 100 > cycles. This is quite slow. I get similar times for cos and sin on an > older Athlon-XP (~40 in float prec, ~160 in double and ~270 in long > double). Newer CPUs are about 4 times faster in double prec (only 2 > times fast in float prec) in cycles despite having similar nominal > timing, by unclogging their pipes. I'm assuming your 0.66 in the above is a typo as I wrote 0.066. I also got some improvement by changing the splitting of a few entities into high and low parts by a simple cast. Instead of doing EXTRACT_WORDS(hx, lx, x); INSERT_WORDS(xhi, hx, 0); xlo = x - xhi; I do xhi = (float)x; xlo = x - xhi; but this seems to only help in some situation. My new timing for x in [0,0.25] on my 1995.05 MHz Core2 duo gives Horner sinpi time: 0.073 usec per call (~146 cycles) cospi time: 0.070 usec per call (~140) tanpi time: 0.112 usec per call (~224) Estrin sinpi time: 0.064 usec per call (~128) cospi time: 0.060 usec per call (~120) tanpi time: 0.088 usec per call (~176) tanpi benefits the most as it has an order 16 polynomial, which allows some freedom in group terms. I don't think I can do much better. For example, the approximation for sinpi(x) is sinpi(x) = x * (s0 + x2 * S(x2)) where S(x2) is the polynomial and I need to split x2 into high and low parts and s0 is split. The kernel with the Estrin method and with the polynomial coefficients is static inline double __kernel_sinpi(double x) { double hi, lo, sm, xhi, xlo; uint32_t hx, lx; double x2, x4; /* Sort of Estrin's method. */ x2 = x * x; x4 = x2 * x2; sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4) + (s2 + s3 * x2 + s4 * x4) * x2 + s1; EXTRACT_WORDS(hx, lx, x); INSERT_WORDS(xhi, hx, 0); xlo = x - xhi; lo = xlo * (x + xhi) * sm; hi = xhi * xhi * sm; lo = x * lo + xlo * (s0hi + s0lo) + x * hi; return (lo + xhi * s0lo + xhi * s0hi); } The order of terms in the last 5 lines matters. Testing yields a = 2.4220192246482908e-01, /* 0x3fcf0078, 0xfc01e3f0 */ sinpi(a) = 6.8957335930938313e-01, /* 0x3fe610fc, 0x264da72e */ sin(pi*a) = 6.8957335930938324e-01, /* 0x3fe610fc, 0x264da72f */ ULP: 0.79950446 MAX ULP: 0.79950446 Total tested: 8388606 1.0 < ULP : 0 0.9 < ULP <= 1.0: 0 0.8 < ULP <= 0.9: 0 0.7 < ULP <= 0.8: 554 0.6 < ULP <= 0.7: 11275 > We should also try to use the expression A*x + B more often so that > things with be faster when this is done by fma, but this happens > naturally anyway. Using A*x+B or B+A*x did not seem to matter. -- Steve 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Thu Apr 27 23:14:17 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B33AED53837; Thu, 27 Apr 2017 23:14:17 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 9AE9F6D4; Thu, 27 Apr 2017 23:14:17 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3RNEBps011407 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 27 Apr 2017 16:14:11 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3RNEBRS011406; Thu, 27 Apr 2017 16:14:11 -0700 (PDT) (envelope-from sgk) Date: Thu, 27 Apr 2017 16:14:11 -0700 From: Steve Kargl To: freebsd-numerics@freebsd.org Cc: freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170427231411.GA11346@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170409220809.GA25076@troutmask.apl.washington.edu> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Apr 2017 23:14:17 -0000 On Sun, Apr 09, 2017 at 03:08:09PM -0700, Steve Kargl wrote: > Both IEEE-754 2008 and ISO/IEC TS 18661-4 define the half-cycle > trignometric functions cospi, sinpi, and tanpi. The attached > patch implements cospi[fl], sinpi[fl], and tanpi[fl]. Limited > testing on the cospi and sinpi reveal a max ULP less than 0.89; > while tanpi is more problematic with a max ULP less than 2.01 > in the interval [0,0.5]. The algorithms used in these functions > are documented in {ks}_cospi.c, {ks}_sinpi.c, and s_tanpi.c. > > Note 1. ISO/IEC TS 18661-4 says these funstions are guarded by > a predefine macro. I have no idea or interest in what clang and > gcc do with regards to this macro. I've put the functions behind > __BSD_VISIBLE. > > Note 2. I no longer have access to a system with ld128 and > adequate support to compile and test the ld128 implementations > of these functions. Given the almost complete lack of input from > others on improvements to libm, I doubt that anyone cares. If > someone does care, the ld128 files contain a number of FIXME comments, > and in particular, while the polynomial coefficients are given > I did not update the polynomial algorithms to properly use the > coefficients. > > The code is attached the bug reportr. > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514 > I have attached a new diff to the bugzilla report. The diff is 3090 lines and won't be broadcast the mailing list. This diff includes fixes for a few inconsequential bugs and implements modified Estrin's method for sum a few ploynomials. If you want the previous Horner's method then add -DHORNER to your CFLAGS. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Fri Apr 28 01:01:23 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AF430D4DF87; Fri, 28 Apr 2017 01:01:23 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 9A01B274; Fri, 28 Apr 2017 01:01:23 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3S11M2W012863 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 27 Apr 2017 18:01:22 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3S11MRW012862; Thu, 27 Apr 2017 18:01:22 -0700 (PDT) (envelope-from sgk) Date: Thu, 27 Apr 2017 18:01:22 -0700 From: Steve Kargl To: freebsd-numerics@freebsd.org Cc: freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428010122.GA12814@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 01:01:23 -0000 On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote: > > I have attached a new diff to the bugzilla report. The > diff is 3090 lines and won't be broadcast the mailing list. > > This diff includes fixes for a few inconsequential bugs > and implements modified Estrin's method for sum a few > ploynomials. If you want the previous Horner's method > then add -DHORNER to your CFLAGS. > For those curious about testing, here are some numbers for the Estrin's method. These were obtained on an AMD FX(tm)-8350 @ 4018.34-MHz. The times are in microseconds and the number in parentheses are estimated cycles. | cospi | sinpi | tanpi ------------+--------------+--------------+-------------- float | 0.0089 (37) | 0.0130 (54) | 0.0194 (80) double | 0.0134 (55) | 0.0160 (66) | 0.0249 (102) long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393) ------------+--------------+--------------+-------------- The timing tests are done on the interval [0,0.25] as this is the interval where the polynomial approximations apply. Limited accuracy testing gives | cospif | cospi | cospil -----------------+------------+------------+------------- MAX ULP | 0.88242114 | 0.89339466 | 0.88799449 Total tested | 16777214 | 16777214 | 16777214 0.8 < ULP <= 0.9 | 1613 | 1682 | 1601 0.7 < ULP <= 0.8 | 33925 | 33817 | 33843 0.6 < ULP <= 0.7 | 171799 | 172775 | 171899 -----------------+------------+------------+------------- | sinpif | sinpi | sinpil -----------------+------------+------------+------------- MAX ULP | 0.80103672 | 0.79851085 | 0.77049407 Total tested | 16777214 | 16777214 | 16777214 0.8 < ULP <= 0.9 | 1 | 0 | 0 0.7 < ULP <= 0.8 | 727 | 1075 | 546 0.6 < ULP <= 0.7 | 19268 | 22722 | 19019 -----------------+------------+------------+------------- x in [0,0.25] | tanpif | tanpi | tanpil -----------------+------------+------------+------------- MAX ULP | 1.37954760 | 1.37300168 | 1.38800823 Total tested | 16777214 | 16777214 | 16777214 1.0 < ULP | 39109 | 38712 | 38798 0.9 < ULP <= 1.0 | 130373 | 131086 | 130803 0.8 < ULP <= 0.9 | 342624 | 341350 | 341155 0.7 < ULP <= 0.8 | 687244 | 687122 | 686814 0.6 < ULP <= 0.7 | 1078964 | 1079881 | 1078555 -----------------+------------+------------+------------- In the interval [0.25,0.5] tanpi[fl] is computed by cospi / sinpi. The numbers look like x in [0.25,0.5] | tanpif | tanpi | tanpil -----------------+------------+------------+------------- MAX ULP | 1.93529165 | 2.04485533 | 1.95823689 Total tested | 16777215 | 16777215 | 16777215 1.0 < ULP | 699310 | 701244 | 698946 0.9 < ULP <= 1.0 | 442978 | 443344 | 443599 0.8 < ULP <= 0.9 | 642991 | 642561 | 642601 0.7 < ULP <= 0.8 | 889357 | 890307 | 889233 0.6 < ULP <= 0.7 | 1176473 | 1177407 | 1177121 -----------------+------------+------------+------------- -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Fri Apr 28 04:18:51 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 542DDD53230; Fri, 28 Apr 2017 04:18:51 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 36C457B5; Fri, 28 Apr 2017 04:18:51 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3S4In0M013553 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 27 Apr 2017 21:18:49 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3S4Intr013552; Thu, 27 Apr 2017 21:18:49 -0700 (PDT) (envelope-from sgk) Date: Thu, 27 Apr 2017 21:18:49 -0700 From: Steve Kargl To: freebsd-numerics@freebsd.org Cc: freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428041849.GA13544@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 04:18:51 -0000 On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote: > > > > The code is attached the bug reportr. > > > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514 > > > > I have attached a new diff to the bugzilla report. The > diff is 3090 lines and won't be broadcast the mailing list. > > This diff includes fixes for a few inconsequential bugs > and implements modified Estrin's method for sum a few > ploynomials. If you want the previous Horner's method > then add -DHORNER to your CFLAGS. > Grrrr. Find a sloppy theshold can be fun. Index: src/s_cospif.c =================================================================== --- src/s_cospif.c (revision 1916) +++ src/s_cospif.c (working copy) @@ -61,7 +61,7 @@ SET_FLOAT_WORD(ax, ix); if (ix < 0x3f800000) { /* |x| < 1 */ - if (ix < 0x39000000) { /* |x| < 0x1p-13 */ + if (ix < 0x38800000) { /* |x| < 0x1p-14 */ if (huge + ax > 0) /* Raise inexact iff != 0. */ return (1); } -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Fri Apr 28 08:00:26 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C6440D51F54 for ; Fri, 28 Apr 2017 08:00:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 928321047 for ; Fri, 28 Apr 2017 08:00:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 7A00D428223; Fri, 28 Apr 2017 18:00:16 +1000 (AEST) Date: Fri, 28 Apr 2017 18:00:19 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170427231411.GA11346@troutmask.apl.washington.edu> Message-ID: <20170428175851.A1386@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=kwgiUtXkrd9uoIphUegA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 08:00:26 -0000 [cc list trimmed] On Thu, 27 Apr 2017, Steve Kargl wrote: > I have attached a new diff to the bugzilla report. The > diff is 3090 lines and won't be broadcast the mailing list. But no one will see it there. Bruce From owner-freebsd-numerics@freebsd.org Fri Apr 28 08:55:18 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 76190D5279F; Fri, 28 Apr 2017 08:55:18 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id 15EAFEA0; Fri, 28 Apr 2017 08:55:17 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id ED08AD4A8E5; Fri, 28 Apr 2017 18:33:57 +1000 (AEST) Date: Fri, 28 Apr 2017 18:33:54 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170428041849.GA13544@troutmask.apl.washington.edu> Message-ID: <20170428180046.P1386@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428041849.GA13544@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=6I5d2MoRAAAA:8 a=uopPRt1wpuTJ01loTfUA:9 a=44QarpAT_w9gG2nY:21 a=cCMfiCuOYRyRPczO:21 a=CjuIK1q_8ugA:10 a=IjZwj45LgO3ly-622nXo:22 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 08:55:18 -0000 On Thu, 27 Apr 2017, Steve Kargl wrote: > On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote: >>> >>> The code is attached the bug reportr. >>> >>> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218514 >> >> I have attached a new diff to the bugzilla report. The >> diff is 3090 lines and won't be broadcast the mailing list. > ... > Grrrr. Find a sloppy theshold can be fun. > > Index: src/s_cospif.c > =================================================================== > --- src/s_cospif.c (revision 1916) > +++ src/s_cospif.c (working copy) > @@ -61,7 +61,7 @@ > SET_FLOAT_WORD(ax, ix); > > if (ix < 0x3f800000) { /* |x| < 1 */ > - if (ix < 0x39000000) { /* |x| < 0x1p-13 */ > + if (ix < 0x38800000) { /* |x| < 0x1p-14 */ > if (huge + ax > 0) /* Raise inexact iff != 0. */ > return (1); > } This looks too different from s_cosf.c: XX if(ix <= 0x3f490fda) { /* |x| ~<= pi/4 */ XX if(ix<0x39800000) /* |x| < 2**-12 */ XX if(((int)x)==0) return 1.0; /* 1 with inexact if x != 0 */ XX return __kernel_cosdf(x); XX } The (int)x == 0 method is perhaps not best (especially not the excessive parentheses used in it), but it was verified to be faster than other methods. This might be a matter of tricking the compiler into not optimizing the unusual case. (__predict ugliness doesn't work any better.) Non-formatting style bugs include: - arguably worse spelling of powers of 2 in comments - less information in the comment about inexact. I try to change all similar comments to be consistent whenever I change their code. s_cos.c still has a variant with less detail "/* generate inexact". s_sinl.c is still broken by your not copying the fdlibm code: XX /* If x = +-0 or x is a subnormal number, then sin(x) = x */ XX if (z.bits.exp == 0) XX return (x); This is missing not only the comment about setting inexact, but also the code. This is also poorly optimized. XX /* Optimize the case where x is already within range. */ XX if (z.e < M_PI_4) { XX hi = __kernel_sinl(z.e, 0, 0); XX RETURNI(s ? -hi : hi); XX } This is is considerably more broken. It underflows for small x. Evaluation of polynomials always depends on not doing it for tiny x, since evaluation of higher terms always underflows for tiny x even when x is not denormal (except when the poly is linear, the T1*x term doesn't underflow if either T1 is 0 or T1 is larger than about 1 and x is not denormal). Filtering out tiny x is just an optimization, but prevents this underflow. It sometimes also gives free tests for 0 and denormals (I don't see how to fold this test into the others), and free optimizations for tiny x. It does do this for sin(x), provided we don't support non-standard rounding modes. The correct value is just x with inexact if x != 0. s_cosl.c has the same bug. I sent you preliminary patches in 2008 when the long double trig functions were committed, but left fixing these bugs as an exercise. Actually, I thought I fixed this in cosl() and only left sinl() as an exercise. XX diff -u2 s_cosl.c~ s_cosl.c XX --- s_cosl.c~ Thu May 30 18:17:21 2013 XX +++ s_cosl.c Tue Sep 6 03:43:57 2016 XX @@ -55,21 +55,17 @@ XX long double y[2]; XX long double hi, lo; XX + int ex; XX XX z.e = x; XX - z.bits.sign = 0; XX - XX - /* If x = +-0 or x is a subnormal number, then cos(x) = 1 */ XX - if (z.bits.exp == 0) XX - return (1.0); XX - XX - /* If x = NaN or Inf, then cos(x) = NaN. */ XX - if (z.bits.exp == 32767) XX - return ((x - x) / (x - x)); XX + ex = z.bits.exp; Mostly style and minor efficiency fixes. XX XX ENTERI(); XX XX - /* Optimize the case where x is already within range. */ XX - if (z.e < M_PI_4) XX - RETURNI(__kernel_cosl(z.e, 0)); XX + if (ex < 0x3ffe || (ex == 0x3ffe && z.bits.manh < 0xc90fdaa2)) XX + RETURNI(__kernel_cosl(x, 0)); Efficiency fix for the comparison. But no fix for underflow, etc. XX + XX + /* If x = NaN or Inf, then cos(x) = NaN. */ XX + if (ex == 32767) XX + RETURNI(x - x); XX XX e0 = __ieee754_rem_pio2l(x, y); ISTR breaking s_sinf.c and/or s_cosf.c with the invalid optimization of just evaluating the polynomial, to avoid the branch for the micro-optimization. But it is not juste for an optimization. Bruce From owner-freebsd-numerics@freebsd.org Fri Apr 28 09:39:45 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 75D16D53FF8; Fri, 28 Apr 2017 09:39:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 0CEC0ECC; Fri, 28 Apr 2017 09:39:44 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 5551B1048F3C; Fri, 28 Apr 2017 19:13:17 +1000 (AEST) Date: Fri, 28 Apr 2017 19:13:16 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170428010122.GA12814@troutmask.apl.washington.edu> Message-ID: <20170428183733.V1497@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=HgMA7pdRT-D6GSaveAAA:9 a=bWHRhd8LnqQueheO:21 a=6RyRNACK7q-QfLbh:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 09:39:45 -0000 On Thu, 27 Apr 2017, Steve Kargl wrote: > On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote: >> >> I have attached a new diff to the bugzilla report. The >> diff is 3090 lines and won't be broadcast the mailing list. >> >> This diff includes fixes for a few inconsequential bugs >> and implements modified Estrin's method for sum a few >> ploynomials. If you want the previous Horner's method >> then add -DHORNER to your CFLAGS. > > For those curious about testing, here are some numbers > for the Estrin's method. These were obtained on an AMD > FX(tm)-8350 @ 4018.34-MHz. The times are in microseconds > and the number in parentheses are estimated cycles. > > | cospi | sinpi | tanpi > ------------+--------------+--------------+-------------- > float | 0.0089 (37) | 0.0130 (54) | 0.0194 (80) > double | 0.0134 (55) | 0.0160 (66) | 0.0249 (102) > long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393) > ------------+--------------+--------------+-------------- > > The timing tests are done on the interval [0,0.25] as > this is the interval where the polynomial approximations > apply. Limited accuracy testing gives These still seem slow. I did a quick test of naive evaluations of these functions as standard_function(Pi * x) and get times a bit faster on Haswell, except 2-4 times faster for the long double case, with the handicaps of using gcc-3.3.3 and i386. Of course, the naive evaluation is inaccurate, especially near multiples of Pi/2. > x in [0,0.25] | tanpif | tanpi | tanpil > -----------------+------------+------------+------------- > MAX ULP | 1.37954760 | 1.37300168 | 1.38800823 Just use the naive evaluation to get similar errors in this range. It is probably faster too. For tiny x, both reduce to the approximation Pi*x, with an error like this expected unless the evaluation is done in extra precision. > In the interval [0.25,0.5] tanpi[fl] is computed by > cospi / sinpi. The numbers look like > > x in [0.25,0.5] | tanpif | tanpi | tanpil > -----------------+------------+------------+------------- > MAX ULP | 1.93529165 | 2.04485533 | 1.95823689 The errors build up only linearly in the number of operations, which is good. Note that on i386 with its extended precision, in float precision the naive method is accurate to nearly 0.5 ulps provided you use extended precision for Pi, the multiplication, and also the function, so sinpif() is only worth having if it can do this almost as fast as sinf() (about 15 cycles throughput and less than 100 latency (50?) on modern x86). The extra precision is used automatically by sinf() (by using a double hack. Double is not very different from float+extended on i386). I think accuracy is enough up to extend float precision up to a useful multiple of Pi (suppose double precision and not full extended, so only 53 bits for Pi, so 29 extra; lose 24 to cancelations and 5 are left, so the accuracy is enough up to about 2**5*Pi). Bruce From owner-freebsd-numerics@freebsd.org Fri Apr 28 16:56:59 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A4CE3D549E2; Fri, 28 Apr 2017 16:56:59 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 859451613; Fri, 28 Apr 2017 16:56:59 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SGuwdl017872 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 28 Apr 2017 09:56:58 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SGuwdV017871; Fri, 28 Apr 2017 09:56:58 -0700 (PDT) (envelope-from sgk) Date: Fri, 28 Apr 2017 09:56:58 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428165658.GA17560@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170428183733.V1497@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 16:56:59 -0000 On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote: > On Thu, 27 Apr 2017, Steve Kargl wrote: > > > On Thu, Apr 27, 2017 at 04:14:11PM -0700, Steve Kargl wrote: > >> > >> I have attached a new diff to the bugzilla report. The > >> diff is 3090 lines and won't be broadcast the mailing list. > >> > >> This diff includes fixes for a few inconsequential bugs > >> and implements modified Estrin's method for sum a few > >> ploynomials. If you want the previous Horner's method > >> then add -DHORNER to your CFLAGS. > > > > For those curious about testing, here are some numbers > > for the Estrin's method. These were obtained on an AMD > > FX(tm)-8350 @ 4018.34-MHz. The times are in microseconds > > and the number in parentheses are estimated cycles. > > > > | cospi | sinpi | tanpi > > ------------+--------------+--------------+-------------- > > float | 0.0089 (37) | 0.0130 (54) | 0.0194 (80) > > double | 0.0134 (55) | 0.0160 (66) | 0.0249 (102) > > long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393) > > ------------+--------------+--------------+-------------- > > > > The timing tests are done on the interval [0,0.25] as > > this is the interval where the polynomial approximations > > apply. Limited accuracy testing gives > > These still seem slow. I did a quick test of naive evaluations of > these functions as standard_function(Pi * x) and get times a bit faster > on Haswell, except 2-4 times faster for the long double case, with the > handicaps of using gcc-3.3.3 and i386. Of course, the naive evaluation > is inaccurate, especially near multiples of Pi/2. The slowness comes from handling extra precision arithmetic. For sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial approximation and x2 = x * x. One needs to take care in the ordering on the evaluation where x and x2 are split into high and low parts. See the source code posted in response to your other email. > > x in [0,0.25] | tanpif | tanpi | tanpil > > -----------------+------------+------------+------------- > > MAX ULP | 1.37954760 | 1.37300168 | 1.38800823 > > Just use the naive evaluation to get similar errors in this > range. It is probably faster too. I haven't checked tanpif, but naive evaluation is faster for sinpif(x). But getting the worry answer fast seems to be counter to what one may expect from a math library. Consider the two naive implementations: float bde1(float x) { return (sinf((float)M_PI * x)); } float bde2(float x) { return (sinf((float)(M_PI * x))); } float bde3(float x) { return ((float)sin(M_PI * x)); } x in [0,0.25] | sinpif | bde1(x) | bde2(x) | bde3(x) -----------------+-------------+-------------+--------------+------------ Time usec (cyc) | 0.0130 (54) | 0.0084 (35) | 0.0093 (38) | 0.0109 (45) MAX ULP | 0.80103672 | 1.94017851 | 1.46830785 | 0.5 1.0 < ULP | 0 | 736496 | 47607 | 0 0.9 < ULP <= 1.0 | 0 | 563122 | 101162 | 0 0.8 < ULP <= 0.9 | 1 | 751605 | 386128 | 0 0.7 < ULP <= 0.8 | 727 | 942349 | 753647 | 0 0.6 < ULP <= 0.7 | 19268 | 1169719 | 1123518 | 0 x in [3990,3992] | sinpif | bde1(x) | bde2(x) | bde3(x) -----------------+-------------+-------------+-------------+------------ Time usec (cyc) | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87) MAX ULP | 0.65193975 | 10458803. | 6471461. | 0.5 1.0 < ULP | 0 | 16773117 | 16762878 | 0 0.9 < ULP <= 1.0 | 0 | 0 | 0 | 0 0.8 < ULP <= 0.9 | 0 | 0 | 0 | 0 0.7 < ULP <= 0.8 | 0 | 0 | 0 | 0 0.6 < ULP <= 0.7 | 19268 | 2047 | 2047 | 0 So bde3(x) is best, but if you're going to use sin(M_PI*x) for float sinpif, then use sinpi((double)x) which is faster than bde3 and just as accurate. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Fri Apr 28 17:05:29 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1EAE8D54C27 for ; Fri, 28 Apr 2017 17:05:29 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id E60081A9F for ; Fri, 28 Apr 2017 17:05:28 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SH5S9J017958 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 28 Apr 2017 10:05:28 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SH5SDB017957; Fri, 28 Apr 2017 10:05:28 -0700 (PDT) (envelope-from sgk) Date: Fri, 28 Apr 2017 10:05:28 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428170528.GD17560@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428175851.A1386@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170428175851.A1386@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 17:05:29 -0000 On Fri, Apr 28, 2017 at 06:00:19PM +1000, Bruce Evans wrote: > > On Thu, 27 Apr 2017, Steve Kargl wrote: > > > I have attached a new diff to the bugzilla report. The > > diff is 3090 lines and won't be broadcast the mailing list. > > But no one will see it there. > Here's the diff in its full glory (without the sloppy threshold fix). There no signature at the end. Index: lib/msun/Makefile =================================================================== --- lib/msun/Makefile (revision 317529) +++ lib/msun/Makefile (working copy) @@ -59,6 +59,7 @@ s_asinh.c s_asinhf.c s_atan.c s_atanf.c s_carg.c s_cargf.c s_cargl.c \ s_cbrt.c s_cbrtf.c s_ceil.c s_ceilf.c \ s_copysign.c s_copysignf.c s_cos.c s_cosf.c \ + s_cospi.c s_cospif.c \ s_csqrt.c s_csqrtf.c s_erf.c s_erff.c \ s_exp2.c s_exp2f.c s_expm1.c s_expm1f.c s_fabsf.c s_fdim.c \ s_finite.c s_finitef.c \ @@ -74,7 +75,10 @@ s_rint.c s_rintf.c s_round.c s_roundf.c \ s_scalbln.c s_scalbn.c s_scalbnf.c s_signbit.c \ s_signgam.c s_significand.c s_significandf.c s_sin.c s_sinf.c \ - s_tan.c s_tanf.c s_tanh.c s_tanhf.c s_tgammaf.c s_trunc.c s_truncf.c \ + s_sinpi.c s_sinpif.c \ + s_tan.c s_tanf.c s_tanh.c s_tanhf.c \ + s_tanpi.c s_tanpif.c \ + s_tgammaf.c s_trunc.c s_truncf.c \ w_cabs.c w_cabsf.c w_drem.c w_dremf.c # Location of fpmath.h and _fpmath.h @@ -100,11 +104,16 @@ e_lgammal.c e_lgammal_r.c \ e_remainderl.c e_sinhl.c e_sqrtl.c \ invtrig.c k_cosl.c k_sinl.c k_tanl.c \ - s_asinhl.c s_atanl.c s_cbrtl.c s_ceill.c s_cosl.c s_cprojl.c \ + s_asinhl.c s_atanl.c s_cbrtl.c s_ceill.c s_cosl.c \ + s_cospil.c \ + s_cprojl.c \ s_csqrtl.c s_erfl.c s_exp2l.c s_expl.c s_floorl.c s_fmal.c \ s_fmaxl.c s_fminl.c s_frexpl.c s_logbl.c s_logl.c s_nanl.c \ s_nextafterl.c s_nexttoward.c s_remquol.c s_rintl.c s_roundl.c \ - s_scalbnl.c s_sinl.c s_tanhl.c s_tanl.c s_truncl.c w_cabsl.c + s_scalbnl.c s_sinl.c \ + s_sinpil.c \ + s_tanhl.c s_tanl.c \ + s_tanpil.c s_truncl.c w_cabsl.c .endif # C99 complex functions @@ -131,13 +140,19 @@ MAN= acos.3 acosh.3 asin.3 asinh.3 atan.3 atan2.3 atanh.3 \ ceil.3 cacos.3 ccos.3 ccosh.3 cexp.3 \ - cimag.3 copysign.3 cos.3 cosh.3 csqrt.3 erf.3 exp.3 fabs.3 fdim.3 \ + cimag.3 copysign.3 cos.3 cosh.3 \ + cospi.3 \ + csqrt.3 erf.3 exp.3 fabs.3 fdim.3 \ feclearexcept.3 feenableexcept.3 fegetenv.3 \ fegetround.3 fenv.3 floor.3 \ fma.3 fmax.3 fmod.3 hypot.3 ieee.3 ieee_test.3 ilogb.3 j0.3 \ lgamma.3 log.3 lrint.3 lround.3 math.3 nan.3 \ nextafter.3 remainder.3 rint.3 \ - round.3 scalbn.3 signbit.3 sin.3 sinh.3 sqrt.3 tan.3 tanh.3 trunc.3 \ + round.3 scalbn.3 signbit.3 sin.3 sinh.3 \ + sinpi.3 \ + sqrt.3 tan.3 tanh.3 \ + tanpi.3 \ + trunc.3 \ complex.3 MLINKS+=acos.3 acosf.3 acos.3 acosl.3 @@ -166,6 +181,7 @@ MLINKS+=copysign.3 copysignf.3 copysign.3 copysignl.3 MLINKS+=cos.3 cosf.3 cos.3 cosl.3 MLINKS+=cosh.3 coshf.3 cosh.3 coshl.3 +MLINKS+=cospi.3 cospif.3 cospi.3 cospil.3 MLINKS+=csqrt.3 csqrtf.3 csqrt.3 csqrtl.3 MLINKS+=erf.3 erfc.3 erf.3 erff.3 erf.3 erfcf.3 erf.3 erfl.3 erf.3 erfcl.3 MLINKS+=exp.3 expm1.3 exp.3 expm1f.3 exp.3 expm1l.3 exp.3 pow.3 exp.3 powf.3 \ @@ -216,10 +232,12 @@ MLINKS+=scalbn.3 scalbnf.3 scalbn.3 scalbnl.3 MLINKS+=sin.3 sinf.3 sin.3 sinl.3 MLINKS+=sinh.3 sinhf.3 sinh.3 sinhl.3 +MLINKS+=sinpi.3 sinpif.3 sinpi.3 sinpil.3 MLINKS+=sqrt.3 cbrt.3 sqrt.3 cbrtf.3 sqrt.3 cbrtl.3 sqrt.3 sqrtf.3 \ sqrt.3 sqrtl.3 MLINKS+=tan.3 tanf.3 tan.3 tanl.3 MLINKS+=tanh.3 tanhf.3 tanh.3 tanhl.3 +MLINKS+=tanpi.3 tanpif.3 tanpi.3 tanpil.3 MLINKS+=trunc.3 truncf.3 trunc.3 truncl.3 .include Index: lib/msun/Symbol.map =================================================================== --- lib/msun/Symbol.map (revision 317529) +++ lib/msun/Symbol.map (working copy) @@ -294,4 +294,13 @@ casinl; catanl; catanhl; + cospi; + cospif; + cospil; + sinpi; + sinpif; + sinpil; + tanpi; + tanpif; + tanpil; }; Index: lib/msun/ld128/k_cospil.c =================================================================== --- lib/msun/ld128/k_cospil.c (nonexistent) +++ lib/msun/ld128/k_cospil.c (working copy) @@ -0,0 +1,73 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_cospi.c for implementation details. + * + * FIXME: This has not been compiled nor has it been tested for accuracy. + * FIXME: This should use bit twiddling. + */ + +static const long double +c0 = 1, +c1hi = -4.93480220054467930941724549993807541e+00L, +c1lo = -1.61159244006703024615160135944400084e-34L, +c02 = 4.05871212641676821818501386202937963e+00L, +c03 = -1.33526276885458949587530478285057448e+00L, +c04 = 2.35330630358893204541879352772399571e-01L, +c05 = -2.58068913900140600125982936922503890e-02L, +c06 = 1.92957430940392304790328456072058283e-03L, +c07 = -1.04638104924845707113757250204498056e-04L, +c08 = 4.30306958703294680806457740580669010e-06L, +c09 = -1.38789524622131344829253970500107607e-07L, +c10 = 3.60473079732244660718578413933676910e-09L, +c11 = -7.70070692297072534641720608808474790e-11L, +c12 = 1.37684487309360112303144478033041762e-12L, +c13 = -2.07957304470476734518573046614539573e-14L; + +static inline long double +__kernel_cospi(long double x) +{ + union IEEEl2bits u; + long double c, chi, clo, x2, x2lo; + double x2hi; + + x2 = x * x; + + c = c09 + (c10 + (c11 + (c12 + c13 * x2) * x2) * x2) * x2; + c = c05 + (c06 + (c07 + (c08 + c * x2) * x2) * x2) * x2; + c = (c02 + (c03 + (c04 + c * x2) * x2) * x2) * x2 + c1hi; + + if (x2 < 0x1p-6) /* |x| < 0x1p-6 */ + return (c0 + c * x2); + + x2hi = x2; + x2lo = x2 - x2hi; + + c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0); + + return (c); +} Property changes on: lib/msun/ld128/k_cospil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld128/k_sinpil.c =================================================================== --- lib/msun/ld128/k_sinpil.c (nonexistent) +++ lib/msun/ld128/k_sinpil.c (working copy) @@ -0,0 +1,73 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_sinpi.c for implementation details. + * + * FIXME: This has not been compiled nor has it been tested for accuracy. + * FIXME: This should use bit twiddling. + * + * Compute sinpi(x) = sin(pi*x) in the interval [0,0.25] + */ + +static const long double +s0lo = 3.14159265358979322702026593105983920e+00L, +s0md = 1.14423774522196636802274144823600323e-17L, +s0lo = -5.10459839933399352423791604618557491e-52L, +s01 = -5.16771278004997002924605251118356578e+00L, +s02 = 2.55016403987734544385617758369521912e+00L, +s03 = -5.99264529320792076887739383518724498e-01L, +s04 = 8.21458866111282287988023605285125252e-02L, +s05 = -7.37043094571435077725854643888422319e-03L, +s06 = 4.66302805767612564382677987452304496e-04L, +s07 = -2.19153534478302140522509092972929589e-05L, +s08 = 7.95205400147494466063630770553231865e-07L, +s09 = -2.29484289960219849488354905723109270e-08L, +s10 = 5.39266447760508581521426149354268607e-10L, +s11 = -1.05182947998557665557194979309998767e-11L, +s12 = 1.72036427603492093074818577722479034e-13L; + +static inline long double +__kernel_sinpi(long double x) +{ + union IEEEl2bits u; + long double hi, lo, sm, x2, xlo; + double xhi; + + /* Split x into high and low parts. */ + xhi = x; + xlo = x - xhi; + + x2 = x * x; + sm = x2 * (x2 * (x2 * (x2 * s12 + s11) + s10) + s09) + s08; + sm = x2 * (x2 * (x2 * (x2 * sm + s07) + s06) + s05) + s04; + sm = x2 * (x2 * (x2 * sm + s03) + s02) + s01; + lo = xlo * (x + xhi) * sm; + hi = sm * xhi * xhi; + lo = x * (lo + s0lo) + xlo * (s0hi + s0md) + x * hi; + + return (lo + xhi * s0md + xhi * s0hi); +} Property changes on: lib/msun/ld128/k_sinpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld128/s_cospil.c =================================================================== --- lib/msun/ld128/s_cospil.c (nonexistent) +++ lib/msun/ld128/s_cospil.c (working copy) @@ -0,0 +1,98 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_cospi.c for implementation details. + * + * FIXME: This has not been compiled nor has it been tested for accuracy. + * FIXME: This should use bit twiddling. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospi.c" +#include "k_sinpi.c" + +static inline long double +__compute_cospil(long double x) +{ + + if (x < 0.5) { + if (x < 0.25) + return (__kernel_cospil(x)); + return (__kernel_sinpil(0.5 - x)); + } else { + if (x < 0.75) + return (-__kernel_sinpil(x - 0.5)); + return (-__kernel_cospil(1 - x)); + } +} + +long double +cospil(long double x) +{ + const double huge = 1e+300; + long double ax, xf; + uint32_t ix; + + ax = fabsl(x); + + if (ax < 1) { /* |x| < 1 */ + if (ax < 0x1p-60) { /* |x| < 0x1p-60 */ + if (huge + x > 0) + return (1); + } + if (ax == 0.5) + return (0); + return (__compute_cospil(ax)); + } + + if (ax < 0x1p112) { /* 1 <= |x| < 0x1p112 */ + + xf = floorl(ax); + + ax -= x; + if (ax == 0.5) + return (0); + + ax = ax == 0 ? 1 : __compute_cospil(ax); + + if (xf > 0x1p50) xf -= 0x1p50; + if (xf > 0x1p30) xf -= 0x1p30; + ix = (uint32_t)xf; + + return (ix & 1 ? -ax : ax); + } + + if (isinf(x) || isnan(x)) + return (x - x); + + /* + * |x| >= 0x1p112 is always an even integer, so return 1. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + return (1); +} Property changes on: lib/msun/ld128/s_cospil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld128/s_sinpil.c =================================================================== --- lib/msun/ld128/s_sinpil.c (nonexistent) +++ lib/msun/ld128/s_sinpil.c (working copy) @@ -0,0 +1,100 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_sinpi.c for implementation details. + * + * FIXME: This has not been compiled nor has it been tested for accuracy. + * FIXME: This should use bit twiddling. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospil.c" +#include "k_sinpil.c" + +static const long double pihi = 3.14159265358979322702026593105983920e+00L; +static const long double pilo = 1.14423774522196636802434264184180742e-17L; + +static inline long double +__compute_sinpil(long double x) +{ + + if (x < 0.5) { + if (x <= 0.25) + return (__kernel_sinpil(x)); + return (__kernel_cospil(0.5 - x)); + } else { + if (x < 0.75) + return (__kernel_cospil(x - 0.5)); + return (__kernel_sinpil(1 - x)); + } +} + +long double +sinpi(long double x) +{ + long double ax, xf; + uint32_t ix; + + ax = fabsl(x); + + if (ax < 1) { /* |x| < 1 */ + if (ax < 0x1p-60) { /* |x| < 0x1p-60 */ + if (x == 0) + return (x); + return (pilo * x + pihi * x); + } + ax = __compute_sinpil(ax); + return (copysignl(ax, x)); + } + + if (ax < 0x1p112) { /* 1 <= |x| < 0x1p112 */ + + xf = floorl(ax); + + ax -= xf; + if (ax == 0) { + ax = 0; + } else { + ax = __compute_sinpil(ax); + if (xf > 0x1p50) xf -= 0x1p50; + if (xf > 0x1p30) xf -= 0x1p30; + ix = (uint32_t)xf; + if (ix & 1) ax = -ax; + } + return (copysignl(ax, x)); + } + + if (isinf(x) || isnan(x)) + return (x - x); + + /* + * |x| >= 0x1p112 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + return (copysignl(0, x)); +} Property changes on: lib/msun/ld128/s_sinpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld128/s_tanpil.c =================================================================== --- lib/msun/ld128/s_tanpil.c (nonexistent) +++ lib/msun/ld128/s_tanpil.c (working copy) @@ -0,0 +1,98 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_tanpi.c for implementation details. + * + * FIXME: This has not been compiled nor has it been tested for accuracy. + * FIXME: This should use bit twiddling. + * FIXME: This should use a polynomial approximation in [0,0.25], but the + * FIXME: increase in max ULP is probably small, so who cares. + */ + +#include "math.h" +#include "math_private.h" + +static const long double pihi = 3.14159265358979322702026593105983920e+00L; +static const long double pilo = 1.14423774522196636802434264184180742e-17L; + +static inline long double +__k_tanpi(long double x) +{ + + return (sinpil(x) / cospil(x)); +} + +static inline long double +__compute_tanpi(long double x) +{ + + return (x < 0.5 ? __k_tanpil(x) : -__k_tanpil(1 - x)); +} + +long double +tanpil(long double x) +{ + long double ax, xf; + uint32_t ix; + + ax = fabsl(ax); + + if (ax < 1) { /* |x| < 1 */ + if (ax < 0x1p-60) { /* |x| < 0x1p-60 */ + if (x == 0) + return (x); + return (pilo * x + pihi * x); + } + if (ax == 0.5) + return ((x - x) / (x - x)); + ax = __compute_tanpil(ax); + return (copysignl(ax, x)); + } + + if (ix < 0x1p112) { /* 1 <= |x| < 0x1p112 */ + + xf = floorl(ax); + + ax -= xf; + + if (ax == 0.5) + return ((x - x) / (x - x)); + + ax = ax == 0 ? 0 : __compute_tanpil(ax); + return (copysignl(ax, x)); + } + + /* x = +-inf or nan. */ + if (isinf(x) || isnan(x)) + return (x - x); + + /* + * |x| >= 0x1p53 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + return (copysignl(0, x)); +} Property changes on: lib/msun/ld128/s_tanpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld80/k_cospil.c =================================================================== --- lib/msun/ld80/k_cospil.c (nonexistent) +++ lib/msun/ld80/k_cospil.c (working copy) @@ -0,0 +1,78 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_cospi.c for implementation details. + */ + +static const union IEEEl2bits +c1hiu = LD80C(0x9de9e64df22ef2d2, 2, -4.93480220054467930927e+00L), +c1lou = LD80C(0xadc4baf0dd791c3e, -63, -1.47187625629472133798e-19L), +c2u = LD80C(0x81e0f840dad61d9b, 2, 4.05871212641676821836e+00L); +#define c1hi (c1hiu.e) +#define c1lo (c1lou.e) +#define c2 (c2u.e) + +static const double +c3 = -1.3352627688545895e+00, /* 0xbff55d3c, 0x7e3cbffa */ +c4 = 2.3533063035889312e-01, /* 0x3fce1f50, 0x6891bab8 */ +c5 = -2.5806891390008115e-02, /* 0xbf9a6d1f, 0x2a2043da */ +c6 = 1.9295743091289794e-03, /* 0x3f5f9d38, 0xa362e3d1 */ +c7 = -1.0463809745687055e-04, /* 0xbf1b6e24, 0xd372e145 */ +c8 = 4.3029513401606985e-06, /* 0x3ed20c42, 0x4210907c */ +c9 = -1.3777927680211844e-07; /* 0xbe827e0f, 0x55d52bbb */ + +static const double c0 = 1; + +static inline long double +__kernel_cospil(long double x) +{ + long double c, chi, clo, x2hi, x2lo; + +#if HORNER + /* Horner's method. */ + long double x2; + x2 = x * x; + c = (c2 + (c3 + (c4 + (c5 + (c6 + (c7 + (c8 + c9 * x2) * x2) + * x2) * x2) * x2) * x2) * x2) * x2 + c1hi; +#else + /* Sort of Estrin's method. */ + long double x2, x4; + x2 = x * x; + x4 = x2 * x2; + c = ((c9 * x2 + c8) * x4 + (c7 * x2 + c6)) * x2 * x4 * x4 + + ((c5 * x2 + c4) * x4 + (c3 * x2 + c2)) * x2 + c1hi; +#endif + if (x2 < 0x1p-6) + return (c0 + c * x2); + x2hi =(float)x2; + x2lo = x2 - x2hi; + chi = (float)c; + clo = c - chi + c1lo; + c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0); + + return (c); +} Property changes on: lib/msun/ld80/k_cospil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld80/k_sinpil.c =================================================================== --- lib/msun/ld80/k_sinpil.c (nonexistent) +++ lib/msun/ld80/k_sinpil.c (working copy) @@ -0,0 +1,78 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_sinpi.c for implementation details. + */ + +static const union IEEEl2bits +s0hiu = LD80C(0xc90fdaa200000000, 1, 3.14159265346825122833e+00L), +s0lou = LD80C(0x85a308d313167b73, -33, 1.21542010130121322891e-10L), +s1u = LD80C(0xa55de7312df295f5, 2, -5.16771278004997002909e+00L), +s2u = LD80C(0xa335e33bad570e85, 1, 2.55016403987734544098e+00L); + +#define s0hi (s0hiu.e) +#define s0lo (s0lou.e) +#define s1 (s1u.e) +#define s2 (s2u.e) + +static const double +s3 = -5.9926452932079166e-01, /* 0xbfe32d2c 0xce62bd82 */ +s4 = 8.2145886611090388e-02, /* 0x3fb50783 0x487edcdb */ +s5 = -7.3704309439650085e-03, /* 0xbf7e3074 0xfdc9c0c9 */ +s6 = 4.6630275825043123e-04, /* 0x3f3e8f43 0x18c2a7a0 */ +s7 = -2.1914601021339888e-05, /* 0xbef6faa7 0xea41adee */ +s8 = 7.8877623841310396e-07; /* 0x3eaa7789 0x4aacabad */ + +static inline long double +__kernel_sinpil(long double x) +{ + long double hi, lo, sm, xhi, xlo; + uint64_t lx; + uint16_t ix; + +#if HORNER + /* Horner's method. */ + long double x2; + x2 = x * x; + sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s8 + s7) + s6) + + s5) + s4) + s3) + s2) + s1; +#else + /* Sort of Estrin's method. */ + long double a, b, x2, x4; + x2 = x * x; + x4 = x2 * x2; + sm = ((x2 * s8 + s7) * x4 + (x2 * s6 + s5)) * (x4 * x4) + + (x2 * s4 + s3) * x4 + x2 * s2 + s1; +#endif + xhi = (float)x; + xlo = x - xhi; + lo = xlo * (x + xhi) * sm; + hi = xhi * xhi * sm; + lo = x * lo + xlo * (s0hi + s0lo) + x * hi; + + return (lo + xhi * s0lo + xhi * s0hi); +} Property changes on: lib/msun/ld80/k_sinpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld80/s_cospil.c =================================================================== --- lib/msun/ld80/s_cospil.c (nonexistent) +++ lib/msun/ld80/s_cospil.c (working copy) @@ -0,0 +1,118 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_cospi.c for implementation details. + */ + +#ifdef __i386__ +#include +#endif +#include + +#include "fpmath.h" +#include "math.h" +#include "math_private.h" +#include "k_cospil.c" +#include "k_sinpil.c" + +static inline long double +__compute_cospil(long double x) +{ + + if (x < 0.5) { + if (x < 0.25) + return (__kernel_cospil(x)); + return (__kernel_sinpil(0.5 - x)); + } else { + if (x < 0.75) + return (-__kernel_sinpil(x - 0.5)); + return (-__kernel_cospil(1 - x)); + } +} + +long double +cospil(long double x) +{ + volatile static const double vzero = 0; + const double huge = 1e+300; + long double ax; + uint32_t j0; + uint64_t lx; + uint16_t hx, ix; + + EXTRACT_LDBL80_WORDS(hx, lx, x); + ix = hx & 0x7fff; + INSERT_LDBL80_WORDS(ax, ix, lx); + + ENTERI(); + + if (ix < 0x3fff) { /* |x| < 1 */ + if (ix < 0x3fdd) { /* |x| < 0x1p-34 */ + if (huge + x > 0) + RETURNI(1); + } + if (ix == 0x3ffe && lx == 0x8000000000000000ull) + RETURNI(0); + RETURNI(__compute_cospil(ax)); + } + + if (ix < 0x403e) { /* 1 <= |x| < 0x1p63 */ + /* Determine integer part of ax. */ + j0 = ix - 0x3fff + 1; + if (j0 < 32) { + lx = (lx >> 32) << 32; + lx &= ~(((lx << 32)-1) >> j0); + } else { + uint64_t m = (uint64_t)-1 >> (j0 + 1); + if (lx & m) lx &= ~m; + } + INSERT_LDBL80_WORDS(x, ix, lx); + + ax -= x; + if (ax == 0.5) + RETURNI(0); + + ax = ax == 0 ? 1 : __compute_cospil(ax); + + if (j0 > 40) x -= 0x1p40; + if (j0 > 30) x -= 0x1p30; + j0 = (uint32_t)x; + + return (j0 & 1 ? -ax : ax); + } + + if (ix >= 0x7fff) + RETURNI(vzero / vzero); + + /* + * |x| >= 0x1p63 is always an even integer, so return 1. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + if (ax + 1 > 1) + ax = 1; + return (ax); +} Property changes on: lib/msun/ld80/s_cospil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld80/s_sinpil.c =================================================================== --- lib/msun/ld80/s_sinpil.c (nonexistent) +++ lib/msun/ld80/s_sinpil.c (working copy) @@ -0,0 +1,123 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_sinpi.c for implementation details. + */ + +#ifdef __i386__ +#include +#endif +#include + +#include "fpmath.h" +#include "math.h" +#include "math_private.h" +#include "k_cospil.c" +#include "k_sinpil.c" + +static inline long double +__compute_sinpil(long double x) +{ + + if (x <= 0.5) { + if (x <= 0.25) + return (__kernel_sinpil(x)); + return (__kernel_cospil(0.5 - x)); + } else { + if (x <= 0.75) + return (__kernel_cospil(x - 0.5)); + return (__kernel_sinpil(1 - x)); + } +} + +static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */ +static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */ + +long double +sinpil(long double x) +{ + volatile static const double vzero = 0; + long double ax; + uint32_t j0; + uint64_t lx; + uint16_t hx, ix; + + EXTRACT_LDBL80_WORDS(hx, lx, x); + ix = hx & 0x7fff; + INSERT_LDBL80_WORDS(ax, ix, lx); + + ENTERI(); + + if (ix < 0x3fff) { /* |x| < 1 */ + if (ix < 0x3fdc) { /* |x| < 0x1p-35 */ + if (x == 0) + RETURNI(x); + INSERT_LDBL80_WORDS(ax, hx, (lx >> 32) << 32); + x -= ax; + RETURNI(pilo * x + pihi * x + pilo * ax + pihi * ax); + } + ax = __compute_sinpil(ax); + RETURNI((hx & 0x8000) ? -ax : ax); + } + + if (ix < 0x403e) { /* 1 <= |x| < 0x1p63 */ + /* Determine integer part of ax. */ + j0 = ix - 0x3fff + 1; + if (j0 < 32) { + lx = (lx >> 32) << 32; + lx &= ~(((lx << 32)-1) >> j0); + } else { + uint64_t m = (uint64_t)-1 >> (j0 + 1); + if (lx & m) lx &= ~m; + } + INSERT_LDBL80_WORDS(x, ix, lx); + + ax -= x; + if (ax == 0) { + ax = 0; + } else { + ax = __compute_sinpil(ax); + if (j0 > 40) x -= 0x1p40; + if (j0 > 30) x -= 0x1p30; + j0 = (uint32_t)x; + if (j0 & 1) ax = -ax; + } + RETURNI((hx & 0x8000) ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7fff) + RETURNI(vzero / vzero); + + /* + * |x| >= 0x1p63 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + if (ax + 1 > 1) + ax = copysignl(0, x); + RETURNI(ax); +} Property changes on: lib/msun/ld80/s_sinpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/ld80/s_tanpil.c =================================================================== --- lib/msun/ld80/s_tanpil.c (nonexistent) +++ lib/msun/ld80/s_tanpil.c (working copy) @@ -0,0 +1,195 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_tanpi.c for implementation details. + */ + +#ifdef __i386__ +#include +#endif +#include + +#include "fpmath.h" +#include "math.h" +#include "math_private.h" +#include "k_cospil.c" +#include "k_sinpil.c" + +static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */ +static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */ + +static const union IEEEl2bits +t00hiu = LD80C(0xc90fdaa22168c235, 1, 3.14159265358979323851e+00L), +t00lou = LD80C(0xecdfa9e9520ede79, -65, -5.01599541779735313611e-20L), +t01hiu = LD80C(0xa55de7312df295f5, 3, 1.03354255600999400582e+01L), +t01lou = LD80C(0x8d51cc1ca8412de2, -62, 2.39404580332694307957e-19L), +t02hiu = LD80C(0xa335e33bad570ec7, 5, 4.08026246380375272847e+01L), +t02lou = LD80C(0xe72db1e59703a7ba, -60, -1.56652642070969805771e-18L), +t03u = LD80C(0xa2fffcda474ac1d3, 7, 1.62999951975255277944e+02L), +t04u = LD80C(0xa2fa3971d7988829, 9, 6.51909756145994949750e+02L), +t05u = LD80C(0xa2f99792aec80f33, 11, 2.60759950512193695427e+03L), +t06u = LD80C(0xa2f985abafb74d99, 13, 1.04303805377441052764e+04L), +t07u = LD80C(0xa2f983819a4712aa, 15, 4.17215136963294523085e+04L); +#define t00 (t00hiu.e) +#define t00lo (t00lou.e) +#define t01 (t01hiu.e) +#define t01lo (t01lou.e) +#define t02 (t02hiu.e) +#define t02lo (t02lou.e) +#define t03 (t03u.e) +#define t04 (t04u.e) +#define t05 (t05u.e) +#define t06 (t06u.e) +#define t07 (t07u.e) + +static const double +t08 = 1.6688612339742415e+05, /* 0x41045f30, 0xfcb7c9e9 */ +t09 = 6.6753884815733030e+05, /* 0x41245f25, 0xb241ad77 */ +t10 = 2.6704954502371075e+06, /* 0x41445fcf, 0xb9a15e9a */ +t11 = 1.0666012381827524e+07, /* 0x41645803, 0x8c37ee5b */ +t12 = 4.3253268125143178e+07, /* 0x41849ff0, 0xa1004b11 */ +t13 = 1.5587209092892912e+08, /* 0x41a294d6, 0xb5db9c99 */ +t14 = 1.0162479548145952e+09, /* 0x41ce495b, 0x496844a8 */ +t15 = -2.9808831901354914e+09, /* 0xc1e63595, 0x5ec455f2 */ +t16 = 8.5833091951381027e+10, /* 0x4233fc0d, 0x0b6f618b */ +t17 = -6.8395913998321326e+11, /* 0xc263e7e4, 0x87d1e6d3 */ +t18 = 5.1641660627562871e+12, /* 0x4292c981, 0x228a9126 */ +t19 = -2.1258706455671141e+13, /* 0xc2b355ad, 0xa58c7724 */ +t20 = 5.2041706779955102e+13; /* 0x42c7aa73, 0xb91a998d */ + +static inline long double +__kernel_tanpil(long double x) +{ + long double t, xhi, xlo; + uint64_t lx; + uint16_t ix; + + xhi = (float)x; + xlo = x - xhi; + xlo *= (xlo + xhi + xhi); + xhi *= xhi; + + if (x < 0.25) { +#if HORNER + long double a, b, c, x2; + x2 = xlo + xhi; + a = x2 * (x2 * (x2 * (x2 * t20 + t19) + t18) + t17) + t16; + b = x2 * (x2 * (x2 * (x2 * a + t15) + t14) + t13) + t12; + c = x2 * (x2 * (x2 * (x2 * b + t11) + t10) + t09) + t08; + t = x2 * (x2 * (x2 * (x2 * c + t07) + t06) + t05); +#else + long double d, x2, x4; + double a, b, c, d2, d4; + d2 = x2 = xlo + xhi; + d4 = x4 = x2 * x2; + a = ((d2 * t20 + t19) * d4 + (d2 * t18 + t17)) * d4; + b = ((d2 * t16 + t15) * d4 + (d2 * t14 + t13)); + c = ((d2 * t12 + t11) * d4 + (d2 * t10 + t09)); + d = ((x2 * t08 + t07) * x4 + (x2 * t06 + t05)); + t = (((a * d4 + b) * (x4 * x4) + c) * (x4 * x4) + d) * x2; +#endif + t = t02lo + x2 * (x2 * (t + t04) + t03) + t02; + t = t01lo + xlo * t + xhi * t + t01; + t = t00lo + xlo * t + xhi * t + t00; + return (t * x); + } else if (x > 0.25) { + x = 0.5 - x; + t = __kernel_cospil(x) / __kernel_sinpil(x); + } else + t = 1; + return (t); +} + +static inline long double +__compute_tanpil(long double x) +{ + + return (x < 0.5 ? __kernel_tanpil(x) : -__kernel_tanpil(1 - x)); +} + +long double +tanpil(long double x) +{ + volatile static const double vzero = 0; + long double ax; + uint32_t j0; + uint64_t lx; + uint16_t hx, ix; + + EXTRACT_LDBL80_WORDS(hx, lx, x); + ix = hx & 0x7fff; + INSERT_LDBL80_WORDS(ax, ix, lx); + + ENTERI(); + + if (ix < 0x3fff) { /* |x| < 1 */ + if (ix < 0x3fdd) { /* |x| < 0x1p-34 */ + if (x == 0) + RETURNI(x); + INSERT_LDBL80_WORDS(ax, ix, (lx >> 32) << 32); + x -= ax; + RETURNI(pilo * x + pihi * x + pilo * ax + pihi * ax); + } + if (ix == 0x3ffe && lx == 0x8000000000000000ull) + RETURNI((x - x) / (x - x)); + ax = __compute_tanpil(ax); + RETURNI((hx & 0x8000) ? -ax : ax); + } + + if (ix < 0x403e) { /* 1 <= |x| < 0x1p63 */ + /* Determine integer part of ax. */ + j0 = ix - 0x3fff + 1; + if (j0 < 32) { + lx = (lx >> 32) << 32; + lx &= ~(((lx << 32)-1) >> j0); + } else { + uint64_t m = (uint64_t)-1 >> (j0 + 1); + if (lx & m) lx &= ~m; + } + INSERT_LDBL80_WORDS(x, ix, lx); + + ax -= x; + + if (ax == 0.5) + RETURNI((x - x) / (x - x)); + + ax = ax == 0 ? 0 : __compute_tanpil(ax); + RETURNI((hx & 0x8000) ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7fff) + RETURNI(vzero / vzero); + + /* + * |x| >= 0x1p63 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + if (ax + 1 > 1) + ax = copysignl(0, x); + RETURNI(ax); +} Property changes on: lib/msun/ld80/s_tanpil.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/man/cospi.3 =================================================================== --- lib/msun/man/cospi.3 (nonexistent) +++ lib/msun/man/cospi.3 (working copy) @@ -0,0 +1,111 @@ +.\" Copyright (c) 2017 Steven G. Kargl +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 1, 2017 +.Dt COSPI 3 +.Os +.Sh NAME +.Nm cospi , +.Nm cospif , +.Nm cospil +.Nd half\(encycle cosine functions +.Sh LIBRARY +.Lb libm +.Sh SYNOPSIS +.In math.h +.Ft double +.Fn cospi "double x" +.Ft float +.Fn cospif "float x" +.Ft long double +.Fn cospil "long double x" +.Sh DESCRIPTION +The +.Fn cospi , +.Fn cospif , +and +.Fn cospil +functions compute the cosine of +.Fa "\(*p \(mu x" . +and measure angles in half-cycles. +.Sh RETURN VALUES +The +.Fn cospi , +.Fn cospif , +and +.Fn cospil +functions returns +.Fn cos "\(*p \(mu x" . +If \*(Bax\*(Ba \*(Ge 2^(p - 1) +where p is the floating\(enpoint precision of +.Ar x , +then the returned value is 1 and it has no significance. +.Sh SPECIAL VALUES +.Bl -tag +.It +.Fn cospi \*(Pm0 +returns 1. +.It +.Fn cospi \*(Pmn/2 +returns 0 for positive integers +.Ar n . +.It +.Fn cospi n +returns 1 for even integers +.Ar n . +.It +.Fn cospi n +returns \-1 for odd integers +.Ar n . +.It +.Fn cospi \*(Pm\(if +return an \*(Na and raises an FE_INVALID exception. +.It +.Fn cospi \*(Na +return an \*(Na and raises an FE_INVALID exception. +.El +.Sh SEE ALSO +.Xr cos 3 , +.Xr fenv 3 , +.Xr math 3 , +.Xr sin 3 , +.Xr sinpi 3 , +.Xr tan 3 , +.Xr tanpi 3 +.Sh AUTHORS +The half\(encycle trignometric functions were written by +.An Steven G. Kargl Aq Mt kargl@FreeBSD.org . +.Sh STANDARDS +These functions conform to +IEEE Std 754\(tm\(en2008 , +\(dqIEEE Standard for Floating-Point Arithmetic\(dq +and to +ISO/IEC TS 18661-4 , +\(dqInformation technology \(em Programming languages, their environments, +and system software interfaces \(em Floating\(enpoint extensions for +C\(dq \(em Part 4: Supplementary functions. + + Property changes on: lib/msun/man/cospi.3 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/man/sinpi.3 =================================================================== --- lib/msun/man/sinpi.3 (nonexistent) +++ lib/msun/man/sinpi.3 (working copy) @@ -0,0 +1,102 @@ +.\" Copyright (c) 2017 Steven G. Kargl +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 1, 2017 +.Dt SINPI 3 +.Os +.Sh NAME +.Nm sinpi , +.Nm sinpif , +.Nm sinpil +.Nd half\(encycle sine functions +.Sh LIBRARY +.Lb libm +.Sh SYNOPSIS +.In math.h +.Ft double +.Fn sinpi "double x" +.Ft float +.Fn sinpif "float x" +.Ft long double +.Fn sinpil "long double x" +.Sh DESCRIPTION +The +.Fn sinpi , +.Fn sinpif , +and +.Fn sinpil +functions compute the sine of +.Fa "\(*p \(mu x" . +and measure angles in half-cycles. +.Sh RETURN VALUES +The +.Fn sinpi , +.Fn sinpif , +and +.Fn sinpil +functions returns +.Fn sin "\(*p \(mu x" . +If \*(Bax\*(Ba \*(Ge 2^(p - 1) +where p is the floating\(enpoint precision of +.Ar x , +then the returned value is \*(Pm0 and it has no significance. +.Sh SPECIAL VALUES +.Bl -tag +.It +.Fn sinpi \*(Pm0 +returns \*(Pm0. +.It +.Fn sinpi \*(Pmn +returns \*(Pm0 for positive integers +.Ar n . +.It +.Fn sinpi \*(Pm\(if +return an \*(Na and raises an FE_INVALID exception. +.It +.Fn sinpi \*(Na +return an \*(Na and raises an FE_INVALID exception. +.El +.Sh SEE ALSO +.Xr cos 3 , +.Xr cospi 3 , +.Xr fenv 3 , +.Xr math 3 , +.Xr sin 3 , +.Xr tan 3 , +.Xr tanpi 3 +.Sh AUTHORS +The half\(encycle trignometric functions were written by +.An Steven G. Kargl Aq Mt kargl@FreeBSD.org . +.Sh STANDARDS +These functions conform to +IEEE Std 754\(tm\(en2008 , +\(dqIEEE Standard for Floating-Point Arithmetic\(dq +and to +ISO/IEC TS 18661-4 , +\(dqInformation technology \(em Programming languages, their environments, +and system software interfaces \(em Floating\(enpoint extensions for +C\(dq \(em Part 4: Supplementary functions. + Property changes on: lib/msun/man/sinpi.3 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/man/tanpi.3 =================================================================== --- lib/msun/man/tanpi.3 (nonexistent) +++ lib/msun/man/tanpi.3 (working copy) @@ -0,0 +1,106 @@ +.\" Copyright (c) 2017 Steven G. Kargl +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. +.\" +.\" $FreeBSD$ +.\" +.Dd April 1, 2017 +.Dt TANPI 3 +.Os +.Sh NAME +.Nm tanpi , +.Nm tanpif , +.Nm tanpil +.Nd half\(encycle tangent functions +.Sh LIBRARY +.Lb libm +.Sh SYNOPSIS +.In math.h +.Ft double +.Fn tanpi "double x" +.Ft float +.Fn tanpif "float x" +.Ft long double +.Fn tanpil "long double x" +.Sh DESCRIPTION +The +.Fn tanpi , +.Fn tanpif , +and +.Fn tanpil +functions compute the tangent of +.Fa "\(*p \(mu x" +and measure angles in half-cycles. +.Sh RETURN VALUES +The +.Fn tanpi , +.Fn tanpif , +and +.Fn tanpil +functions returns +.Fn tan "\(*p \(mu x" . +If \*(Bax\*(Ba \*(Ge 2^(p - 1) +where p is the floating\(enpoint precision of +.Ar x , +then the returned value is \*(Pm0 and it has no significance. +.Sh SPECIAL VALUES +.Bl -tag +.It +.Fn tanpi \*(Pm0 +returns \*(Pm0. +.It +.Fn tanpi \*(Pmn +returns \*(Pm0 for positive integers +.Ar n . +.It +.Fn tanpi \*(Pmn/2 +returns \*(Na for n > 0 and raises an FE_INVALID exception. +.It +.Fn tanpi \*(Pm\(if +return an \*(Na and raises an FE_INVALID exception. +.It +.Fn tanpi \*(Na +return an \*(Na and raises an FE_INVALID exception. +.El +.Sh SEE ALSO +.Xr cos 3 , +.Xr cospi 3 , +.Xr fenv 3 , +.Xr math 3 , +.Xr sin 3 , +.Xr sinpi 3 , +.Xr tan 3 , +.Sh AUTHORS +The half\(encycle trignometric functions were written by +.An Steven G. Kargl Aq Mt kargl@FreeBSD.org . +.Sh STANDARDS +These functions conform to +IEEE Std 754\(tm\(en2008 , +\(dqIEEE Standard for Floating-Point Arithmetic\(dq +and to +ISO/IEC TS 18661-4 , +\(dqInformation technology \(em Programming languages, their environments, +and system software interfaces \(em Floating\(enpoint extensions for +C\(dq \(em Part 4: Supplementary functions. + + Property changes on: lib/msun/man/tanpi.3 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/k_cospi.c =================================================================== --- lib/msun/src/k_cospi.c (nonexistent) +++ lib/msun/src/k_cospi.c (working copy) @@ -0,0 +1,83 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * Compute cospi(x) = cos(pi*x) via a polynomial approximation. Coefficients + * were determined for the following function: + * + * f(x) = ((cos(pi*x) - 1) / (x*x)) = c1 + c2*x**2 + c3*x**4 + ... + * + * with an Remes algorithm. If |x*x| < 0x1p-6, the polynomial can be summed + * using a Horner's method. If |x*x| exceeds 0x1p-6, then to accumulate the + * result with sufficient accuracy the c1 coefficient and final + * multiplications needed to be handled specially. + * + * x2 = x * x + * C(x2) = c1hi + c2*x2 + c3*x2**2 + ... + * = c1hi + (c2 + (c3 ...) * x2) * x2 + * cospi(x) = c0 + x2 * C(x2) + */ + +static const double +c0 = 1.0000000000000000e+00, /* 0x3ff00000 0x00000000 */ +c1hi = -4.9348022005446790e+00, /* 0xc013bd3c 0xc9be45de */ +c1lo = -3.1132368291465357e-16, /* 0xbcb66ee8 0x86887579 */ +c2 = 4.0587121264167649e+00, /* 0x40103c1f 0x081b5ac0 */ +c3 = -1.3352627688538099e+00, /* 0xbff55d3c 0x7e3cb243 */ +c4 = 2.3533063028402096e-01, /* 0x3fce1f50 0x68689166 */ +c5 = -2.5806887965588832e-02, /* 0xbf9a6d1e 0xef4b8280 */ +c6 = 1.9294938875092037e-03, /* 0x3f5f9ce2 0x49428066 */ +c7 = -1.0370089699114971e-04; /* 0xbf1b2f3f 0xd835d4fe */ + +static inline double +__kernel_cospi(double x) +{ + double c, chi, clo, x2hi, x2lo; +#if HORNER + /* Horner's method. */ + double x2; + x2 = x * x; + c = (c2 + (c3 + (c4 + (c5 + (c6 + c7 * x2) * x2) * x2) * x2) + * x2) * x2 + c1hi; +#else + /* Sort of Estrin's method. */ + double x2, x4; + x2 = x * x; + x4 = x2 * x2; + c = (c5 + c6 * x2 + c7 * x4) * (x4 * x4) + + (c2 + c3 * x2 + c4 * x4) * x2 + c1hi; +#endif + if (x2 < 0x1p-6) + return (c0 + c * x2); + + x2hi = (float)x2; + x2lo = x2 - x2hi; + chi = (float)c; + clo = c - chi + c1lo; + c = clo * x2lo + (chi * x2lo + clo * x2hi) + (chi * x2hi + c0); + + return (c); +} Property changes on: lib/msun/src/k_cospi.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/k_cospif.c =================================================================== --- lib/msun/src/k_cospif.c (nonexistent) +++ lib/msun/src/k_cospif.c (working copy) @@ -0,0 +1,60 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_cospi.c for implementation details. + */ + +static const float +c0 = 1.00000000e+00f, /* 0x3fc00000 */ +c1hi = -4.93480206e+00f, /* 0xc09de9e6 0xb418a08d */ +c1lo = -1.42145112e-07f, /* 0xc09de9e6 0xb418a08d */ +c2 = 4.05871058e+00f, /* 0x4081e0f5 0xb1fc37ea */ +c3 = -1.33513784e+00f, /* 0xbfaae5cc 0xb375e792 */ +c4 = 2.32126340e-01f; /* 0x3e6db287 0x31daa25f */ + +static inline float +__kernel_cospif(float x) +{ + float c, chi, clo, x2, x2hi, x2lo; + uint32_t ix; + + x2 = x * x; + c = (c2 + (c3 + c4 * x2) * x2) * x2 + c1hi; + + GET_FLOAT_WORD(ix, x2); + if (ix < 0x3c800000) /* |x2| < 0x1p-6 */ + return (c0 + c * x2); + + SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); + x2lo = x2 - x2hi; + GET_FLOAT_WORD(ix, c); + SET_FLOAT_WORD(chi, (ix >> 14) << 14); + clo = c - chi + c1lo; + c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0); + + return (c); +} Property changes on: lib/msun/src/k_cospif.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/k_sinpi.c =================================================================== --- lib/msun/src/k_sinpi.c (nonexistent) +++ lib/msun/src/k_sinpi.c (working copy) @@ -0,0 +1,84 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * Compute sinpi(x) = sin(pi*x) via a polynomial approximation. Coefficients + * were determined for the following function: + * + * f(x) = sin(pi*x) / x = s0 + s1*x**2 + s2*x**4 + ... + * + * with an Remes algorithm. To accumulate the result with sufficient + * accuracy the s0 coefficient and final multiplications needed to be + * handled specially. + * + * x2 = x * x + * S(x2) = s1 + s2*x2 + s3*x2**2 + ... + * = s1 + (s2 + (s3 ...) * x2) * x2 + * sinpi(x) = x * (s0 + x2 * S(x2)) + * + * The final multiplications and additions in the above expression for + * sinpi(x) are done in an extra precision arithmetic. + */ + +static const double +s0hi = 3.1415926218032837e+00, /* 0x400921fb 0x50000000 */ +s0lo = 3.1786509547050787e-08, /* 0x3e6110b4 0x611a5f14 */ +s1 = -5.1677127800499703e+00, /* 0xc014abbc 0xe625be53 */ +s2 = 2.5501640398773415e+00, /* 0x400466bc 0x6775aad9 */ +s3 = -5.9926452932029806e-01, /* 0xbfe32d2c 0xce62ac24 */ +s4 = 8.2145886580065136e-02, /* 0x3fb50783 0x485cc006 */ +s5 = -7.3704298849218506e-03, /* 0xbf7e3074 0xb502de6a */ +s6 = 4.6628273194622205e-04, /* 0x3f3e8eed 0x159b24d7 */ +s7 = -2.1717412527405332e-05; /* 0xbef6c5b9 0x3995dfbe */ + +static inline double +__kernel_sinpi(double x) +{ + double hi, lo, sm, xhi, xlo; + uint32_t hx, lx; +#if HORNER + double x2; + /* Horner's method. */ + x2 = x * x; + sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3) + + s2) + s1; +#else + double x2, x4; + /* Sort of Estrin's method. */ + x2 = x * x; + x4 = x2 * x2; + sm = (s5 + s6 * x2 + s7 * x4) * (x4 * x4) + + (s2 + s3 * x2 + s4 * x4) * x2 + s1; +#endif + EXTRACT_WORDS(hx, lx, x); + INSERT_WORDS(xhi, hx, 0); + xlo = x - xhi; + lo = xlo * (x + xhi) * sm; + hi = xhi * xhi * sm; + lo = x * lo + xlo * (s0hi + s0lo) + x * hi; + + return (lo + xhi * s0lo + xhi * s0hi); +} Property changes on: lib/msun/src/k_sinpi.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/k_sinpif.c =================================================================== --- lib/msun/src/k_sinpif.c (nonexistent) +++ lib/msun/src/k_sinpif.c (working copy) @@ -0,0 +1,56 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/k_sinpi.c for implementation details. + */ + +static const float +s0hi = 3.14062500e+00f, /* 0x40490000 */ +s0lo = 9.67653585e-04f, /* 0x3a7daa22 */ +s1 = -5.16771269e+00f, /* 0xc0a55de7 */ +s2 = 2.55016255e+00f, /* 0x402335dd */ +s3 = -5.99202096e-01f, /* 0xbf19654f */ +s4 = 8.10018554e-02f; /* 0x3da5e44d */ + +static inline float +__kernel_sinpif(float x) +{ + float hi, lo, sm, x2, xhi, xlo; + uint32_t ix; + + GET_FLOAT_WORD(ix, x); + SET_FLOAT_WORD(xhi, (ix >> 14) << 14); + xlo = x - xhi; + + x2 = x * x; + sm = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; + lo = xlo * (x + xhi) * sm; + hi = xhi * xhi * sm; + lo = x * lo + xlo * (s0hi + s0lo) + x * hi; + + return (lo + xhi * s0lo + xhi * s0hi); +} Property changes on: lib/msun/src/k_sinpif.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/math.h =================================================================== --- lib/msun/src/math.h (revision 317529) +++ lib/msun/src/math.h (working copy) @@ -500,6 +500,15 @@ #if __BSD_VISIBLE long double lgammal_r(long double, int *); +double cospi(double); +float cospif(float); +long double cospil(long double); +double sinpi(double); +float sinpif(float); +long double sinpil(long double); +double tanpi(double); +float tanpif(float); +long double tanpil(long double); #endif __END_DECLS Index: lib/msun/src/s_cospi.c =================================================================== --- lib/msun/src/s_cospi.c (nonexistent) +++ lib/msun/src/s_cospi.c (working copy) @@ -0,0 +1,147 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/** + * cospi(x) computes cos(pi*x) without multiplication by pi (almost). First, + * note that cospi(-x) = cospi(x), so the algorithm considers only |x|. The + * method used depends on the magnitude of x. + * + * 1. For small |x|, cospi(x) = 1 with FE_INEXACT raised where a sloppy + * threshold is used. The threshold is |x| < 0x1pN with N = -(P/2+M). + * P is the precision of the floating-point type and M = 3 or 4. + * + * 2. For |x| < 1, argument reduction is not required and cospi(x) is + * computed by polynomial approximations in the the interval [0,0.25]. + * See k_sinpi.c and k_cospi.c for the details for these polynomial + * approximations. + * + * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where + * |x| = j0 + r with j0 an integer and the remainder r statisfies + * 0 <= r < 1. With the given domain, a simplified inline floor(x) + * is used. + * + * cospi(x) = cos(pi*(j0+r)) + * = cos(pi*j0) * cos(pi*r) - sin(pi*j0) * sin(pi*r) + * = cos(pi*j0) * cos(pi*r) + * = +-cospi(r) + * + * If j0 is even, then cos(pi*j0) = 1. If j0 is odd, then cos(pi*j0) = -1. + * cospi(r) is then computed via polynomial approximation. + * + * 4. For |x| >= 0x1p(P-1), |x| is integral and cospi(x) = 1. + * + * 5. Special cases: + * + * cospi(+-0) = 1. + * cospi(n.5) = 0 for n an integer. + * cospi(+-inf) = nan. Raises the "invalid" floating-point exception. + * cospi(nan) = nan. Raises the "invalid" floating-point exception. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospi.c" +#include "k_sinpi.c" + +/* + * To compute cospi(x) for 0 < x < 1, use trignometric identities to map + * cospi(x) into the [0,0.25] quadrant of sinpi(r) or cospi(r). + */ +static inline double +__compute_cospi(double x) +{ + + if (x < 0.5) { + if (x < 0.25) + return (__kernel_cospi(x)); + return (__kernel_sinpi(0.5 - x)); + } else { + if (x < 0.75) + return (-__kernel_sinpi(x - 0.5)); + return (-__kernel_cospi(1 - x)); + } +} + +double +cospi(double x) +{ + volatile static const double vzero = 0; + static const double huge = 1e+300; + double ax; + uint32_t hx, ix, j0, lx; + + EXTRACT_WORDS(hx, lx, x); + ix = hx & 0x7fffffff; + INSERT_WORDS(ax, ix, lx); + + if (ix < 0x3ff00000) { /* |x| < 1 */ + if (ix < 0x3e200000) { /* |x| < 0x1p-29 */ + if (huge + x > 0) + return (1); + } + if ((ix | lx) == 0x3fe00000) + return (0); + return (__compute_cospi(ax)); + } + + if (ix < 0x43300000) { /* 1 <= |x| < 0x1p52 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 20) & 0x7ff) - 0x3ff; + if (j0 < 20) { + ix &= ~(0x000fffff >> j0); + lx = 0; + } else { + lx &= ~((uint32_t)0xffffffff >> (j0 - 20)); + } + INSERT_WORDS(x, ix, lx); + + ax -= x; + if (ax == 0.5) + return (0); + + ax = ax == 0 ? 1 : __compute_cospi(ax); + + if (j0 > 30) x -= 0x1p30; + j0 = (uint32_t)x; + + return (j0 & 1 ? -ax : ax); + } + + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p52 is always an even integer, so return 1. + * FIXME: should this raise FE_INEXACT or FE_INVALID? + */ + if (ax + 1 > 1) + ax = 1; + return (ax); +} + +#if LDBL_MANT_DIG == 53 +__weak_reference(cospi, cospil); +#endif Property changes on: lib/msun/src/s_cospi.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/s_cospif.c =================================================================== --- lib/msun/src/s_cospif.c (nonexistent) +++ lib/msun/src/s_cospif.c (working copy) @@ -0,0 +1,99 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_cospi.c for implementation details. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospif.c" +#include "k_sinpif.c" + +static inline float +__compute_cospif(float x) +{ + + if (x <= 0.5f) { + if (x <= 0.25f) + return (__kernel_cospif(x)); + return (__kernel_sinpif(0.5f - x)); + } else { + if (x <= 0.75f) + return (-__kernel_sinpif(x - 0.5f)); + return (-__kernel_cospif(1 - x)); + } +} + +float +cospif(float x) +{ + volatile static const float vzero = 0; + static const float huge = 1e30; + float ax; + uint32_t ix, j0; + + GET_FLOAT_WORD(ix, x); + ix = ix & 0x7fffffff; + SET_FLOAT_WORD(ax, ix); + + if (ix < 0x3f800000) { /* |x| < 1 */ + if (ix < 0x39000000) { /* |x| < 0x1p-13 */ + if (huge + ax > 0) /* Raise inexact iff != 0. */ + return (1); + } + return (__compute_cospif(ax)); + } + + if (ix < 0x4b000000) { /* 1 <= |x| < 0x1p23 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 23) & 0xff) - 0x7f; + ix &= ~(0x007fffff >> j0); + SET_FLOAT_WORD(x, ix); + + ax -= x; + if (ax == 0.5f) + return (0); + + ax = ax == 0 ? 1 : __compute_cospif(ax); + + ix = (uint32_t)x; + + return (ix & 1 ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p23 is always an even integer, so return 1. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + if (ax + 1 > 1) + ax = 1; + return (ax); +} Property changes on: lib/msun/src/s_cospif.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/s_sinpi.c =================================================================== --- lib/msun/src/s_sinpi.c (nonexistent) +++ lib/msun/src/s_sinpi.c (working copy) @@ -0,0 +1,154 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/** + * sinpi(x) computes sin(pi*x) without multiplication by pi (almost). First, + * note that sinpi(-x) = -sinpi(x), so the algorithm considers only |x| and + * includes reflection symmetry by considering the sign of x on output. The + * method used depends on the magnitude of x. + * + * 1. For small |x|, sinpi(x) = pi * x where a sloppy threshold is used. The + * threshold is |x| < 0x1pN with N = -(P/2+M). P is the precision of the + * floating-point type and M = 3 or 4. To achieve high accuracy, pi is + * decomposed into high and low parts with the high part containing a + * number of trailing zero bits. x is also split into high and low parts. + * + * 2. For |x| < 1, argument reduction is not required and sinpi(x) is + * computed by polynomial approximations in the the interval [0,0.25]. + * See k_sinpi.c and k_cospi.c for the details for these polynomial + * approximations. + * + * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where + * |x| = j0 + r with j0 an integer and the remainder r satisifies + * 0 <= r < 1. With the given domain, a simplified inline floor(x) + * is used. + * + * sinpi(x) = sin(pi*(j0+r)) + * = sin(pi*j0) * cos(pi*r) + cos(pi*j0) * sin(pi*r) + * = cos(pi*j0) * sin(pi*r) + * = +-sinpi(r) + * + * If j0 is even, then cos(pi*j0) = 1. If j0 is odd, then cos(pi*j0) = -1. + * sinpi(r) is then computed via polynomial approximation. + * + * 4. For |x| >= 0x1p(P-1), |x| is integral and sinpi(x) = copysign(0,x). + * + * 5. Special cases: + * + * sinpi(+-0) = +-0 + * sinpi(+-n) = +-0, for positive integers n. + * sinpi(+-inf) = nan. Raises the "invalid" floating-point exception. + * sinpi(nan) = nan. Raises the "invalid" floating-point exception. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospi.c" +#include "k_sinpi.c" + +static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */ +static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */ + +/* + * To compute sinpi(x) for 0 < x < 1, use trignometric identities to map + * sinpi(x) into the [0,0.25] quadrant of sinpi(r) or cospi(r). + */ +static inline double +__compute_sinpi(double x) +{ + + if (x < 0.5) { + if (x < 0.25) + return (__kernel_sinpi(x)); + return (__kernel_cospi(0.5 - x)); + } else { + if (x < 0.75) + return (__kernel_cospi(x - 0.5)); + return (__kernel_sinpi(1 - x)); + } +} + +double +sinpi(double x) +{ + volatile static const double vzero = 0; + double ax; + uint32_t hx, ix, j0, lx; + + EXTRACT_WORDS(hx, lx, x); + ix = hx & 0x7fffffff; + INSERT_WORDS(ax, ix, lx); + + if (ix < 0x3ff00000) { /* |x| < 1 */ + if (ix < 0x3e300000) { /* |x| < 0x1p-28 */ + if (x == 0) + return (x); + INSERT_WORDS(ax, hx, 0); + x -= ax; + return (pilo * x + pihi * x + pilo * ax + pihi * ax); + } + ax = __compute_sinpi(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + if (ix < 0x43300000) { /* 1 <= |x| < 0x1p52 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 20) & 0x7ff) - 0x3ff; + if (j0 < 20) { + ix &= ~(0x000fffff >> j0); + lx = 0; + } else { + lx &= ~((uint32_t)0xffffffff >> (j0 - 20)); + } + INSERT_WORDS(x, ix, lx); + + ax -= x; + if (ax == 0) { + ax = 0; + } else { + ax = __compute_sinpi(ax); + if (j0 > 30) x -= 0x1p30; + lx = (uint32_t)x; + if (lx & 1) ax = -ax; + } + return ((hx & 0x80000000) ? -ax : ax); + } + + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p52 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID? + */ + if (ax + 1 > 1) + ax = copysign(0, x); + return (ax); +} + +#if LDBL_MANT_DIG == 53 +__weak_reference(sinpi, sinpil); +#endif Property changes on: lib/msun/src/s_sinpi.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/s_sinpif.c =================================================================== --- lib/msun/src/s_sinpif.c (nonexistent) +++ lib/msun/src/s_sinpif.c (working copy) @@ -0,0 +1,105 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_sinpi.c for implementation details. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospif.c" +#include "k_sinpif.c" + +static inline float +__compute_sinpif(float x) +{ + + if (x <= 0.5f) { + if (x <= 0.25f) + return (__kernel_sinpif(x)); + return (__kernel_cospif(0.5f - x)); + } else { + if (x <= 0.75f) + return (__kernel_cospif(x - 0.5f)); + return (__kernel_sinpif(1 - x)); + } +} + +static const float pihi = 3.14160156e+00f; /* 0x40491000 */ +static const float pilo = -8.90890988e-06f; /* 0xb715777a */ + +float +sinpif(float x) +{ + volatile static const float vzero = 0; + float ax; + uint32_t hx, ix, j0; + + GET_FLOAT_WORD(hx, x); + ix = hx & 0x7fffffff; + SET_FLOAT_WORD(ax, ix); + + if (ix < 0x3f800000) { /* |x| < 1 */ + if (ix < 0x38000000) { /* |x| < 0x1p-15 */ + if (x == 0) + return (x); + SET_FLOAT_WORD(ax, (hx >> 13) << 13); + x -= ax; + return (pilo * x + pihi * x + pilo * ax + pihi * ax); + } + ax = __compute_sinpif(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + if (ix < 0x4b000000) { /* 1 <= |x| < 0x1p23 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 23) & 0xff) - 0x7f; + ix &= ~(0x007fffff >> j0); + SET_FLOAT_WORD(x, ix); + + ax -= x; + if (ax == 0) { + ax = 0; + } else { + ix = (uint32_t)x; + ax = __compute_sinpif(ax); + if (ix & 1) ax = -ax; + } + return ((hx & 0x80000000) ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p23 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID? + */ + if (ax + 1 > 1) + ax = copysignf(0, x); + return (ax); +} Property changes on: lib/msun/src/s_sinpif.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/s_tanpi.c =================================================================== --- lib/msun/src/s_tanpi.c (nonexistent) +++ lib/msun/src/s_tanpi.c (working copy) @@ -0,0 +1,224 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/** + * tanpi(x) computes tan(pi*x) without multiplication by pi (almost). First, + * note that tanpi(-x) = -tanpi(x), so the algorithm considers only |x| and + * includes reflection symmetry by considering the sign of x on output. The + * method used depends on the magnitude of x. + * + * 1. For small |x|, tanpi(x) = pi * x where a sloppy threshold is used. The + * threshold is |x| < 0x1pN with N = -(P/2+M). P is the precision of the + * floating-point type and M = 3 or 4. To achieve high accuracy, pi is + * decomposed into high and low parts with the high part containing a + * number of trailing zero bits. x is also split into high and low parts. + * + * 2. For |x| < 1, argument reduction is not required and tanpi(x) is + * computed by either a polynomial approximation or the definition of + * tanpi in terms of sinpi and cospi. For |x| in [0,0.25), a polynomial + * approximation is used. For |x| in (0.25, 0.5], tanpi(x) is computed + * by tanpi(x) = cospi(0.5 - x) / sinpi(0.5 - x). + * + * 3. For 1 <= |x| < 0x1p(P-1), argument reduction is required where + * |x| = j0 + r with j0 an integer and the remainder r satisfies + * 0 <= r < 1. With the given domain, a simplified inline floor(x) + * is used. + * tan(pi*j0) + tan(pi*r) + * tanpi(x) = tan(pi*(j0+r)) = ---------------------------- = tanpi(r) + * 1 - tan(pi*j0) * tan(pi*r) + * + * 4. For |x| >= 0x1p(P-1), |x| is integral and tanpi(x) = copysign(0,x). + * + * 5. Special cases: + * + * tanpi(+-0) = +-0 + * tanpi(+-n) = +-0, for positive integers n. + * tanpi(+-n+1/4) = +-1, for positive integers n. + * tanpi(+-n+1/2) = NaN, for positive integers n. + * tanpi(+-inf) = nan. Raises the "invalid" floating-point exception. + * tanpi(nan) = nan. Raises the "invalid" floating-point exception. + * + * Compute tanpi(x) = tan(pi*x) via a polynomial approximation. Coefficients + * were determined for the following function: + * + * f(x) = tan(pi*x) / x = t0 + t1*x**2 + t2*x**4 + ... + * + * with an Remes algorithm. To accumulate the result with sufficient accuracy, + * the t0, t1, and t2 coefficients and final multiplications and additions + * needed to be handled specially. + * + * x2 = x * x + * T(x2) = t3 + t4*x2 + t5*x2**2 + ... + * = t3 + (t4 + (t5 ...) * x2) * x2 + * tanpi(x) = x * (t0 + x2*(t1 + x2*(t2 + x2*T(x2)))) + * + * The final multiplications and additions in the above expression for + * tanpi(x) are done in an extra precision arithmetic. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospi.c" +#include "k_sinpi.c" + +static const double pihi = 3.1415926814079285e+00; /* 0x400921fb 0x58000000 */ +static const double pilo =-2.7818135228334233e-08; /* 0xbe5dde97 0x3dcb3b3a */ + +static const double +t00 = 3.1415926535897931e+00, /* 0x400921fb 0x54442d18 */ +t00lo = 1.2267625636978282e-16, /* 0x3ca1adf4 0x7b8ee56b */ +t01 = 1.0335425560099939e+01, /* 0x4024abbc 0xe625be52 */ +t01lo = -6.0289590484334045e-16, /* 0xbcc5b8bb 0xb4f3d850 */ +t02 = 4.0802624638040442e+01, /* 0x404466bc 0x6775ac7c */ +t02lo = -3.1181465177803440e-15, /* 0xbcec15f4 0xd355f491 */ +t03 = 1.6299995197351416e+02, /* 0x40645fff 0x9b47fa0c */ +t04 = 6.5190975669363422e+02, /* 0x40845f47 0x2e8473cf */ +t05 = 2.6075994007466215e+03, /* 0x40a45f32 0xe4a797e0 */ +t06 = 1.0430393632420044e+04, /* 0x40c45f32 0x628c115e */ +t07 = 4.1720374420664804e+04, /* 0x40e45f0b 0xfb410bc9 */ +t08 = 1.6695720713150888e+05, /* 0x41046169 0xa8349085 */ +t09 = 6.6429163117898861e+05, /* 0x412445c7 0x4329e474 */ +t10 = 2.7802499382503941e+06, /* 0x4145362c 0xf81896c3 */ +t11 = 7.9185101944778729e+06, /* 0x415e34eb 0x8c725352 */ +t12 = 9.3691291294289947e+07, /* 0x41965676 0x6d2d5a58 */ +t13 = -5.0585651903137898e+08, /* 0xc1be26c2 0x07080874 */ +t14 = 6.8772398519624548e+09, /* 0x41f99ea5 0xa2bf6637 */ +t15 = -3.3094464588263161e+10, /* 0xc21ed255 0xd1310d7a */ +t16 = 1.1673794006564838e+11; /* 0x423b2e1f 0x9a61a5fc */ + +static inline double +__kernel_tanpi(double x) +{ + double hi, lo, t, xhi, xlo; + uint32_t hx, lx; + + if (x < 0.25) { +#if HORNER + double x2; + /* Horner's method. */ + x2 = x * x; + t = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * (x2 + * (x2 * (x2 * t16 + t15) + t14) + t13) + t12) + t11) + + t10) + t09) + t08) + t07) + t06) + t05) + t04; +#else + double a, b, c, d, x2, x4, x6; + /* Sort of Estrin's method. */ + x2 = x * x; + x4 = x2 * x2; + x6 = x4 * x2; + a = (t16 * x4 + t15 * x2 + t14) * x6; + b = (t13 * x4 + t12 * x2 + t11); + c = (t10 * x4 + t09 * x2 + t08) * x6; + d = (t07 * x4 + t06 * x2 + t05); + t = (a + b) * x6 * (x4 * x4) + (c + d) * x2 + t04; +#endif + + EXTRACT_WORDS(hx, lx, x); + INSERT_WORDS(xhi, hx, 0); + xlo = x - xhi; + xlo *= (xlo + xhi + xhi); + xhi *= xhi; + + t = t02lo + x2 * (x2 * t + t03) + t02; + t = t01lo + xlo * t + xhi * t + t01; + t = t00lo + xlo * t + xhi * t + t00; + return (t * x); + } else if (x > 0.25) { + x = 0.5 - x; + t = __kernel_cospi(x) / __kernel_sinpi(x); + } else + t = 1; + return (t); +} + +static inline double +__compute_tanpi(double x) +{ + + return (x < 0.5 ? __kernel_tanpi(x) : -__kernel_tanpi(1 - x)); +} + +double +tanpi(double x) +{ + volatile static const double vzero = 0; + double ax; + uint32_t hx, ix, j0, lx; + + EXTRACT_WORDS(hx, lx, x); + ix = hx & 0x7fffffff; + INSERT_WORDS(ax, ix, lx); + + if (ix < 0x3ff00000) { /* |x| < 1 */ + if (ix < 0x3e200000) { /* |x| < 0x1p-29 */ + if (x == 0) + return (x); + INSERT_WORDS(ax, hx, 0); + x -= ax; + return (pilo * x + pihi * x + pilo * ax + pihi * ax); + } + if (ix == 0x3fe00000 && !lx) + return ((x - x) / (x - x)); + ax = __compute_tanpi(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + if (ix < 0x43300000) { /* 1 <= |x| < 0x1p52 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 20) & 0x7ff) - 0x3ff; + if (j0 < 20) { + ix &= ~(0x000fffff >> j0); + lx = 0; + } else { + lx &= ~(((uint32_t)(0xffffffff)) >> (j0 - 20)); + } + INSERT_WORDS(x,ix,lx); + + ax -= x; + + if (ax == 0.5) + return ((x - x) / (x - x)); + + ax = ax == 0 ? 0 : __compute_tanpi(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p52 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID? + */ + if (ax + 1 > 1) + ax = copysign(0, x); + return (ax); +} + +#if LDBL_MANT_DIG == 53 +__weak_reference(tanpi, tanpil); +#endif Property changes on: lib/msun/src/s_tanpi.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: lib/msun/src/s_tanpif.c =================================================================== --- lib/msun/src/s_tanpif.c (nonexistent) +++ lib/msun/src/s_tanpif.c (working copy) @@ -0,0 +1,141 @@ +/*- + * Copyright (c) 2017 Steven G. Kargl + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice unmodified, this list of conditions, and the following + * disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR + * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. + * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, + * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT + * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF + * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + */ + +/* + * See ../src/s_tanpi.c for implementation details. + */ + +#include "math.h" +#include "math_private.h" +#include "k_cospif.c" +#include "k_sinpif.c" + +static const float pihi = 3.14160156e+00f; /* 0x40491000 */ +static const float pilo = -8.90890988e-06f; /* 0xb715777a */ + +static const float +t0 = 3.14159274e+00f, /* 0x40490fdb */ +t0lo = -8.71231194e-08f, /* 0xb3bb1871 */ +t1 = 1.03354244e+01f, /* 0x41255de6 */ +t1lo = 3.83117083e-07f, /* 0x34cdaf36 */ +t2 = 4.08029366e+01f, /* 0x42233635 */ +t2lo = -1.28256147e-07f, /* 0xb409b6c8 */ +t3 = 1.62950455e+02f, /* 0x4322f351 */ +t3lo = 4.45962769e-06f, /* 0x3695a3e9 */ +t4 = 6.55820984e+02f, /* 0x4423f48b */ +t5 = 2.43586987e+03f, /* 0x45183deb */ +t6 = 1.47717539e+04f, /* 0x4666cf04 */ +t7 = -1.96741250e+04f, /* 0xc699b440 */ +t8 = 5.87398688e+05f; /* 0x490f686b */ + +static inline float +__kernel_tanpif(float x) +{ + float hi, lo, t, x2, xhi, xlo; + uint32_t ix; + + if (x < 0.25f) { + x2 = x * x; + t = t3lo + x2 * (x2 * (x2 * (x2 * (x2 * t8 + t7) + t6) + t5) + + t4) + t3; + + GET_FLOAT_WORD(ix, x); + SET_FLOAT_WORD(xhi, (ix >> 14) << 14); + xlo = x - xhi; + xlo *= (xlo + xhi + xhi); + xhi *= xhi; + + t = t2lo + xlo * t + xhi * t + t2; + t = t1lo + xlo * t + xhi * t + t1; + t = t0lo + xlo * t + xhi * t + t0; + t *= x; + } else if (x > 0.25f) { + x = 0.5f - x; + t = __kernel_cospif(x) / __kernel_sinpif(x); + } else + t = 1; + return (t); +} + +static inline float +__compute_tanpif(float x) +{ + + return (x < 0.5f ? __kernel_tanpif(x) : -__kernel_tanpif(1 - x)); +} + +float +tanpif(float x) +{ + volatile static const float vzero = 0; + float ax; + uint32_t hx, ix, j0; + + GET_FLOAT_WORD(hx, x); + ix = hx & 0x7fffffff; + SET_FLOAT_WORD(ax, ix); + + if (ix < 0x3f800000) { /* |x| < 1 */ + if (ix < 0x38800000) { /* |x| < 0x1p-14 */ + if (x == 0) + return (x); + SET_FLOAT_WORD(ax, (hx >> 13) << 13); + x -= ax; + return (pilo * x + pihi * x + pilo * ax + pihi * ax); + } + if (ix == 0x3f000000) + return ((x - x) / (x - x)); + ax = __compute_tanpif(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + if (ix < 0x4b000000) { /* 1 <= |x| < 0x1p23 */ + /* Determine integer part of ax. */ + j0 = ((ix >> 23) & 0xff) - 0x7f; + ix &= ~(0x007fffff >> j0); + SET_FLOAT_WORD(x, ix); + + ax -= x; + + if (ax == 0.5f) + return ((x - x) / (x - x)); + + ax = ax == 0 ? 0 : __compute_tanpif(ax); + return ((hx & 0x80000000) ? -ax : ax); + } + + /* x = +-inf or nan. */ + if (ix >= 0x7f800000) + return (vzero / vzero); + + /* + * |x| >= 0x1p23 is always an integer, so return +-0. + * FIXME: should this raise FE_INEXACT or FE_INVALID. + */ + if (ax + 1 > 1) + ax = copysignf(0, x); + return (ax); +} Property changes on: lib/msun/src/s_tanpif.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property From owner-freebsd-numerics@freebsd.org Fri Apr 28 18:40:26 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A38CED53A4E; Fri, 28 Apr 2017 18:40:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 523BBBBF; Fri, 28 Apr 2017 18:40:25 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 82D483C5E0A; Sat, 29 Apr 2017 04:40:18 +1000 (AEST) Date: Sat, 29 Apr 2017 04:40:14 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: Bruce Evans , freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170428165658.GA17560@troutmask.apl.washington.edu> Message-ID: <20170429035131.E3406@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=baPn34At2Hashz5YGC0A:9 a=-D1BXJ0ZKV1LJ0pY:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 18:40:26 -0000 On Fri, 28 Apr 2017, Steve Kargl wrote: > On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote: >> On Thu, 27 Apr 2017, Steve Kargl wrote: >>> For those curious about testing, here are some numbers >>> for the Estrin's method. These were obtained on an AMD >>> FX(tm)-8350 @ 4018.34-MHz. The times are in microseconds >>> and the number in parentheses are estimated cycles. >>> >>> | cospi | sinpi | tanpi >>> ------------+--------------+--------------+-------------- >>> float | 0.0089 (37) | 0.0130 (54) | 0.0194 (80) >>> double | 0.0134 (55) | 0.0160 (66) | 0.0249 (102) >>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393) >>> ------------+--------------+--------------+-------------- >>> >>> The timing tests are done on the interval [0,0.25] as >>> this is the interval where the polynomial approximations >>> apply. Limited accuracy testing gives >> >> These still seem slow. I did a quick test of naive evaluations of >> these functions as standard_function(Pi * x) and get times a bit faster >> on Haswell, except 2-4 times faster for the long double case, with the >> handicaps of using gcc-3.3.3 and i386. Of course, the naive evaluation >> is inaccurate, especially near multiples of Pi/2. > > The slowness comes from handling extra precision arithmetic. For > sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial > approximation and x2 = x * x. One needs to take care in the ordering > on the evaluation where x and x2 are split into high and low parts. > See the source code posted in response to your other email. See logl() for how to do this almost as well as possible using the 2sum primitives. Don't forget to test on i386 with i387, where you will need lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t as in my uncomitted logf() and logl()i or more relevantly clog() (cplex.c) to avoid problems with extra precision (2sum requires this). Calculating the polynomial as x * p(x) seems wrong. That is too inaccurate even for sin(x), and probably slower too. __kernel_sin(x) uses x + x2*S(x2), with complications except in the first quadrant needed to improve accuracy. The arrangement of the second term is dubious too, even for sin(x). It unimproves acuracy but is apparently used because it is faster. I would try S(x2) * x4 + x2 * (x * S3) + x, where x2 * (x * S3) + x must all be evaluated in extra precision. >>> x in [0,0.25] | tanpif | tanpi | tanpil >>> -----------------+------------+------------+------------- >>> MAX ULP | 1.37954760 | 1.37300168 | 1.38800823 >> >> Just use the naive evaluation to get similar errors in this >> range. It is probably faster too. > > I haven't checked tanpif, but naive evaluation is faster > for sinpif(x). But getting the worry answer fast seems > to be counter to what one may expect from a math library. > Consider the two naive implementations: But the above is inaccurate too. > float > bde1(float x) > { > return (sinf((float)M_PI * x)); > } sinpif() with range reduction in float precision and multiplication by M_PI and __kernel_sin() to finish off would be much faster and more accurate than doing everything in float precision. On i386/i387, it could be arranged that naive users get M_PI and __kernel_sin() almost automatically. This is essentially what is already done for sinf(). Similarly for sinpi() on arches with extra precision for double. > float > bde2(float x) > { > return (sinf((float)(M_PI * x))); > } > > float > bde3(float x) > { > return ((float)sin(M_PI * x)); > } > > x in [0,0.25] | sinpif | bde1(x) | bde2(x) | bde3(x) > -----------------+-------------+-------------+--------------+------------ > Time usec (cyc) | 0.0130 (54) | 0.0084 (35) | 0.0093 (38) | 0.0109 (45) > MAX ULP | 0.80103672 | 1.94017851 | 1.46830785 | 0.5 > 1.0 < ULP | 0 | 736496 | 47607 | 0 > 0.9 < ULP <= 1.0 | 0 | 563122 | 101162 | 0 > 0.8 < ULP <= 0.9 | 1 | 751605 | 386128 | 0 > 0.7 < ULP <= 0.8 | 727 | 942349 | 753647 | 0 > 0.6 < ULP <= 0.7 | 19268 | 1169719 | 1123518 | 0 > > x in [3990,3992] | sinpif | bde1(x) | bde2(x) | bde3(x) > -----------------+-------------+-------------+-------------+------------ > Time usec (cyc) | 0.0161 (66) | 0.0137 (56) | 0.0144 (59) | 0.0211 (87) > MAX ULP | 0.65193975 | 10458803. | 6471461. | 0.5 > 1.0 < ULP | 0 | 16773117 | 16762878 | 0 > 0.9 < ULP <= 1.0 | 0 | 0 | 0 | 0 > 0.8 < ULP <= 0.9 | 0 | 0 | 0 | 0 > 0.7 < ULP <= 0.8 | 0 | 0 | 0 | 0 > 0.6 < ULP <= 0.7 | 19268 | 2047 | 2047 | 0 > > So bde3(x) is best, but if you're going to use sin(M_PI*x) for > float sinpif, then use sinpi((double)x) which is faster than bde3 > and just as accurate. That is a naive version, with explicit pesssimation and destruction of extra precision by casting to float. A fast version would do return (__kernel_sindf(M_PI * x)) after reducing the range of x (switch to +- and cosf for other quadrants). Perhaps optimize the range reduction for small quadrant numbers for sinf(). It is easier to subtract 1 from x than to subtract Pi/2 to adjust the quadrant, so this optimization might not be so useful. Bruce From owner-freebsd-numerics@freebsd.org Fri Apr 28 20:15:24 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3B16CD549B6; Fri, 28 Apr 2017 20:15:24 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 170831269; Fri, 28 Apr 2017 20:15:24 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SKFMnx033229 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 28 Apr 2017 13:15:22 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SKFMZ8033228; Fri, 28 Apr 2017 13:15:22 -0700 (PDT) (envelope-from sgk) Date: Fri, 28 Apr 2017 13:15:22 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428201522.GA32785@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170429035131.E3406@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 20:15:24 -0000 On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote: > On Fri, 28 Apr 2017, Steve Kargl wrote: > > > On Fri, Apr 28, 2017 at 07:13:16PM +1000, Bruce Evans wrote: > >> On Thu, 27 Apr 2017, Steve Kargl wrote: > >>> For those curious about testing, here are some numbers > >>> for the Estrin's method. These were obtained on an AMD > >>> FX(tm)-8350 @ 4018.34-MHz. The times are in microseconds > >>> and the number in parentheses are estimated cycles. > >>> > >>> | cospi | sinpi | tanpi > >>> ------------+--------------+--------------+-------------- > >>> float | 0.0089 (37) | 0.0130 (54) | 0.0194 (80) > >>> double | 0.0134 (55) | 0.0160 (66) | 0.0249 (102) > >>> long double | 0.0557 (228) | 0.0698 (287) | 0.0956 (393) > >>> ------------+--------------+--------------+-------------- > >>> > >>> The timing tests are done on the interval [0,0.25] as > >>> this is the interval where the polynomial approximations > >>> apply. Limited accuracy testing gives > >> > >> These still seem slow. I did a quick test of naive evaluations of > >> these functions as standard_function(Pi * x) and get times a bit faster > >> on Haswell, except 2-4 times faster for the long double case, with the > >> handicaps of using gcc-3.3.3 and i386. Of course, the naive evaluation > >> is inaccurate, especially near multiples of Pi/2. > > > > The slowness comes from handling extra precision arithmetic. For > > sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial > > approximation and x2 = x * x. One needs to take care in the ordering > > on the evaluation where x and x2 are split into high and low parts. > > See the source code posted in response to your other email. > > See logl() for how to do this almost as well as possible using the 2sum > primitives. Don't forget to test on i386 with i387, where you will need > lots of slow STRICT_ASSIGN()s or careful use of float_t and double_t > as in my uncomitted logf() and logl()i or more relevantly clog() (cplex.c) > to avoid problems with extra precision (2sum requires this). > > Calculating the polynomial as x * p(x) seems wrong. It starts life as the Remes approximation for p(x) = sin(pi*x) / x. Yes, I also looked into using p(x) = sin(pi*x) / (pi*x). At some point, one must do sinpi(x) = sin(pi*x) = x * p(x). One could do, as is done with sin(x), f(x) = (sin(pi*x) - pi*x)/(pi*x)**3 then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds the complication that pi*x must be computed in extra precision. For float, with the algorithm I have developed, one can simply do everything in double and then cast the result to float. In fact, my __kernel_sinpif was initially written in double with a max ULP less than 0.55. I then looked at what portions of the polynomial could be computed in float without reducing the 0.55. This then reveals what needs to be handle specially. But, what is one to do with double and long double functions? > That is too inaccurate even for sin(x), and probably slower too. I'm not sure why you're claiming that it is too inaccurate. Let's look at sinpi(x) and sin(x) on (roughly) the same interval to remove the different precisions involved in sinpif(x) and sinf(x). This is my sinpi(x) troutmask:kargl[221] ./testd -S -m 0.00 -M 0.25 -n 24 MAX ULP: 0.79851085 Total tested: 16777214 0.7 < ULP <= 0.8: 1075 0.6 < ULP <= 0.7: 22722 This is libm sin(x) troutmask:kargl[224] ./testd -S -m 0.00 -M 0.7853981634 -n 24 MAX ULP: 0.72922199 Total tested: 16777214 0.7 < ULP <= 0.8: 36 0.6 < ULP <= 0.7: 12193 The numbers above really aren't too different, and someone that actually understands numerical analysis might be able to take my code and (marginally) improve the numbers. > > float > > bde1(float x) > > { > > return (sinf((float)M_PI * x)); > > } > > sinpif() with range reduction in float precision and multiplication > by M_PI and __kernel_sin() to finish off would be much faster and > more accurate than doing everything in float precision. On i386/i387, Multiplication by M_PI and __kernel_sin() are double precision intermediates. That works for float. What are you going to do for double and long double? One of the benefits of writing sinpif(x) in float precision is that the exacts same algorithm is used in sinpi(x) and sinpil(x). One of these can be exhaustively tested. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Fri Apr 28 22:09:46 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 14850D547CD; Fri, 28 Apr 2017 22:09:46 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id B7D39913; Fri, 28 Apr 2017 22:09:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 437AA104C43F; Sat, 29 Apr 2017 08:09:42 +1000 (AEST) Date: Sat, 29 Apr 2017 08:09:39 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: Bruce Evans , freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170428201522.GA32785@troutmask.apl.washington.edu> Message-ID: <20170429070036.A4005@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=VbSHBBh9 c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=2JjXyQGd_aUmJn-FHg8A:9 a=SzcEAFoTqt-hgAuy:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 22:09:46 -0000 On Fri, 28 Apr 2017, Steve Kargl wrote: > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote: >> On Fri, 28 Apr 2017, Steve Kargl wrote: >>> ... >>> The slowness comes from handling extra precision arithmetic. For >>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial >>> approximation and x2 = x * x. One needs to take care in the ordering >>> on the evaluation where x and x2 are split into high and low parts. >>> See the source code posted in response to your other email. >> ... >> Calculating the polynomial as x * p(x) seems wrong. > > It starts life as the Remes approximation for p(x) = sin(pi*x) / x. > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x). At some > point, one must do sinpi(x) = sin(pi*x) = x * p(x). I forgot that the linear term needs to be multiplied by Pi. Actually, the problem is only in the notation and the above description. s0 looks like a constant, but it is actually Pi*x in extra precision, and no extra precision is needed or done for x2. __kernel_cospi() seems to be buggy. It splits x2, but doesn't use extra precision for calculating x2. Is extra precision really needed for the x2 term? It is like the x3 term for sin*(), with the critical difference that the coeffient is much larger (-1/2 instead of -1/6 before multiplication by pi). This requires extra accuracy for cos() too, but not for the calculating the term. I optimized this by changing a slow fdlibm method to essentially 2sum. The x2/2 term is inaccurate but no significant further inaccuracies occur for adding terms. You have Pi*Pi*x2/2 which is more inaccurate. I guess the extra half an ulp accuracy here is too much. You avoid this half an ulp using extra precision, and seem to use lots of extra precision instead of the final 2sum step. Probably much slower. Here is my modified method: x2 = x*x; /* hopefully really don't need extra prec */ (hi, lo) = split(x2); resid = S(x2); hi = C2_hi * hi; lo = C2_lo * hi + C1_hi * lo + C1_lo * lo; lo = lo + resid; #ifdef hopefully_not_needed Re-split hi+lo if lo is too large. #endif return 3sum(1, hi, lo) 3sum is in math_private.h and is supposed to be used instead of magic- looking open code. Here it would reduce to the same expressions as in __kernel_sin(). 1+hi = hi_1 + lo_1; /* by hi-lo decomp. */ return (lo + lo_1 + hi_1); /* safely add the lo's and "carry" */ > One could do, as is done with sin(x), > > f(x) = (sin(pi*x) - pi*x)/(pi*x)**3 > > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds > the complication that pi*x must be computed in extra precision. Isn't that what is done now? Of course pi*x must be evaluated in extra precision, but hopefully no more is needed. The x3 term would be much harder to do all in extra precision than the x2 term for cospi(). >> That is too inaccurate even for sin(x), and probably slower too. > > I'm not sure why you're claiming that it is too inaccurate. Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision. This gives at best an error of half an ulp for the part in parentheses and another half an ulp for the multiplication. The final step must be an addition of a large term with a term at least 2 times smaller so that the error can be controlled (this depends on 1 of the terms being exact). > ... >>> float >>> bde1(float x) >>> { >>> return (sinf((float)M_PI * x)); >>> } >> >> sinpif() with range reduction in float precision and multiplication >> by M_PI and __kernel_sin() to finish off would be much faster and >> more accurate than doing everything in float precision. On i386/i387, > > Multiplication by M_PI and __kernel_sin() are double precision > intermediates. That works for float. That's what I mean by naive code having a chance of working. It takes foot-shooting to reduce M_PI to float. Unfortunately, the API + ABI does reduce (M_PI * x) to float in sinf(M_PI * x) if sinf() is extern. But if sinf() is in hardware, the compiler might not destroy the extra precision before or after using the hardware. > What are you going to > do for double and long double? One of the benefits of writing > sinpif(x) in float precision is that the exacts same algorithm is > used in sinpi(x) and sinpil(x). One of these can be exhaustively > tested. That is about the only benefit of the float functions, so I resist optimizing them using doubles. It would be better to use extra precision automatically, but not force it. Bruce From owner-freebsd-numerics@freebsd.org Fri Apr 28 23:36:00 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2CA5DD55E45; Fri, 28 Apr 2017 23:36:00 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 146D41D75; Fri, 28 Apr 2017 23:36:00 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3SNZqXN034900 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 28 Apr 2017 16:35:52 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3SNZqtZ034899; Fri, 28 Apr 2017 16:35:52 -0700 (PDT) (envelope-from sgk) Date: Fri, 28 Apr 2017 16:35:52 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170428233552.GA34580@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170429070036.A4005@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 28 Apr 2017 23:36:00 -0000 On Sat, Apr 29, 2017 at 08:09:39AM +1000, Bruce Evans wrote: > On Fri, 28 Apr 2017, Steve Kargl wrote: > > > On Sat, Apr 29, 2017 at 04:40:14AM +1000, Bruce Evans wrote: > >> On Fri, 28 Apr 2017, Steve Kargl wrote: > >>> ... > >>> The slowness comes from handling extra precision arithmetic. For > >>> sinpi(x) = x * p(x) where p(x) = s0 + x2 * S(x2) is the polynomial > >>> approximation and x2 = x * x. One needs to take care in the ordering > >>> on the evaluation where x and x2 are split into high and low parts. > >>> See the source code posted in response to your other email. > >> ... > >> Calculating the polynomial as x * p(x) seems wrong. > > > > It starts life as the Remes approximation for p(x) = sin(pi*x) / x. > > Yes, I also looked into using p(x) = sin(pi*x) / (pi*x). At some > > point, one must do sinpi(x) = sin(pi*x) = x * p(x). > > I forgot that the linear term needs to be multiplied by Pi. > > Actually, the problem is only in the notation and the above description. > s0 looks like a constant, but it is actually Pi*x in extra precision, > and no extra precision is needed or done for x2. I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. static const float s0hi = 3.14062500e+00f, /* 0x40490000 */ s0lo = 9.67653585e-04f, /* 0x3a7daa22 */ s1 = -5.16771269e+00f, /* 0xc0a55de7 */ s2 = 2.55016255e+00f, /* 0x402335dd */ s3 = -5.99202096e-01f, /* 0xbf19654f */ s4 = 8.10018554e-02f; /* 0x3da5e44d */ static inline float __kernel_sinpif(float x) { double s; float p, x2; x2 = x * x; p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; s = x * (x2 * (double)p + s0hi + s0lo); return ((float)s); } x2 and p are computed in float but is accumulates the final result in double. Using a higher precision works for float, but double and long double becomes problematic. > __kernel_cospi() seems to be buggy. It splits x2, but doesn't use > extra precision for calculating x2. Is extra precision really needed > for the x2 term? Yes. The last part of __kernel_cospif() is a fused-multiple and add. For small |x2|, we can do a normal summation. That's this block of code GET_FLOAT_WORD(ix, x2); if (ix < 0x3c800000) /* |x2| < 0x1p-6 */ return (c0 + c * x2); If |x2| > 0x1p-6, then c0 + c * x2 is computed with a FMA, but its a little trickier. This splits c in clo and chi, and adds in a correction for c1 GET_FLOAT_WORD(ix, c); SET_FLOAT_WORD(chi, (ix >> 14) << 14); clo = c - chi + c1lo; In the final expression, c = clo * x2lo + chi * x2lo + clo * x2hi + (chi * x2hi + c0); chi * c2hi is exact, and the parenthesis are required to achieve max ULP < ~0.88. The grouping of the other terms does not seem to matter. > It is like the x3 term for sin*(), with the critical > difference that the coeffient is much larger (-1/2 instead of -1/6 > before multiplication by pi). This requires extra accuracy for cos() > too, but not for the calculating the term. I optimized this by changing > a slow fdlibm method to essentially 2sum. The x2/2 term is inaccurate > but no significant further inaccuracies occur for adding terms. You > have Pi*Pi*x2/2 which is more inaccurate. I guess the extra half an > ulp accuracy here is too much. You avoid this half an ulp using extra > precision, and seem to use lots of extra precision instead of the final > 2sum step. Probably much slower. > > Here is my modified method: > > x2 = x*x; /* hopefully really don't need extra prec */ > (hi, lo) = split(x2); > resid = S(x2); > hi = C2_hi * hi; > lo = C2_lo * hi + C1_hi * lo + C1_lo * lo; > lo = lo + resid; > #ifdef hopefully_not_needed > Re-split hi+lo if lo is too large. > #endif > return 3sum(1, hi, lo) > > 3sum is in math_private.h and is supposed to be used instead of magic- > looking open code. I completely missed 3sum in math_private.h. I did try using your 2sum and 2sumF macros, but did not recall seeing any difference from my hand-rolled code. I probably didn't spend enough time looking into if I was using 2sum[F] properly. I'll see if I can adapt my code to use these macros. > Here it would reduce to the same expressions as in > __kernel_sin(). > > 1+hi = hi_1 + lo_1; /* by hi-lo decomp. */ > return (lo + lo_1 + hi_1); /* safely add the lo's and "carry" */ > > > One could do, as is done with sin(x), > > > > f(x) = (sin(pi*x) - pi*x)/(pi*x)**3 > > > > then sinpi(x) = pi*x + (pi*x)**3 * f(x), but this then adds > > the complication that pi*x must be computed in extra precision. > > Isn't that what is done now? Of course pi*x must be evaluated in > extra precision, but hopefully no more is needed. The x3 term > would be much harder to do all in extra precision than the x2 > term for cospi(). Well, yes, I supposed that's one way to look at it. In the approximation of sinpif(x) = x * (s0 + x2 * S(x2)), the s0 term is nearly pi. The s1 coefficient, as determined by Remes, is s1 = -5.16771269, which happens to be close to -pi**3/6 = -5.16771278 (i.e, next term in McClauren series of sin(pi*x) = pi*x - (pi**3/6)*x**3 + .... So, the Remes coefficients have absorbed the factors of pi. > >> That is too inaccurate even for sin(x), and probably slower too. > > > > I'm not sure why you're claiming that it is too inaccurate. > > Writing sin(x) as x * (1 + C2*x*x + ...) without extra precision. > This gives at best an error of half an ulp for the part in parentheses > and another half an ulp for the multiplication. The final step > must be an addition of a large term with a term at least 2 times > smaller so that the error can be controlled (this depends on 1 > of the terms being exact). Ah, okay, I understand what you're getting at. I played a number of games with grouping intermediate results and trying to accumulate high and low parts. You're seeing the best that I can do. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Sat Apr 29 00:59:26 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B511FD52424; Sat, 29 Apr 2017 00:59:26 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 7F5581ABD; Sat, 29 Apr 2017 00:59:26 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3T0xPk4037996 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Fri, 28 Apr 2017 17:59:25 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3T0xOBe037995; Fri, 28 Apr 2017 17:59:24 -0700 (PDT) (envelope-from sgk) Date: Fri, 28 Apr 2017 17:59:24 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170429005924.GA37947@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170428233552.GA34580@troutmask.apl.washington.edu> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 00:59:26 -0000 On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote: > > I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. > > static const float > s0hi = 3.14062500e+00f, /* 0x40490000 */ > s0lo = 9.67653585e-04f, /* 0x3a7daa22 */ > s1 = -5.16771269e+00f, /* 0xc0a55de7 */ > s2 = 2.55016255e+00f, /* 0x402335dd */ > s3 = -5.99202096e-01f, /* 0xbf19654f */ > s4 = 8.10018554e-02f; /* 0x3da5e44d */ > > static inline float > __kernel_sinpif(float x) > { > double s; > float p, x2; > x2 = x * x; > p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; > s = x * (x2 * (double)p + s0hi + s0lo); > return ((float)s); > } > Well, starting with above and playing the splitting game with x, x2, and p. I've manage to reduce the max ULP in the region testd. Testing against MPFR with sin(pi*x) computed in 5*24-bit precision gives MAX ULP: 0.73345101 Total tested: 33554427 0.7 < ULP <= 0.8: 90 0.6 < ULP <= 0.7: 23948 Exhaustive testing with my older sinpi(x) as the reference gives ./testf -S -m 0x1p-14 -M 0.25 -d -e MAX ULP: 0.73345101 Total tested: 100663296 0.7 < ULP <= 0.8: 45 0.6 < ULP <= 0.7: 11977 The code is slightly slower than my current best kernel. sinpif time is 0.0147 usec per call (60 cycles). static inline float __kernel_sinpif(float x) { float p, phi, x2, x2hi, x2lo, xhi, xlo; uint32_t ix; x2 = x * x; p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; GET_FLOAT_WORD(ix, p); SET_FLOAT_WORD(phi, (ix >> 14) << 14); GET_FLOAT_WORD(ix, x2); SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi; x2hi *= phi; GET_FLOAT_WORD(ix, x); SET_FLOAT_WORD(xhi, (ix >> 14) << 14); xlo = x - xhi; xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo); return (xlo + xhi * x2hi + xhi * s0hi); } -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Sat Apr 29 07:54:36 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4555BD55D2D; Sat, 29 Apr 2017 07:54:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 9E8DBA21; Sat, 29 Apr 2017 07:54:35 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id C542AD69213; Sat, 29 Apr 2017 17:54:25 +1000 (AEST) Date: Sat, 29 Apr 2017 17:54:21 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170429005924.GA37947@troutmask.apl.washington.edu> Message-ID: <20170429151457.F809@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=nNk0bZpkwH5g55qFUBQA:9 a=tF56D53I6aLQB65K:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 07:54:36 -0000 On Fri, 28 Apr 2017, Steve Kargl wrote: > On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote: >> >> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. Comments on this below. This is all rather over-engineered. Optimizing these functions is unimportant comparing with finishing cosl() and sinl() and optimizing all of the standard trig functions better, but we need correctness. But I now see many simplifications and improvements: (1) There is no need for new kernels. The standard kernels already handle extra precision using approximations like: sin(x+y) ~= sin(x) + (1-x*x/2)*y. Simply reduce x and write Pi*x = hi+lo. Then sin(Pi*x) = __kernel_sin(hi, lo, 1). I now see how to do the extra-precision calculations without any multiplications. For Pi*x, suppose that x is near 1/4. Choose a nice constant value x0 as far below 1/4 as works and reduce x further using this: x = x0 + (x - x0) // plain '=' means exact y0 ~= Pi*x0 = hi0 + lo0 // in extra prec at compile time y ~= Pi*(x-x0) // small, so doesn't need extra prec Pi*x ~= hi0 + lo0 + y = hi1 + lo1 // use 2sum sin(Pi*x) ~= __kernel_sin(hi1, lo1, 1). but it is better to use a special polynomial approximation about y0 on y. We have reduced x to the significantly smaller y, so the polynominal can have fewer terms and not need any extra precision except for the additive term. In infinite precision, sin(y+y0) = S0 + S1*y + ... Write S0 = hi2+lo2 and choose x0 so that lo2 can be replaced by 0 without significant loss of accuracy (see some exp2's for this method). At worst, lo2 is nonzero and we have to do just 1 extra-precision operation. Otherwise, we have a normal polyomial evaluation. Unlike for sin() near 0, there are even terms and and not nice fractional coefficients, but since y is small there aren't so many terms. It is important to avoid needing extra precision for many of the coefficients. Higher ones don't need it since y is small, and S0 might not need it. The Intel ia64 math library uses the reduction x = y+y0 at several interval endpoints below Pi/4 for ordinary sin(). Previously, I couldn't get this method to work as well as using the single large interval [-Pi/4, Pi/4], since the even terms take a while to calculate and I didn't understand the hi+lo decomposition methods so well. Reminder on why cos(x) needs extra precison but sin(x) doesn't when we use the brute force expansion up to Pi/4: we use cos(x) = 1 - x*x/2 + ... and the only problem is to subtract these terms accurately enough. There is a problem since Pi/4 is slightly larger than sqrt(1/2), so x*x/2 can be slightly larger than 1/4. If it were larger than 1/2, then the subtraction would lose a full bit to cancelation, so extra inaccuracy could be more than 1 ulp, but we want it to be less 0.5 ulps so that the combined inaccuracy is less than 1 ulp. Similarly, when x*x/2 is larger than 1/4 but below 1/2, the extra inaccuracy can be larger than 0.5 ulps though smaller than 1 ulp. Above 0.5 is still to high, so we need some extra precision. This problem doesn't affect sin(x) since the ratio of the terms is -x*x/6 so the extra inaccuracy is 3 times smaller so below 1/3 ulps. For cos(y+y0) ~= C0 + C1*y + C2*y*y, y must be small enough for the extra inaccuracy from the first addition to be a little smaller tha 0.5 ulps. This is not as easy as I was thinking, since the lineat term is relatively large. But there is no problem for cos(y+1/2): cos(y+1/2) ~= 0.88 - 0.48*y, and the ratio of terms with y = Pi/4-1/2 is ~0.16. However, sin(y+1/2) ~= 0.48 + 0.88*y, so the maximum ratio of terms is ~0.53 -- too much. >> static const float >> s0hi = 3.14062500e+00f, /* 0x40490000 */ >> s0lo = 9.67653585e-04f, /* 0x3a7daa22 */ >> s1 = -5.16771269e+00f, /* 0xc0a55de7 */ >> s2 = 2.55016255e+00f, /* 0x402335dd */ >> s3 = -5.99202096e-01f, /* 0xbf19654f */ >> s4 = 8.10018554e-02f; /* 0x3da5e44d */ >> >> static inline float >> __kernel_sinpif(float x) >> { >> double s; >> float p, x2; >> x2 = x * x; >> p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; >> s = x * (x2 * (double)p + s0hi + s0lo); >> return ((float)s); >> } I don't like the style being completely different. Why lower-case constants? ... > Well, starting with above and playing the splitting game > with x, x2, and p. I've manage to reduce the max ULP in > the region testd. Testing against MPFR with sin(pi*x) > computed in 5*24-bit precision gives It already cheats by using (double)p, but is limited by using float to caclulate p. p should have an inaccuracy of not much more than 1 ulp, and we don't care if this is a little over 1. Then when we do the final caclulations in either extra precision or using the double hack, we don't get any more inaccuracy. Extra precision is not very useful for x2 * p -- it only increases the error by another half an ulp. So errors for that part are 0.5 ulps for x2, ~1 ulp for p and 0.5 for a float multiplication -- 2.0 althogther -- and the the double multiplication reduces this to 1.5. This gets scaled down as usual by the term ratio x*x/6 ~= 0.2. 0.2 of 1.5 = 0.3 is below the magic 0.5, but so is 0.2 of 2.0 = 0.4. But I would prefer below the magic 0.25. > MAX ULP: 0.73345101 > Total tested: 33554427 > 0.7 < ULP <= 0.8: 90 > 0.6 < ULP <= 0.7: 23948 You can even reverse this to calculate the maximum of the term ratio fairly accuractely. The rounding error is 0.233. We had an error of about 1.5 before the final addition. The ratio 0.233/1.5 = 0.16 gives the reduction ratio for the final addition. It approximates the term ratio. (I'm surprised that it is lower than the term ratio instead of higher. In fact, I don't believe the 0.733. According to the term ratio, the worst case must be at least 0.800.) > Exhaustive testing with my older sinpi(x) as the reference > gives > > ./testf -S -m 0x1p-14 -M 0.25 -d -e > MAX ULP: 0.73345101 > Total tested: 100663296 > 0.7 < ULP <= 0.8: 45 > 0.6 < ULP <= 0.7: 11977 Perhaps it really is 0.733. I estimated an error of 1 ulp for p, but its term ratio is much lower than for the first 2 terms -- 20 times lower from the factorial in the denominator -- so 0.55 ulps would be a better estimate. This gives the estimate term_ratio * 1.05 = 0.21 extra ulps which matches 0.233 almost perfectly. Now it seems too perfect to be true -- I usually forget to add up all the half- ulps in calculations like this. > The code is slightly slower than my current best kernel. > sinpif time is 0.0147 usec per call (60 cycles). It still seems slow :-). Full sin(x) takes 43 cycles on amd64 (freefall xeon) for x where its extra precision is used (iy != 0), and calculating only Pi*x in extra precision shouldn't add much to that, and the simpler range reduction should subtract a lot (sin(x) takes 23 cycles below 2Pi where its range reduction is simple). > static inline float > __kernel_sinpif(float x) > { > float p, phi, x2, x2hi, x2lo, xhi, xlo; > uint32_t ix; > > x2 = x * x; > p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; > > GET_FLOAT_WORD(ix, p); > SET_FLOAT_WORD(phi, (ix >> 14) << 14); > > GET_FLOAT_WORD(ix, x2); > SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); I expect that these GET/SET's are the slowest part. They are quite fast in float prec, but in double prec on old i386 CPUs compilers generate bad code which can have penalties of 20 cycles per GET/SET. Why the strange reduction? The double shift is just a manual optimization or pssimization (usually the latter) for clearing low bits. Here it is used to clear 14 low bits instead of the usual 12. This is normally written using just a mask of 0xffff0000, unless you want a different number of bits in the hi terms for technical reasons. Double precision can benefit more from asymmetric splitting of terms since 53 is not divisible by 2; 1 hi term must have less than 26.5 bits and the other term can hold an extra bit. Here is an example of fairly refined splitting and use of 2sum withit: X float X log10f(float x) X { X struct Float r; X float_t hi, lo; X float fhi; hi+lo decompositions need float_t in general, for both correctness and efficiency. X int32_t hx; X X DOPRINT_START(&x); X k_logf(x, &r); X if (!r.lo_set) X RETURNP(r.hi); X if (sizeof(float_t) > sizeof(float)) X RETURNP((invln10_lo + (float_t)invln10_hi) * (r.lo + r.hi)); X _2sumF(r.hi, r.lo); X fhi = r.hi; X GET_FLOAT_WORD(hx, fhi); X SET_FLOAT_WORD(fhi, hx & 0xfffff000); It is hard to do the splitting better than this. This method subtly masks bugs if float_t is not used. On i386/i387 it forces much the same slow loads and stores that are needed to discard any extra precision when float_t is not used. X hi = fhi; X lo = r.lo + (r.hi - hi); The code tries to optimize things with careful assignments between float_t and float. This has further subtleties from float_t being used. When there is extra precision, float_t should be different from float, and compilers should not be broken enough to elide conversions between float and float_t. The above 2 lines are basic splitting, and the subtleties are what is needed for this to work. struct Float is just hi and lo in a struct, each with type float, but a further subtlety is that you want the compiler to keep all variables in registers, with floats in extended precision in the registers; then it must be arranged that hi+lo decompositions in registers have no extra precision... X RETURN2P(invln10_hi * hi, X (invln10_lo + invln10_hi) * lo + invln10_lo * hi); X } ... but when the hi+lo values are used in expressions like this, we want the extra precision back. Here the hi*hi term is exact (that is the point of the decomposition), and extra precision gives some free extra accuracy for the rest of the calculation. > > x2lo = s0lo + x2 * (p - phi) + (x2 - x2hi) * phi; > x2hi *= phi; > > GET_FLOAT_WORD(ix, x); > SET_FLOAT_WORD(xhi, (ix >> 14) << 14); > xlo = x - xhi; > xlo = xlo * (x2lo + x2hi) + (xlo * s0hi + xhi * x2lo); > > return (xlo + xhi * x2hi + xhi * s0hi); > } Yet another splitting is going to be slow. This one is necessary. Are the first 2 really necessary? The above analysis might show that they really aren't -- didn't we get an error of 0.733 ulps (which seems too low) using both methods? Otherwise, this code is very similar to my log10f(). k_logf() does most of the work. It evaluates logf(x) in extra precision. log10f(x) is just C*logf(x) in extra precision. k_logf() arranges to return the extra precision. For sinpi*(), you need to scale the arg instead of the result. This is actually easier, since the arg is not in extra precision and the extra precision is very easy to handle in the kernel (and even already handled in standard kernels). The above logf() runs acceptably fast using a kernel not especially optimized for it. Signficantly slower than my logf(). Only slightly faster than fdlibm+cygnus log10f(), but it is much more accurate. Bruce From owner-freebsd-numerics@freebsd.org Sat Apr 29 10:49:25 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DCCB4D5597F; Sat, 29 Apr 2017 10:49:25 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail110.syd.optusnet.com.au (mail110.syd.optusnet.com.au [211.29.132.97]) by mx1.freebsd.org (Postfix) with ESMTP id 720191A07; Sat, 29 Apr 2017 10:49:24 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id B764110390F; Sat, 29 Apr 2017 20:19:27 +1000 (AEST) Date: Sat, 29 Apr 2017 20:19:23 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Steve Kargl , freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170429151457.F809@besplex.bde.org> Message-ID: <20170429194239.P3294@besplex.bde.org> References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=OsaYX3Fg_SImqZmlGUkA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 10:49:26 -0000 On Sat, 29 Apr 2017, Bruce Evans wrote: > On Fri, 28 Apr 2017, Steve Kargl wrote: > >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote: >>> >>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. > > Comments on this below. > > This is all rather over-engineered. Optimizing these functions is > unimportant comparing with finishing cosl() and sinl() and optimizing > all of the standard trig functions better, but we need correctness. > But I now see many simplifications and improvements: > > (1) There is no need for new kernels. The standard kernels already handle > extra precision using approximations like: > > sin(x+y) ~= sin(x) + (1-x*x/2)*y. > > Simply reduce x and write Pi*x = hi+lo. Then > > sin(Pi*x) = __kernel_sin(hi, lo, 1). > > I now see how to do the extra-precision calculations without any > multiplications. But that is over-engineered too. Using the standard kernels is easy and works well: XX #include XX #include XX XX #include "math_private.h" XX XX static const double XX pi_hi = 3.1415926218032837e+00, /* 0x400921fb 0x50000000 */ XX pi_lo = 3.1786509547050787e-08; /* 0x3e6110b4 0x611a5f14 */ XX XX /* Only for |x| <= ~0.25 (missing range reduction. */ XX XX double XX cospi(double x) XX { XX double_t hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; XX hi = pi_hi * hi; XX _2sumF(hi, lo); XX return __kernel_cos(hi, lo); XX } XX XX double XX sinpi(double x) XX { XX double_t hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; XX hi = pi_hi * hi; XX _2sumF(hi, lo); XX return __kernel_sin(hi, lo, 1); XX } XX XX double XX tanpi(double x) XX { XX double_t hi, lo; XX XX hi = (float)x; XX lo = x - hi; XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; XX hi = pi_hi * hi; XX _2sumF(hi, lo); XX return __kernel_tan(hi, lo, 1); XX } I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for sin(). Efficiency is very good in some cases, but anomalous in others: all times in cycles, on i386, on the range [0, 0.25] athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 cos: 61-62 44 43 cospi: 69-71 (8-9 extra) 78 (anomalous...) 42 (faster to do more!) sin: 59-60 51 37 sinpi: 67-68 (8 extra) 80 42 tan: 136-172 93-195 67-94 tanpi: 144-187 (8-15 extra) 145-176 61-189 That was a throughput test. Latency is not so good. My latency test doesn't use serializing instructions, but uses random args and the partial serialization of making each result depend on the previous one. athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 cos: 84-85 69 79 cospi: 103-104 (19-21 extra) 117 94 sin: 75-76 89 77 sinpi: 105-106 (30 extra) 116 90 tan: 168-170 167-168 147 tanpi: 191-194 (23-24 extra) 191 154 This also indicates that the longest times for tan in the throughput test are what happens when the function doesn't run in parallel with itself. The high-degree polynomial and other complications in tan() are too complicated for much cross-function parallelism. Anywyay, it looks like the cost of using the kernel is at most 8-9 in the parallel case and at most 30 in the serial case. The extra- precision code has about 10 dependent instructions, so it s is doing OK to take 30. Bruce From owner-freebsd-numerics@freebsd.org Sat Apr 29 18:10:24 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8E065D55BE9; Sat, 29 Apr 2017 18:10:24 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 76320D7A; Sat, 29 Apr 2017 18:10:24 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TIAN63041484 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Sat, 29 Apr 2017 11:10:23 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TIAMGF041483; Sat, 29 Apr 2017 11:10:22 -0700 (PDT) (envelope-from sgk) Date: Sat, 29 Apr 2017 11:10:22 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170429181022.GA41420@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170429151457.F809@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 18:10:24 -0000 On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote: > On Fri, 28 Apr 2017, Steve Kargl wrote: > > > On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote: > >> > >> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. > > Comments on this below. Way too much information to digest in one reading. I'll answer the things that are easy ... > >> x2 = x * x; > >> p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; > >> s = x * (x2 * (double)p + s0hi + s0lo); > >> return ((float)s); > >> } > > I don't like the style being completely different. Why lower-case constants? > ... Having teethed on musty FORTRAN code, I have an adversion to uppercase. I also do not like camel case, which seems to plague modern languages. > > Well, starting with above and playing the splitting game > > with x, x2, and p. I've manage to reduce the max ULP in > > the region testd. Testing against MPFR with sin(pi*x) > > computed in 5*24-bit precision gives > > It already cheats by using (double)p, but is limited by using float to > caclulate p. Yes, the above is cheating to the extent that the above code started out as completely written in double. This gives max ULP < 0.51. I then started reducing the double to float with the intent of finding when max ULP exceeds some chosen threshold. The above gave something like max ULP < 0.55. I then moved the (double) cast to different terms/factors, observed how the max ULP changed. This then gives information about what requires extra precision. > > static inline float > > __kernel_sinpif(float x) > > { > > float p, phi, x2, x2hi, x2lo, xhi, xlo; > > uint32_t ix; > > > > x2 = x * x; > > p = x2 * (x2 * (x2 * s4 + s3) + s2) + s1; > > > > GET_FLOAT_WORD(ix, p); > > SET_FLOAT_WORD(phi, (ix >> 14) << 14); > > > > GET_FLOAT_WORD(ix, x2); > > SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); > > I expect that these GET/SET's are the slowest part. They are quite fast > in float prec, but in double prec on old i386 CPUs compilers generate bad > code which can have penalties of 20 cycles per GET/SET. > > Why the strange reduction? The double shift is just a manual optimization > or pssimization (usually the latter) for clearing low bits. Here it is > used to clear 14 low bits instead of the usual 12. This is normally > written using just a mask of 0xffff0000, unless you want a different > number of bits in the hi terms for technical reasons. Double precision > can benefit more from asymmetric splitting of terms since 53 is not > divisible by 2; 1 hi term must have less than 26.5 bits and the other term > can hold an extra bit. Because I didn't think about using a mask. :-) It's easy to change 14 to 13 or 11 or ..., while I would need to write out zeros and one to come up with 0xffff8000, etc. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Sat Apr 29 18:42:11 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B8BCDD567C2; Sat, 29 Apr 2017 18:42:11 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id A043F1D4B; Sat, 29 Apr 2017 18:42:11 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TIg9KM041637 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Sat, 29 Apr 2017 11:42:09 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TIg95N041636; Sat, 29 Apr 2017 11:42:09 -0700 (PDT) (envelope-from sgk) Date: Sat, 29 Apr 2017 11:42:09 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170429184209.GB41420@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429194239.P3294@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170429194239.P3294@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 18:42:11 -0000 On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote: > On Sat, 29 Apr 2017, Bruce Evans wrote: > > > On Fri, 28 Apr 2017, Steve Kargl wrote: > > > >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote: > >>> > >>> I was just backtracking with __kernel_sinpi. This gets a max ULP < 0.61. > > > > Comments on this below. > > > > This is all rather over-engineered. Optimizing these functions is > > unimportant comparing with finishing cosl() and sinl() and optimizing > > all of the standard trig functions better, but we need correctness. > > But I now see many simplifications and improvements: > > > > (1) There is no need for new kernels. The standard kernels already handle > > extra precision using approximations like: > > > > sin(x+y) ~= sin(x) + (1-x*x/2)*y. > > > > Simply reduce x and write Pi*x = hi+lo. Then > > > > sin(Pi*x) = __kernel_sin(hi, lo, 1). > > > > I now see how to do the extra-precision calculations without any > > multiplications. > > But that is over-engineered too. > > Using the standard kernels is easy and works well: As your code only works on the interval [0,0.25], I took the liberty to use it as a __kernel_sinpi and __kernel_cospi. > XX double > XX cospi(double x) > XX { > XX double_t hi, lo; If sizeof(double_t) indicates what I think it means, This is slow on my Core2 duo (aka ia32 system). > XX hi = (float)x; > XX lo = x - hi; This is the splitting I use in my double version versions with hi and lo as simply double. > XX lo = (pi_lo + pi_hi) * lo + pi_lo * hi; > XX hi = pi_hi * hi; > XX _2sumF(hi, lo); > XX return __kernel_cos(hi, lo); > XX } > XX > > I only did a sloppy accuracy test for sinpi(). It was 0.03 ulps less > accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for > sin(). > > Efficiency is very good in some cases, but anomalous in others: all > times in cycles, on i386, on the range [0, 0.25] > > athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 > cos: 61-62 44 43 > cospi: 69-71 (8-9 extra) 78 (anomalous...) 42 (faster to do more!) > sin: 59-60 51 37 > sinpi: 67-68 (8 extra) 80 42 > tan: 136-172 93-195 67-94 > tanpi: 144-187 (8-15 extra) 145-176 61-189 > > That was a throughput test. Latency is not so good. My latency test > doesn't use serializing instructions, but uses random args and the > partial serialization of making each result depend on the previous > one. > > athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 > cos: 84-85 69 79 > cospi: 103-104 (19-21 extra) 117 94 > sin: 75-76 89 77 > sinpi: 105-106 (30 extra) 116 90 > tan: 168-170 167-168 147 > tanpi: 191-194 (23-24 extra) 191 154 I is unclear how you're making your measurements. My timings with my kernels compared to kernels based on your code: | Bruce | Steve ------+--------------+-------------- sinpi | 0.0742 (148) | 0.0633 (126) cospi | 0.0720 (144) | 0.0513 (102) First number is microseconds per call and the (xxx) is the time*cpu_freq. As far as over-engineering, for sinpi I find sinpi Bruce kernel Steve kernel MAX ULP: 0.73021263 0.73955815 Total tested: 33554431 33554431 0.7 < ULP <= 0.8: 154 280 0.6 < ULP <= 0.7: 27650 29197 cospi is much more interesting and as you state above more difficult to get right. I have reworked my kernel, yet, but I find cospi Bruce kernel Steve kernel MAX ULP: 0.78223389 0.89921787 Total tested: 33554431 33554431 0.8 < ULP <= 0.9: 0 3262 0.7 < ULP <= 0.8: 9663 68305 0.6 < ULP <= 0.7: 132948 346214 Perhaps, using double_t would reduce my max ULP at the expense of speed. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Sat Apr 29 19:09:39 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 19F17D56FC5; Sat, 29 Apr 2017 19:09:39 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id A0FADD8A; Sat, 29 Apr 2017 19:09:38 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id A7D7342B17F; Sun, 30 Apr 2017 05:09:27 +1000 (AEST) Date: Sun, 30 Apr 2017 05:09:26 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170429181022.GA41420@troutmask.apl.washington.edu> Message-ID: <20170430042756.A862@besplex.bde.org> References: <20170427231411.GA11346@troutmask.apl.washington.edu> <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429181022.GA41420@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=KeqiiUQD c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=4e2EsQlSr8nDh46oqDgA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 19:09:39 -0000 On Sat, 29 Apr 2017, Steve Kargl wrote: > On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote: >> On Fri, 28 Apr 2017, Steve Kargl wrote: > ... >>> GET_FLOAT_WORD(ix, p); >>> SET_FLOAT_WORD(phi, (ix >> 14) << 14); >>> >>> GET_FLOAT_WORD(ix, x2); >>> SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); >> >> I expect that these GET/SET's are the slowest part. They are quite fast >> in float prec, but in double prec on old i386 CPUs compilers generate bad >> code which can have penalties of 20 cycles per GET/SET. >> >> Why the strange reduction? The double shift is just a manual optimization >> or pssimization (usually the latter) for clearing low bits. Here it is >> used to clear 14 low bits instead of the usual 12. This is normally >> written using just a mask of 0xffff0000, unless you want a different >> number of bits in the hi terms for technical reasons. Double precision >> can benefit more from asymmetric splitting of terms since 53 is not >> divisible by 2; 1 hi term must have less than 26.5 bits and the other term >> can hold an extra bit. > > Because I didn't think about using a mask. :-) > > It's easy to change 14 to 13 or 11 or ..., while I would > need to write out zeros and one to come up with 0xffff8000, > etc. Here are some examples of more delicate splittings from the uncommitted clog*(). They are usually faster than GET/SET, but slower than converting to lower precision as is often possible for double precision and ld128 only. clog*() can't use the casting method since it needs to split in the middle, and doesn't use GET/SET since it is slow. It uses methods that only work on args that are not too large or too small, and uses a GET earlier to classify the arg size. XX double prec: XX double_t ax, ax2h, ax2l, t; XX XX t = (double)(ax * (0x1p27 + 1)); XX axh = (double)(ax - t) + t; XX axl = ax - axh; Splitting in the middle is required to calculate ax*ax exactly. Since the middle bit is 26.5, splitting in the middle is impossible, but there are tricks to make spitting at bit 26 or 27 work just as well. Multipling by 0x1p27 shifts to a place where the lower bits are canceled by the subtraction. Adding 1 is a trick to create a extra (virtual) bit needed for the exact multiplication. At least ax needs to have the extra-precision type double_t so that casting to double works. Otherwise, this code would need STRICT_ASSIGN() to work around compiler bugfreatures. We assign back to double_t for efficiency. XX float prec: XX float_t ax, axh, axl, t; XX XX t = (float)(ax * (0x1p12F + 1)); XX axh = (float)(ax - t) + t; XX axl = ax - axh; This is simpler because there is a middle. I think adding 1 is not needed. XX long double double prec: XX #if LDBL_MANT_DIG == 64 XX #define MULT_REDUX 0x1p32 /* exponent MANT_DIG / 2 rounded up */ XX #elif LDBL_MANT_DIG == 113 XX #define MULT_REDUX 0x1p57 XX #else XX #error "Unsupported long double format" XX #endif XX XX long double ax, axh, axl, t; XX XX /* Calculate ax*ax and ay*ay exactly using Dekker's algorithm. */ XX t = (long double)(ax * (MULT_REDUX + 1)); XX axh = (long double)(ax - t) + t; XX axl = ax - axh; This is more unsimpler because the middle depends on a parameter and it is fractional in some cases. If adding 1 is not needed for the non- fractional case, then it should be added in MULT_REDUX in the fractional case only. Several functions including expl and e_rem_pio* use this REDUX method with the parameter hard-coded in a different way, usually with a slower STRICT_ASSIGN() and a special conversion to an integer or integer divided by a power of 2. The above is supposed to be more careful with extra precision so that STRICT_ASSIGN() is not needed. The calculation of axh may have problems with double rounding, but I think this doesn't matter here -- whatever axh is, the rest ends up in axl and we just need to ensure that both have 27 lower bits zero. Bruce From owner-freebsd-numerics@freebsd.org Sat Apr 29 19:38:30 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B3AD6D54D63; Sat, 29 Apr 2017 19:38:30 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 9C012AE; Sat, 29 Apr 2017 19:38:30 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3TJcTRx041989 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Sat, 29 Apr 2017 12:38:29 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3TJcTNF041988; Sat, 29 Apr 2017 12:38:29 -0700 (PDT) (envelope-from sgk) Date: Sat, 29 Apr 2017 12:38:29 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170429193829.GA41964@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429181022.GA41420@troutmask.apl.washington.edu> <20170430042756.A862@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170430042756.A862@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 19:38:30 -0000 On Sun, Apr 30, 2017 at 05:09:26AM +1000, Bruce Evans wrote: > On Sat, 29 Apr 2017, Steve Kargl wrote: > > > On Sat, Apr 29, 2017 at 05:54:21PM +1000, Bruce Evans wrote: > >> On Fri, 28 Apr 2017, Steve Kargl wrote: > > ... > >>> GET_FLOAT_WORD(ix, p); > >>> SET_FLOAT_WORD(phi, (ix >> 14) << 14); > >>> > >>> GET_FLOAT_WORD(ix, x2); > >>> SET_FLOAT_WORD(x2hi, (ix >> 14) << 14); > >> > >> I expect that these GET/SET's are the slowest part. They are quite fast > >> in float prec, but in double prec on old i386 CPUs compilers generate bad > >> code which can have penalties of 20 cycles per GET/SET. > >> > >> Why the strange reduction? The double shift is just a manual optimization > >> or pssimization (usually the latter) for clearing low bits. Here it is > >> used to clear 14 low bits instead of the usual 12. This is normally > >> written using just a mask of 0xffff0000, unless you want a different > >> number of bits in the hi terms for technical reasons. Double precision > >> can benefit more from asymmetric splitting of terms since 53 is not > >> divisible by 2; 1 hi term must have less than 26.5 bits and the other term > >> can hold an extra bit. > > > > Because I didn't think about using a mask. :-) > > > > It's easy to change 14 to 13 or 11 or ..., while I would > > need to write out zeros and one to come up with 0xffff8000, > > etc. > > Here are some examples of more delicate splittings from the uncommitted > clog*(). They are usually faster than GET/SET, but slower than converting > to lower precision as is often possible for double precision and ld128 > only. clog*() can't use the casting method since it needs to split in the > middle, and doesn't use GET/SET since it is slow. It uses methods that > only work on args that are not too large or too small, and uses a GET > earlier to classify the arg size. I didn't know about these other splitting methods. Thanks for pointing them out to me. I updated by k_sinpif.c to use the standard masking with 0xffff0000. It has no effect on the timing on Core2 dou. It did however effect the max ULP. With exhaustive testing in [0x1p-14,0.25] I now have MAX ULP: 0.68287528 Total tested: 100663296 0.6 < ULP <= 0.7: 5607 the older version with the shifts by 14 bits gives MAX ULP: 0.73345101 Total tested: 100663296 0.7 < ULP <= 0.8: 45 0.6 < ULP <= 0.7: 11977 The value of 14 is a holdover from an earlier version. Getting back to the use of float_t and double_t. If one wants the performance penalty, these then work well. Changing types to float_t in k_cospif.c, I find a slowdown of for cospif, but I also find MAX ULP: 0.64679509 Total tested: 1048576000 0.6 < ULP <= 0.7: 31598 with exhaustive testing in [0,0.25]. -- Steve 20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4 20161221 https://www.youtube.com/watch?v=IbCHE-hONow From owner-freebsd-numerics@freebsd.org Sat Apr 29 20:13:00 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E8B38D55A65; Sat, 29 Apr 2017 20:13:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 952EC16C1; Sat, 29 Apr 2017 20:13:00 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 2DB851041D4A; Sun, 30 Apr 2017 06:12:46 +1000 (AEST) Date: Sun, 30 Apr 2017 06:12:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl cc: Bruce Evans , freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions In-Reply-To: <20170429184209.GB41420@troutmask.apl.washington.edu> Message-ID: <20170430051230.X1000@besplex.bde.org> References: <20170428010122.GA12814@troutmask.apl.washington.edu> <20170428183733.V1497@besplex.bde.org> <20170428165658.GA17560@troutmask.apl.washington.edu> <20170429035131.E3406@besplex.bde.org> <20170428201522.GA32785@troutmask.apl.washington.edu> <20170429070036.A4005@besplex.bde.org> <20170428233552.GA34580@troutmask.apl.washington.edu> <20170429005924.GA37947@troutmask.apl.washington.edu> <20170429151457.F809@besplex.bde.org> <20170429194239.P3294@besplex.bde.org> <20170429184209.GB41420@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=AYLBJzfG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=ZDInmu-daXXT0OxnvG4A:9 a=dM07vABZqfTYfvwQ:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 29 Apr 2017 20:13:01 -0000 On Sat, 29 Apr 2017, Steve Kargl wrote: > On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote: >. ... >> Using the standard kernels is easy and works well: > > As your code only works on the interval [0,0.25], I took > the liberty to use it as a __kernel_sinpi and __kernel_cospi. You already wrote the reduction part, and lgamma already has it to. I just checked lgamma. Its reduction part is short (especially after your improvements to it). Then it is sloppy and doesn't caclulate Pi*x in extra precision. I think this is justified since lgamma generally has errors much larger than 1 ulp. it does some micro-optimizations involving minus signs. There is some danger that with the kernels inline, the compiler will generate large inline code for multiple copies whose difference is just the placement of a single instruction to do the negation. If this is an optimization, then it is small. > >> XX double >> XX cospi(double x) >> XX { >> XX double_t hi, lo; > > If sizeof(double_t) indicates what I think it means, > This is slow on my Core2 duo (aka ia32 system). I doubt that it is what you think it is, since you neglect testing on i386 :-). It makes no difference on amd64, but is needed on i386 to optimized and give correct code. However, double_t is broken with clang on i386 with certain CFLAGS that give SSE. Then clang produces code that is incompatible with FLT_EVAL_METHOD and double_t. This makes clangs use of SSE a pessimization. It dodn't seem to break correctness (the result is the same as using long double instead of double_t). >> XX hi = (float)x; >> XX lo = x - hi; > > This is the splitting I use in my double version versions > with hi and lo as simply double. It will work without double_t for the splitting. double_t (or lots of slow STRICT_ASSIGN()s is needed for 2sum. 2sum is like the above except it casts to double instead of float, and compilers intentionally break casts. >> ... >> Efficiency is very good in some cases, but anomalous in others: all >> times in cycles, on i386, on the range [0, 0.25] >> >> athlon-xp, gcc-3.3 Haswell, gcc-3.3 Haswell, gcc-4.2.1 >> cos: 61-62 44 43 >> cospi: 69-71 (8-9 extra) 78 (anomalous...) 42 (faster to do more!) >> ... > > I is unclear how you're making your measurements. My timings > with my kernels compared to kernels based on your code: Basically by timing a for loop, but with many refinements to make sure that I am timing the function and not the loop, or at least to make sure that the loop timing doesn't change. For example, on amd64 and i386, the stack is aligned at 4K plus a smaller adjustable parameter since otherwise the timing for long doubles is too variable. Also, rebuilding all the object files used with identical CFLAGS (usually -O -march=better-than-native). This probably makes my tests run 20-50 cycles per iteration faster (the simplest non-inline function takes 8 or 9 cycles and that takes running it in parallel with itself). > | Bruce | Steve > ------+--------------+-------------- > sinpi | 0.0742 (148) | 0.0633 (126) > cospi | 0.0720 (144) | 0.0513 (102) > > First number is microseconds per call and the (xxx) is the time*cpu_freq. The difference might be loop overhead or loss of parallelism. IIRC, you use something to serialize after every function. I have 4 variations and rarely use the serialization ones. Or thie difference might be more that the kernels are not inlined. I forgot to inline them too. I always rebuild the kernels for every test, so they get compiled with non-default CFLAGS, but they don't get inlined without #include statements. > As far as over-engineering, for sinpi I find > > sinpi Bruce kernel Steve kernel > MAX ULP: 0.73021263 0.73955815 > Total tested: 33554431 33554431 > 0.7 < ULP <= 0.8: 154 280 > 0.6 < ULP <= 0.7: 27650 29197 See, the kernels are already tuned almost as much as possible for accuracy/efficiency tradeoffs. > cospi is much more interesting and as you state above more > difficult to get right. I have reworked my kernel, yet, > but I find > > cospi Bruce kernel Steve kernel > MAX ULP: 0.78223389 0.89921787 > Total tested: 33554431 33554431 > 0.8 < ULP <= 0.9: 0 3262 > 0.7 < ULP <= 0.8: 9663 68305 > 0.6 < ULP <= 0.7: 132948 346214 > > Perhaps, using double_t would reduce my max ULP at the expense > of speed. On i386, it should reduce the error at the expense of slowness. On amd64, it makes no difference, but using long double should reduct the error at the expense of speed. Usually the reduction in the maximum error is only about 0.1 ulps unless the algorithm uses extra precision intentionally. This is available on i386, but is not large enough to compensate for the slowness. My example of log10*() unintentionally gave an example of using extra precision intentionally for efficiency. Here it is for log10(): XX double XX log10(double x) XX { XX struct Double r; XX double_t hi, lo; XX XX DOPRINT_START(&x); XX k_log(x, &r); XX if (!r.lo_set) XX RETURNP(r.hi); XX if (0 && sizeof(double_t) > sizeof(double)) XX RETURNP((invln10_lo + (double_t)invln10_hi) * (r.lo + r.hi)); This avoids the emulated extra-precision calculations of possible. This is under 'if (0)' since it is too hard to determine if extra precision is available. The sizeof() calculation doesn't tell you if it is. On i386, double_t is long double, but extra precision is not available under FreeBSD unless the program used fpsetprec(FP_PE) to change the default of double precision. This is not under 'if (0)' for float precision. Then it is assume that the program doesn't use fpsetprec(FP_PS) to change the default to float precision. XX _2sumF(r.hi, r.lo); XX hi = (float)r.hi; XX lo = r.lo + (r.hi - hi); XX RETURN2P(invln10_hi * hi, XX (invln10_lo + invln10_hi) * lo + invln10_lo * hi); XX } i386/i387, when used as designed, can execute this function much faster than SSE by not executing the lines beginning with _2sumF(). These lines take 20-40 cycles (latency) on modern SSE. Throughput is better. This would probably be a good optimization for amd64 too (using long double instead of double_t for the extra precision). The Intel ia64 math library uses this method of switching to extended precision often. ia64 has a better FP design. I don't know its details, but it seems to combine the best features of i387 and SSE. The FP registers should have extra precision as on i387, but the instructions should make it efficient to not use the extra precision for languages and programmers that don't want it. Bruce