From owner-freebsd-numerics@freebsd.org Tue Apr 25 06:36:45 2017 Return-Path: Delivered-To: freebsd-numerics@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 37F36D4F513; Tue, 25 Apr 2017 06:36:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.95.76.21]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "troutmask", Issuer "troutmask" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 2199F1D18; Tue, 25 Apr 2017 06:36:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost [127.0.0.1]) by troutmask.apl.washington.edu (8.15.2/8.15.2) with ESMTPS id v3P6ah72046094 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.15.2/8.15.2/Submit) id v3P6ahPV046093; Mon, 24 Apr 2017 23:36:43 -0700 (PDT) (envelope-from sgk) Date: Mon, 24 Apr 2017 23:36:43 -0700 From: Steve Kargl To: Bruce Evans Cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org Subject: Re: Implementation of half-cycle trignometric functions Message-ID: <20170425063643.GA45906@troutmask.apl.washington.edu> Reply-To: sgk@troutmask.apl.washington.edu References: <20170409220809.GA25076@troutmask.apl.washington.edu> <20170413171248.GA86780@troutmask.apl.washington.edu> <20170424232743.GA54621@troutmask.apl.washington.edu> <20170425143120.R933@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20170425143120.R933@besplex.bde.org> User-Agent: Mutt/1.7.2 (2016-11-26) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Apr 2017 06:36:45 -0000 On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote: > On Mon, 24 Apr 2017, Steve Kargl wrote: > > > I have what appears to be working versions of asinpi[fl]. It > > was suggested elsewhere that using an Estrin's method to > > sum the polynomial approximations instead of Horner's method > > may allow modern CPUs to better schedule generated code. > > I have implemented an Estrin's-like method for sinpi[l] > > and cospi[l], and indeed the generated code is faster on > > my Intel core2 duo with only a slight degradation in the > > observed max ULP. I'll post new versions to bugzilla in > > the near future. > > I didn't remember Estrin, but checked Knuth and see that his > method is a subset of what I use routinely. I might have learned > it from Knuth. I tried all methods in Knuth, but they didn't seem > to be any better at first. Later I learned more about scheduling. > Estrin's method now seems routine as a subset of scheduling. I think I came across this in Knuth as well, but I may have picked of the name Estrin from Mueller's book (or google :). > > You already used it in expl: for ld80: > I recall that that is your handy work. > X x2 = x * x; > X x4 = x2 * x2; > > Unlike in Knuth's description of Estrin, we don't try to order the > operations to suit the CPU(s) in the code, but just try to keep them > independent, as required for them to execute independently. Compilers > will rearrange them anyway, sometimes even correctly, but it makes > little difference (on OOE CPUs) to throw all of the operations into > the pipeline and see how fast something comes out). > > X q = x4 * (x2 * (x4 * > X /* > X * XXX the number of terms is no longer good for > X * pairwise grouping of all except B3, and the > X * grouping is no longer from highest down. > X */ > X (x2 * B12 + (x * B11 + B10)) + > X (x2 * (x * B9 + B8) + (x * B7 + B6))) + > X (x * B5 + B4.e)) + x2 * x * B3.e; > > This is arranged to look a bit like Horner's method at first (operations > not written in separate statements), but it is basically Estrin's method > with no extensions to more complicated sub-expressions than Estrin's, > but complications for accuracy: > - I think accuracy for the B3 term is unimportant, but for some reason > (perhaps just the one in the comment) it is grouped separately. Estrin > might use x3 * (B3.e + x * B4.e). Yes, it is most likely an accuracy issue. I found something similar with sinpi/cospi. My k_sinpi.c has #if 0 double x2; /* Horner's method. */ x2 = x * x; sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3) + s2) + s1; #else double x2, x4; /* Sort of Estrin's method. */ x2 = x * x; x4 = x2 * x2; sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4 + (s2 + s3 * x2 + s4 * x4) * x2 + s1; #endif The addition of s1 must occur last, or I get greater 1 ULP for a number of values. My timing show on Intel core2 duo /* Horner's methods. */ laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22 sinpi time: 0.066 usec per call cospi time: 0.063 usec per call /* Estrin's methods. */ laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22 sinpi time: 0.059 usec per call cospi time: 0.055 usec per call Here, -n 22 means to test 2**22 - 1 values in the interval [0,0.25]. For the older functions that were committed a few years ago, I should probably go back to recheck timings and accuracy with and without Estrin's method. -- Steve 20161221 https://www.youtube.com/watch?v=IbCHE-hONow