From owner-freebsd-numerics@freebsd.org Sat Apr 29 18:42:11 2017
Date: Sat, 29 Apr 2017 11:42:09 -0700
From: Steve Kargl
To: Bruce Evans
Cc: freebsd-hackers@freebsd.org, freebsd-numerics@freebsd.org
Subject: Re: Implementation of half-cycle trignometric functions
Message-ID: <20170429184209.GB41420@troutmask.apl.washington.edu>
Reply-To: sgk@troutmask.apl.washington.edu
References: <20170428010122.GA12814@troutmask.apl.washington.edu>
 <20170428183733.V1497@besplex.bde.org>
 <20170428165658.GA17560@troutmask.apl.washington.edu>
 <20170429035131.E3406@besplex.bde.org>
 <20170428201522.GA32785@troutmask.apl.washington.edu>
 <20170429070036.A4005@besplex.bde.org>
 <20170428233552.GA34580@troutmask.apl.washington.edu>
 <20170429005924.GA37947@troutmask.apl.washington.edu>
 <20170429151457.F809@besplex.bde.org>
 <20170429194239.P3294@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170429194239.P3294@besplex.bde.org>
User-Agent: Mutt/1.7.2 (2016-11-26)
List-Id: "Discussions of high quality implementation of libm functions."

On Sat, Apr 29, 2017 at 08:19:23PM +1000, Bruce Evans wrote:
> On Sat, 29 Apr 2017, Bruce Evans wrote:
>
> > On Fri, 28 Apr 2017, Steve Kargl wrote:
> >
> >> On Fri, Apr 28, 2017 at 04:35:52PM -0700, Steve Kargl wrote:
> >>>
> >>> I was just backtracking with __kernel_sinpi.  This gets a max ULP < 0.61.
> >
> > Comments on this below.
> >
> > This is all rather over-engineered.  Optimizing these functions is
> > unimportant comparing with finishing cosl() and sinl() and optimizing
> > all of the standard trig functions better, but we need correctness.
> > But I now see many simplifications and improvements:
> >
> > (1) There is no need for new kernels.  The standard kernels already handle
> > extra precision using approximations like:
> >
> >     sin(x+y) ~= sin(x) + (1-x*x/2)*y.
> >
> > Simply reduce x and write Pi*x = hi+lo.  Then
> >
> >     sin(Pi*x) = __kernel_sin(hi, lo, 1).
> >
> > I now see how to do the extra-precision calculations without any
> > multiplications.
>
> But that is over-engineered too.
>
> Using the standard kernels is easy and works well:

As your code only works on the interval [0,0.25], I took the liberty
of using it as __kernel_sinpi and __kernel_cospi.

> XX double
> XX cospi(double x)
> XX {
> XX     double_t hi, lo;

If sizeof(double_t) indicates what I think it does, this is slow
on my Core2 Duo (an ia32 system).

> XX     hi = (float)x;
> XX     lo = x - hi;

This is the splitting I use in my double versions, with hi and lo
declared simply as double.

> XX     lo = (pi_lo + pi_hi) * lo + pi_lo * hi;
> XX     hi = pi_hi * hi;
> XX     _2sumF(hi, lo);
> XX     return __kernel_cos(hi, lo);
> XX }
> XX
>
> I only did a sloppy accuracy test for sinpi().  It was 0.03 ulps less
> accurate than sin() on the range [0, 0.25] for it and [0, Pi/4] for
> sin().
>
> Efficiency is very good in some cases, but anomalous in others: all
> times in cycles, on i386, on the range [0, 0.25]
>
>         athlon-xp, gcc-3.3      Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:    61-62                   44                 43
> cospi:  69-71 (8-9 extra)       78 (anomalous...)  42 (faster to do more!)
> sin:    59-60                   51                 37
> sinpi:  67-68 (8 extra)         80                 42
> tan:    136-172                 93-195             67-94
> tanpi:  144-187 (8-15 extra)    145-176            61-189
>
> That was a throughput test.  Latency is not so good.  My latency test
> doesn't use serializing instructions, but uses random args and the
> partial serialization of making each result depend on the previous
> one.
>
>         athlon-xp, gcc-3.3      Haswell, gcc-3.3   Haswell, gcc-4.2.1
> cos:    84-85                   69                 79
> cospi:  103-104 (19-21 extra)   117                94
> sin:    75-76                   89                 77
> sinpi:  105-106 (30 extra)      116                90
> tan:    168-170                 167-168            147
> tanpi:  191-194 (23-24 extra)   191                154

It is unclear how you're making your measurements.  My timings with
my kernels compared to kernels based on your code:

      |    Bruce     |    Steve
------+--------------+--------------
sinpi | 0.0742 (148) | 0.0633 (126)
cospi | 0.0720 (144) | 0.0513 (102)

The first number is microseconds per call, and the number in
parentheses is time*cpu_freq, i.e., cycles per call.

As far as over-engineering goes, for sinpi I find

sinpi               Bruce kernel    Steve kernel
MAX ULP:            0.73021263      0.73955815
Total tested:       33554431        33554431
0.7 < ULP <= 0.8:   154             280
0.6 < ULP <= 0.7:   27650           29197

cospi is much more interesting and, as you state above, more difficult
to get right.
I haven't reworked my kernel yet, but I find

cospi               Bruce kernel    Steve kernel
MAX ULP:            0.78223389      0.89921787
Total tested:       33554431        33554431
0.8 < ULP <= 0.9:   0               3262
0.7 < ULP <= 0.8:   9663            68305
0.6 < ULP <= 0.7:   132948          346214

Perhaps using double_t would reduce my max ULP at the expense
of speed.

-- 
Steve
20170425 https://www.youtube.com/watch?v=VWUpyCsUKR4
20161221 https://www.youtube.com/watch?v=IbCHE-hONow