From owner-freebsd-standards@freebsd.org Tue Apr 25 08:28:08 2017
Date: Tue, 25 Apr 2017 18:27:52 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Steve Kargl
cc: freebsd-numerics@freebsd.org, freebsd-standards@freebsd.org
Subject: Re: Implementation of half-cycle trigonometric functions
In-Reply-To: <20170425063643.GA45906@troutmask.apl.washington.edu>
Message-ID: <20170425173542.S1401@besplex.bde.org>
References: <20170409220809.GA25076@troutmask.apl.washington.edu>
 <20170413171248.GA86780@troutmask.apl.washington.edu>
 <20170424232743.GA54621@troutmask.apl.washington.edu>
 <20170425143120.R933@besplex.bde.org>
 <20170425063643.GA45906@troutmask.apl.washington.edu>

On Mon, 24 Apr 2017, Steve Kargl wrote:

> On Tue, Apr 25, 2017 at 03:13:44PM +1000, Bruce Evans wrote:
>> [for expl]
>> This is arranged to look a bit like Horner's method at first (operations
>> not written in separate statements), but it is basically Estrin's method
>> with no extensions to more complicated sub-expressions than Estrin's,
>> but with complications for accuracy:
>> - I think accuracy for the B3 term is unimportant, but for some reason
>>   (perhaps just the one in the comment) it is grouped separately.  Estrin
>>   might use x3 * (B3.e + x * B4.e).
>
> Yes, it is most likely an accuracy issue.  I found something
> similar with sinpi/cospi.  My k_sinpi.c has
>
> #if 0
> double x2;
> /* Horner's method. */
> x2 = x * x;
> sm = x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s7 + s6) + s5) + s4) + s3)
>     + s2) + s1;

Horner's method of course has about as many dependencies as operations
(13), so on modern x86 this is limited by latency: 3-5 cycles per
instruction, or 39-65 cycles in total.  Multiplications take an extra
cycle in more cases, so it is better to have more additions, but
unfortunately Horner gives 1 more multiplication.  With no dependencies,
the limit would be throughput -- 2 instructions/cycle, i.e., 6.5 cycles
rounded up to 7; then add the latency of the last operation to get about
11.  With practical dependencies, it is hard to get more than a 2-fold
speedup.

> #else
> double x2, x4;
> /* Sort of Estrin's method. */
> x2 = x * x;
> x4 = x2 * x2;
> sm = (s5 + s6 * x2 + s7 * x4) * x4 * x4
>     + (s2 + s3 * x2 + s4 * x4) * x2 + s1;
> #endif

This looks OK.
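To make the comparison concrete, here is a standalone sketch of the two
evaluation orders quoted above.  The coefficients are the Taylor
coefficients of sin(pi*x)/x, standing in as placeholders for the actual
minimax coefficients in Steve's k_sinpi.c, which do not appear in this
thread:

#include <math.h>
#include <stdio.h>

static double s[8];		/* s[1]..s[7], named as in k_sinpi.c */

static void
init_coeffs(void)
{
	/* Taylor placeholder: s[k] = (-1)^(k-1) * pi^(2k-1) / (2k-1)!. */
	double p = M_PI, f = 1;
	int k;

	for (k = 1; k <= 7; k++) {
		s[k] = (k & 1 ? p : -p) / f;
		p *= M_PI * M_PI;
		f *= (2 * k) * (2 * k + 1);
	}
}

static double
sm_horner(double x)
{
	double x2 = x * x;

	/* One long dependency chain of ~13 operations: latency-bound. */
	return (x2 * (x2 * (x2 * (x2 * (x2 * (x2 * s[7] + s[6]) + s[5]) +
	    s[4]) + s[3]) + s[2]) + s[1]);
}

static double
sm_estrin(double x)
{
	double x2 = x * x;
	double x4 = x2 * x2;

	/* Independent inner groups; s[1] is still added last for accuracy. */
	return ((s[5] + s[6] * x2 + s[7] * x4) * x4 * x4 +
	    (s[2] + s[3] * x2 + s[4] * x4) * x2 + s[1]);
}

int
main(void)
{
	double x = 0.125;

	init_coeffs();
	/* x * sm(x) approximates sin(pi*x) on [0, 0.25]. */
	printf("horner: %.17g\nestrin: %.17g\nlibm:   %.17g\n",
	    x * sm_horner(x), x * sm_estrin(x), sin(M_PI * x));
	return (0);
}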
The only obvious improvement is to try grouping x4*x4 so that it can be
evaluated separately.  This reduces the length of the dependency chains
by 1, but increases early pressure on multiplication.  Don't write this
as x8 = x4*x4; compilers will do the equivalent.  You can also replace
x4 by (x2*x2), but there are many x4's, so this might be less readable.

> The addition of s1 must occur last, or I get greater than 1 ULP
> for a number of values.

Indeed.  This prevents using the best order, which is the opposite.
You used the best order for the other inner terms, but not so obviously
for the first outer addition.  However, the compiler is free to reorder
that; consider:

	T2 + T1 + T0

T2 is more complicated and everything depends on it, so it should be
started first, but possibly T1 can be started first because it is
simpler.  Here it is not significantly simpler.  However, for:

	T3 + T2 + T1 + T0

the natural left-to-right grouping is usually pessimal.  This should be
written as:

	T3 + (T2 + T1) + T0

Unfortunately, we can't write this naturally and usually optimally as:

	T0 + T1 + T2 + T3

due to the accuracy problem with T0.  Maybe write this as:

	T1 + T2 + T3 + T0	// only T0 in unnatural order

For expl, I tried all orders.

> My timings on an Intel Core 2 Duo show
>
> /* Horner's method. */
> laptop-kargl:kargl[219] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.066 usec per call
> cospi time: 0.063 usec per call
>
> /* Estrin's method. */
> laptop-kargl:kargl[212] ./testd -S -s -m 0 -M 0.25 -n 22
> sinpi time: 0.059 usec per call
> cospi time: 0.055 usec per call
>
> Here, -n 22 means to test 2**22 - 1 values in the interval
> [0,0.25].

The improvement is marginal.  Only worth doing if it doesn't complicate
the code much.  Older CPUs can't exploit the parallelism so well, so I
would expect a relatively larger improvement there.

Don't measure in usec, since microseconds are hard to scale.  0.066 usec
at 1.86 GHz is more than 100 cycles.  This is quite slow.  I get similar
times for cos and sin on an older Athlon XP (~40 cycles in float
precision, ~160 in double and ~270 in long double).  Newer CPUs are
about 4 times faster in double precision (only 2 times faster in float
precision) in cycles, despite having similar nominal timing, by
unclogging their pipes.

We should also try to use the expression A*x + B more often, so that
things will be faster when this is done by fma, but this happens
naturally anyway.

Bruce
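As a concrete illustration of measuring in cycles rather than usec, a
minimal x86 sketch using the TSC might look like the following.  The
polynomial being timed and its coefficients are arbitrary stand-ins, not
Steve's testd harness, and a constant-rate TSC is assumed:

#include <stdio.h>
#include <x86intrin.h>		/* __rdtsc() */

#define	N	(1 << 22)

static double
poly(double x)
{
	double x2 = x * x;
	double x4 = x2 * x2;

	/* Estrin-style grouping; coefficients are arbitrary here. */
	return ((0.5 + 0.25 * x2 + 0.125 * x4) * x4 * x4 +
	    (1.0 + 0.75 * x2 + 0.625 * x4) * x2 + 1.0);
}

int
main(void)
{
	volatile double sink = 0;
	unsigned long long t0, t1;
	int i;

	t0 = __rdtsc();
	for (i = 0; i < N; i++)
		sink += poly((double)i / N * 0.25);
	t1 = __rdtsc();
	/* Loop and argument-generation overhead are included. */
	printf("%.1f cycles per call\n", (double)(t1 - t0) / N);
	return (0);
}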