From owner-freebsd-numerics@FreeBSD.ORG Sun Aug  5 10:06:02 2012
Date: Sun, 5 Aug 2012 20:05:53 +1000 (EST)
From: Bruce Evans
To: Stephen Montgomery-Smith
Cc: freebsd-numerics@FreeBSD.org, Bruce Evans
Subject: Re: Complex arg-trig functions
Message-ID: <20120805175106.X3574@besplex.bde.org>
In-Reply-To: <501D9C36.2040207@missouri.edu>
References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu>
 <20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu>
 <20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu>
List-Id: "Discussions of high quality implementation of libm functions."

On Sat, 4 Aug 2012, Stephen Montgomery-Smith wrote:

> On 08/04/2012 03:45 PM, Bruce Evans wrote:
>>>
>>>> I made good progress with clog[fl]().  The accuracy problems near 0
>>>> are long solved and proved to be solved.
>
> I am starting to go through your code.
>
> What is the purpose of the macros norm and spadd?

It is to provide easy-to-use maximally efficient primitives for
multi-precision arithmetic.  Full multi-precision is too slow, so old
fdlibm code uses subsets of it tailored to the context.  This is about
as efficient as possible, but every use looks different so it is hard
to read, and context-sensitive tricks are hard to write too.  norm()
and spadd() are the only primitives that I have found generally useful
and efficient.

I used to have a primitive for hi+lo splitting, but found it not
general enough to use.  For splitting, you often have a suitable subset
of the bits in a register and don't want to use a general interface
that would have to copy all the bits from another variable.  Splitting
is inherently inefficient, and my primitive originally focused on
putting MD code in it to optimize it.  I eventually found ways to avoid
it or to optimize the MI code, and the MI code turned out to be quite
good but is still not optimized well by compilers.

Here are my full versions of these interfaces.  Documenting them makes
them unreadable :-).  The following changes are needed:
- make everything type-generic.  Already done for the cut-down versions
  for clog*().  I don't know how to put the debugging code in
  type-generic macros.
- deverbosify the comments.
- due to the compiler bugfeatures whose description forms the main part
  of the comments, type-generic macros won't actually work if the arg
  type is float or double and float differs from float_t or double
  differs from double_t.  These non-type-generic functions tend to
  cause compile-time warnings when there is a mismatch.  How can this
  even be described without incrementing the verbosity instead of
  decreasing it?

% /*-
%  * "Normalize" the terms in the infinite-precision expression (*a + *b)
%  * for the sum of 2 doubles so that *b is as small as possible relative
%  * to *a.
%  * (The resulting *a is the value of the expression in double
%  * precision and the resulting *b is the rounding error.)  |*a| must
%  * be >= |*b|, and exponent overflow or underflow must not occur.  This
%  * uses a Theorem of Dekker.  See Knuth (1981) 4.2.2 Theorem C.
%  *
%  * The doubles must actually be double_t's (spelled as __double_t for
%  * namespace reasons) to avoid the following problems with plain
%  * doubles:
%  *
%  * - If expressions are evaluated in extra precision, then Theorem C
%  *   doesn't apply unless the extra precision is discarded after every
%  *   addition and subtraction.
%  * - C specifies that casts and assignments discard any extra precision.
%  *   However, this is very broken in many compilers.
%  * - If we work around the compiler bugs using STRICT_ASSIGN(), then we
%  *   get not only ugly code but also very inefficient code and
%  *   unnecessarily inaccurate code.  The conversions forced by the
%  *   strict assignments are not just inefficient (since they involve
%  *   extra operations); they are usually very inefficient since they
%  *   give extra dependencies.  Discarding the extra precision that we
%  *   got for "free" then reduces the maximum precision "free"ly
%  *   available in final values.
%  * - If we don't work around the compiler bugs, then we get code that
%  *   usually works and is as efficient and accurate as possible (all
%  *   3!), but this is very fragile.  It depends on the compiler bugs
%  *   being so perfect that extra precision is never lost before it is
%  *   needed and callers not having complicated code or even function
%  *   calls that would force loss of the extra precision.
%  *
%  * Using double_t's throughout works essentially by producing the same
%  * code as the "usually works" fragile case, but intentionally.
%  * Callers must pass us double_t's, and should also use double_t's for
%  * almost all their variables (and thus intermediate values), to get
%  * consistent extra precision and to avoid conversions.
%  * If the values cannot live throughout in registers, then spilling
%  * them will cause much the same inefficiencies as conversions, but
%  * the spills should now preserve extra precision even in the presence
%  * of the normal compiler bugs in this area, so the final result
%  * remains as accurate as possible.
%  */
% static inline void
% onorm(__double_t *a, __double_t *b)
% {
%         __double_t w;
%
% #ifdef DEBUG
%         volatile __double_t ia, ib, r, vw;
%
%         ia = *a;
%         ib = *b;
%         assert(ia == 0 || fabs(ia) >= fabs(ib));
% #endif
%         w = *a + *b;
%         *b = (*a - w) + *b;
%         *a = w;
%
% #ifdef DEBUG
%         assert((long double)ia + ib == (long double)*a + *b);
%         vw = ia + ib;
%         r = ia - vw;
%         r += ib;
%         assert(vw == *a && r == *b);
% #endif
% }
%
% /*
%  * We now use the inline version again.  This macro version has been
%  * reduced to a wrapper that hides the messy detail of taking the
%  * addresses (`a' and `b' must be lvalues, and the macro obfuscates
%  * this but keeps the API stable).  Perhaps taking the addresses
%  * should be put back in the API.  The results should be identical,
%  * but the inline version is a cycle or 2 faster in more cases than it
%  * is a cycle slower.  Apparently, more assignments help the register
%  * allocator.
%  */
% #define norm(a, b)      onorm(&(a), &(b))
% /* Maybe back to macro version. */
% #define Norm(a, b) do {                         \
%         __double_t w;                           \
%                                                 \
%         w = (a) + (b);                          \
%         (b) = ((a) - w) + (b);                  \
%         (a) = w;                                \
% } while (0)
%
% /*
%  * Set x += c, where x is represented in extra precision as a + b.
%  * x must be sufficiently normalized and sufficiently larger than c,
%  * and the result is then sufficiently normalized.
%  *
%  * The details of ordering are that |a| must be >= |c| (so that (a, c)
%  * can be normalized without extra work to swap 'a' with c).  The
%  * details of the normalization are that b must be small relative to
%  * the normalized a.
%  * Normalization of (a, c) makes the normalized c tiny relative to the
%  * normalized a, so b remains small relative to 'a' in the result.
%  * However, b need not ever be tiny relative to a.  For example, b
%  * might be about 2**20 times smaller than 'a' to give about 20 extra
%  * bits of precision.  That is usually enough, and adding c (which by
%  * normalization is about 2**53 times smaller than a) cannot change b
%  * significantly.  However, cancellation of 'a' with c in
%  * normalization of (a, c) may reduce a significantly relative to b.
%  * The caller must ensure that significant cancellation doesn't occur,
%  * either by having c of the same sign as 'a', or by having |c| a few
%  * percent smaller than |a|.  Pre-normalization of (a, b) may help.
%  *
%  * This is a variant of an algorithm of Kahan (see Knuth (1981) 4.2.2
%  * exercise 19).  We gain considerable efficiency by requiring the
%  * terms to be sufficiently normalized and sufficiently increasing.
%  *
%  * `sp' means that the added term is in "single precision", i.e., not
%  * in multiple precision.
%  */
% #define spadd(a, b, c) do {                     \
%         __double_t __tmp;                       \
%                                                 \
%         __tmp = (c);                            \
%         onorm(&__tmp, &(a));                    \
%         (b) += (a);                             \
%         (a) = __tmp;                            \
% } while (0)
%
% /* norm() for float precision. */
% static inline void
% onormf(__float_t *a, __float_t *b)
% {
%         __float_t w;
%
% #ifdef DEBUG
%         volatile __float_t ia, ib, r, vw;
%
%         ia = *a;
%         ib = *b;
%         assert(ia == 0 || fabsf(ia) >= fabsf(ib));
% #endif
%         w = *a + *b;
%         *b = (*a - w) + *b;
%         *a = w;
%
% #ifdef DEBUG
%         assert((long double)ia + ib == (long double)*a + *b);
%         vw = ia + ib;
%         r = ia - vw;
%         r += ib;
%         assert(vw == *a && r == *b);
% #endif
% }
%
% #define normf(a, b)     onormf(&(a), &(b))
%
% /* spadd() for float precision. */
% #define spaddf(a, b, c) do {                    \
%         __float_t __tmp;                        \
%                                                 \
%         __tmp = (c);                            \
%         onormf(&__tmp, &(a));                   \
%         (b) += (a);                             \
%         (a) = __tmp;                            \
% } while (0)
%
% /*
%  * norm() for long double precision.
%  * This is simpler since there is no extra precision to worry about
%  * (or if there is, then we have larger problems), but without extra
%  * precision the last 2 assertions in the debugging code are even
%  * weaker.
%  */
% static inline void
% onorml(long double *a, long double *b)
% {
%         long double w;
%
% #ifdef DEBUG
%         volatile long double ia, ib, r, vw;
%
%         ia = *a;
%         ib = *b;
%         assert(ia == 0 || fabsl(ia) >= fabsl(ib));
% #endif
%         w = *a + *b;
%         *b = (*a - w) + *b;
%         *a = w;
%
% #ifdef DEBUG
%         assert(ia + ib == *a + *b);
%         vw = ia + ib;
%         r = ia - vw;
%         r += ib;
%         assert(vw == *a && r == *b);
% #endif
% }
%
% #define norml(a, b)     onorml(&(a), &(b))
%
% /* spadd() for long double precision. */
% #define spaddl(a, b, c) do {                    \
%         long double __tmp;                      \
%                                                 \
%         __tmp = (c);                            \
%         onorml(&__tmp, &(a));                   \
%         (b) += (a);                             \
%         (a) = __tmp;                            \
% } while (0)

> Also, the proof you provided (which I haven't gone through in detail
> yet) - do you intend to put them as comments in the code?

No.  It is far too long.

> For example, when you write:
>
> % EXTRACT_WORDS(hx, lx, ax);
> % bits = ((uint64_t)hx << 32) | lx;
> % bits += 0x04000000;
> % hx = bits >> 32;
> % lx = bits & 0xf8000000;
> % INSERT_WORDS(xh, hx, lx);
> % xl = ax - xh;
>
> it would be very helpful to include a comment like
>
> % /*
> %  * xh is ax rounded to nearest 26 bit floating point.
> %  * xl is such that x = xh + xl.
> %  * Because xh is rounded and not truncated to 26 bits,
> %  * only the top 26 bits (not 27=53-26) of the mantissa of
> %  * xl can be non-zero.
> %  */
>
> because while that can be deduced from reading the code, it is by no
> means obvious.

Again, far too long.  This is much more routine than the above.
Splitting is done without comment all over old fdlibm code.  Only the
point about the last bit is not routine.

> By the way, that rounding to the nearest 26 bits is a very smart idea.

But it is the natural way.
It is just easier and more efficient (saves 1 integer instruction) to
lop off the low bits, so lopping should be used if it works (I need to
try it for float and ld80 precisions where it should work because
MANT_DIG is even).

e_hypot.c has hi+lo decomposition that is not as delicate as the above:

%     /* medium size a and b */
%     w = a-b;
%     if (w>b) {
%         t1 = 0;
%         SET_HIGH_WORD(t1,ha);
%         t2 = a-t1;
%         w  = sqrt(t1*t1-(b*(-b)-t2*(a+t1)));
%     } else {
%         a  = a+a;
%         y1 = 0;
%         SET_HIGH_WORD(y1,hb);

This lops off the low 32 bits (y1 = b lopped).  The previous 2 lines
can be written as INSERT_WORDS(y1, hb, 0), but old fdlibm code likes to
write it as above.  Both methods have serious efficiency problems on
modern CPUs unless the compiler optimizes them, since they ask for
partial object accesses (2 * 32-bit words instead of 1 64-bit access).
Old fdlibm code accidentally reduces these problems by asking for a
full initialization of y1.

%         y2 = b - y1;

Remaining 32 bits of b.

%         t1 = 0;
%         SET_HIGH_WORD(t1,ha+0x00100000);
%         t2 = a - t1;

Similarly for a, except a was doubled just above, so ha is out of date
and we fix it up by adding 1 to its exponent bit-field.  This is a good
example of a context-sensitive splitting.  (However, better code
maintains ha next to a and lets the compiler optimize away the change
to ha in paths where the change is not used.)

%         w  = sqrt(t1*y1-(w*(-w)-(t1*y2+t2*b)));
%     }

The arg for sqrt() must be a*a + b*b rearranged for accuracy.  Since
w*w is a 32x32 -> 53 bit operation, this is obviously not a full
doubled precision operation.

I noticed a full doubled precision multiplication most recently in
das's s_fma.c:

% /*
%  * Compute a*b exactly, returning the exact result in a struct dd.  We assume
%  * that both a and b are normalized, so no underflow or overflow will occur.
%  * The current rounding mode must be round-to-nearest.
%  */

I think the requirement on the rounding mode is only so that the
splitting rounds to nearest.
% static inline struct dd
% dd_mul(double a, double b)
% {
%         static const double split = 0x1p27 + 1.0;
%         struct dd ret;
%         double ha, hb, la, lb, p, q;
%
%         p = a * split;
%         ha = a - p;
%         ha += p;

I prefer splitting by addition.  This method seems to be just slower
(1 extra multiplication, and 1 extra dependency, compared with
ha = ha + redux - redux).

%         la = a - ha;

This and the additive method are fragile, and the above is broken on
i386 with rounding precision FP_PE.  clogl() uses the additive method,
but clog() and clogf() use more direct methods to avoid problems with
extra precision.  The assignments to ha are also missing
STRICT_ASSIGN().  Corresponding float precision code would be broken
on i386 with the default rounding precision FP_PD.

We currently assume that the rounding precision on i386 is never
changed from its default of FP_PD, except when calling long double
functions it should always be changed to FP_PE, so that code like the
above works.  We are more careful in float functions.  My tests can be
configured to test non-default precisions.  This only shows a few
double functions that don't work with extra precision.  But many float
functions don't work in only float precision, because they have been
optimized to depend on double precision.

%
%         p = b * split;
%         hb = b - p;
%         hb += p;
%         lb = b - hb;
%
%         p = ha * hb;
%         q = ha * lb + la * hb;

I don't see why this is always exact.  For squaring, it is exact since
it is 2*ha*la.

%
%         ret.hi = p + q;
%         ret.lo = p - ret.hi + q + la * lb;

I don't see why the final addition is exact.  Except for that,
(ret.hi, ret.lo) is norm(p, q), which is exact by Knuth Theorem C.
Note that Theorem C applies because |p| >= |q|.  norm() in the above
depends on this too.  norm() in clog*() uses the larger code justified
by a nearby Theorem (B?).  s_fma.c also has a function dd_add() which
uses the larger code.  I copied the larger code in clog*() from there.
This is an example of why the smaller norm() is a good primitive --
the larger one does more so it is less primitive, and things should be
organized so that the larger one is rarely needed.

%         return (ret);
% }

Doubled precision addition is harder than multiplication, and I
couldn't find a general enough version in s_fma.c to copy.

There is a related problem with naming all of these functions.  norm()
isn't a very descriptive name, but it is the best that I could think
of.  spadd() is closer to sesquied precision than singled or doubled
(it gives doubled + singled -> doubled).  The above dd_mul() is also
only sesquied and closer to singled than is spadd() (it gives
singled x singled -> doubled).  dd_add() is just the larger norm(), so
it gives singled + singled -> doubled.  s_fma.c gets by without
anything like spadd() mainly because it only returns a singled result.
After a dd_add() or 2, it adds the 2 components and then has to do a
delicate rounding step, but no addition of doubled precision results.

There are some more delicate and less routine roundings (not for
splitting) at a critical bit in the middle.  Some of these deserve and
have comments.  See s_cbrt.c (my verbose comment), s_exp2.c (no
comment for redux magic; s_exp2l.c might say more) and e_rem_pio2.c
(redux magic obtained from s_exp2.c).  After seeing a few of these
0x1.MpN redux constants, using them becomes routine except for the
problems with extra precision.

Bruce