Date: Sun, 5 Aug 2012 20:05:53 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Stephen Montgomery-Smith <stephen@missouri.edu> Cc: freebsd-numerics@FreeBSD.org, Bruce Evans <brde@optusnet.com.au> Subject: Re: Complex arg-trig functions Message-ID: <20120805175106.X3574@besplex.bde.org> In-Reply-To: <501D9C36.2040207@missouri.edu> References: <5017111E.6060003@missouri.edu> <501C361D.4010807@missouri.edu> <20120804165555.X1231@besplex.bde.org> <501D51D7.1020101@missouri.edu> <20120805030609.R3101@besplex.bde.org> <501D9C36.2040207@missouri.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 4 Aug 2012, Stephen Montgomery-Smith wrote: > On 08/04/2012 03:45 PM, Bruce Evans wrote: >>> >>>> I made good progress with clog[fl](). The accuracy problems near 0 >>>> are long solved and proved to be solved. > > I am starting to go through your code. > > What is the purpose of the macros norm and spadd? It is to provide easy-to-use maximally efficient primitives for multi-precision arithmetic. Full multi-precision is too slow, so old fdlibm code uses subsets of it tailored to the context. This is about as efficient as possible, but every use looks different so it is hard to read, and context-sensitive tricks are hard to write too. norm() and spadd() are the only primitives that I have found generally useful and efficient. I used to have a primitive for hi+lo splitting, but found it not general enough to use. For splitting, you often have a suitable subset of the bits in a register and don't want to use a general interface that would have to copy all the bits from another variable. Splitting is inherently inefficient, and my primitive originally focused on putting MD code in it to optimize it. I eventually found ways to avoid it or to optimize the MI code, and the MI code turned out to be quite good but is still not optimized well by compilers. Here are my full versions of these interfaces. Documenting them makes them unreadable :-). The following changes are needed: - make everything type-generic. Already done for the cut-down versions for clog*(). I don't know how to put the debugging code in type-generic macros. - deverbosify the comments. - due to the compiler bugfeatures whose description forms the main part of the comments, type-generic macros won't actually work if the arg type is float or double and float differs from float_t or double differs from double_t. These non-type-generic functions tend to cause compile-time warnings when there is a mismatch. How can this even be described without incrementing the verbosity instead of descreasing it? % /*- % * "Normalize" the terms in the infinite-precision expression (*a + *b) % * for the sum of 2 doubles so that *b is as small as possible relative % * to *a. (The resulting *a is the value of the expression in double % * precision and the resulting *b is the rounding error.) |*a| must % * be >= |*b|, and exponent overflow or underflow must not occur. This % * uses a Theorem of Dekker. See Knuth (1981) 4.2.2 Theorem C. % * % * The doubles must actually be double_t's (spelled as __double_t for % * namespace reasons) to avoid the following problems with plain % * doubles: % * % * - If expressions are evaluated in extra precision, then Theorem C % * doesn't apply unless the extra precision is discarded after every % * addition and subtraction. % * - C specifies that casts and assignments discard any extra precision. % * However, this is very broken in many compilers. % * - If we work around the compiler bugs using STRICT_ASSIGN(), then we % * get not only ugly code but also very inefficient code and % * unnecessarily inaccurate code. The conversions forced by the % * strict assignments are not just inefficient (since they involve % * extra operations); they are usually very inefficient since they % * give extra dependencies. Discarding the extra precision that we % * got for "free" then reduces the maximum precision "free"ly % * available in final values. % * - If we don't work around the compiler bugs, then we get code that % * usually works and is efficient and accurate as possible (all 3!), % * but this is very fragile. It depends on the compiler bugs being % * so perfect that extra precision is never lost before it is needed % * and callers not having complicated code or even function calls that % * would force loss of the extra precision. % * % * Using double_t's throughout works essentially by producing the same % * code as the "usually works" fragile case, but intentionally. % * Callers must pass us double_t's, and should also use double_t's for % * almost all their variables (and thus intermediate values), to get % * consistent extra precision and to avoid conversions. If the % * values cannot live throughout in registers, then spilling them will % * cause much the same inefficiencies as conversions, but the spills % * should now preserve extra precision even in the presence of the % * normal compiler bugs in this area, so the final result remains as % * accurate as possible. % */ % static inline void % onorm(__double_t *a, __double_t *b) % { % __double_t w; % % #ifdef DEBUG % volatile __double_t ia, ib, r, vw; % % ia = *a; % ib = *b; % assert(ia == 0 || fabs(ia) >= fabs(ib)); % #endif % w = *a + *b; % *b = (*a - w) + *b; % *a = w; % % #ifdef DEBUG % assert((long double)ia + ib == (long double)*a + *b); % vw = ia + ib; % r = ia - vw; % r += ib; % assert(vw == *a && r == *b); % #endif % } % % /* % * We now use the inline version again. This macro version has been % * reduced to a wrapper that hides the messy detail of taking the addresses % * (`a' and b must be lvalues, and the macro obfuscates this but keeps the % * API stable). Perhaps taking the addresses should be put back in the % * API. The results should be identical, but the inline version version % * is a cycle or 2 faster in more cases than it is a cycle slower. % * Apparently, more assignments help the register allocator. % */ % #define norm(a, b) onorm(&(a), &(b)) % /* Maybe back to macro version. */ % #define Norm(a, b) do { \ % __double_t w; \ % \ % w = (a) + (b); \ % (b) = ((a) - w) + (b); \ % (a) = w; \ % } while (0) % % /* % * Set x += c, where x is represented in extra precision as a + b. % * x must be sufficiently normalized and sufficiently larger than c, % * and the result is then sufficiently normalized. % * % * The details of ordering are that |a| must be >= |c| (so that (a, c) % * can be normalized without extra work to swap 'a' with c). The details of % * the normalization are that b must be small relative to the normalized a. % * Normalization of (a, c) makes the normalized c tiny relative to the % * normalized a, so b remains small relative to 'a' in the result. However, % * b need not ever be tiny relative to a. For example, b might be about % * 2**20 times smaller than 'a' to give about 20 extra bits of precision. % * That is usually enough, and adding c (which by normalization is about % * 2**53 times smaller than a) cannot change b significantly. However, % * cancellation of 'a' with c in normalization of (a, c) may reduce a % * significantly relative to b. The caller must ensure that significant % * cancellation doesn't occur, either by having c of the same sign as 'a', % * or by having |c| a few percent smaller than |a|. Pre-normalization of % * (a, b) may help. % * % * This is is a variant of an algorithm of Kahan (see Knuth (1981) 4.2.2 % * exercise 19). We gain considerable efficiency by requiring the terms to % * be sufficiently normalized and sufficiently increasing. % * % * `sp' means that the added term is in "single precision", i.e., not % * in multiple precision. % */ % #define spadd(a, b, c) do { \ % __double_t __tmp; \ % \ % __tmp = (c); \ % onorm(&__tmp, &(a)); \ % (b) += (a); \ % (a) = __tmp; \ % } while (0) % % /* norm() for float precision. */ % static inline void % onormf(__float_t *a, __float_t *b) % { % __float_t w; % % #ifdef DEBUG % volatile __float_t ia, ib, r, vw; % % ia = *a; % ib = *b; % assert(ia == 0 || fabsf(ia) >= fabsf(ib)); % #endif % w = *a + *b; % *b = (*a - w) + *b; % *a = w; % % #ifdef DEBUG % assert((long double)ia + ib == (long double)*a + *b); % vw = ia + ib; % r = ia - vw; % r += ib; % assert(vw == *a && r == *b); % #endif % } % % #define normf(a, b) onormf(&(a), &(b)) % % /* spadd() for float precision. */ % #define spaddf(a, b, c) do { \ % __float_t __tmp; \ % \ % __tmp = (c); \ % onormf(&__tmp, &(a)); \ % (b) += (a); \ % (a) = __tmp; \ % } while (0) % % /* % * norm() for long double precision. This is simpler since there is no % * extra precision to worry about (or if there is, then we have larger % * problems), but without extra precision the last 2 assertions in the % * debugging code are even weaker. % */ % static inline void % onorml(long double *a, long double *b) % { % long double w; % % #ifdef DEBUG % volatile long double ia, ib, r, vw; % % ia = *a; % ib = *b; % assert(ia == 0 || fabsl(ia) >= fabsl(ib)); % #endif % w = *a + *b; % *b = (*a - w) + *b; % *a = w; % % #ifdef DEBUG % assert(ia + ib == *a + *b); % vw = ia + ib; % r = ia - vw; % r += ib; % assert(vw == *a && r == *b); % #endif % } % % #define norml(a, b) onorml(&(a), &(b)) % % /* spadd() for long double precision. */ % #define spaddl(a, b, c) do { \ % long double __tmp; \ % \ % __tmp = (c); \ % onorml(&__tmp, &(a)); \ % (b) += (a); \ % (a) = __tmp; \ % } while (0) <end of full norm(), etc.> > Also, the proof you provided (which I haven't gone through in detail yet) - > do you intend to put them as comments in the code? No. It is far too long. > For example, when you write: > > % EXTRACT_WORDS(hx, lx, ax); > % bits = ((uint64_t)hx << 32) | lx; > % bits += 0x04000000; > % hx = bits >> 32; > % lx = bits & 0xf8000000; > % INSERT_WORDS(xh, hx, lx); > % xl = ax - xh; > > it would be very helpful to include a comment like > > % /* > * xh is ax rounded to nearest 26 bit floating point. > * xl is such that x = xh + xl. > * Because xh is rounded and not truncated to 26 bits, > * only the top 26 bits (not 27=53-26) of the mantissa of > * xl can be non-zero. > */ > > because while that can be deduced from reading the code, it is by no means > obvious. Again, far too long. This is much more routine than the above. Splitting is done without comment all over old fdlibm code. Only the point about the last bit is not routine. > By the way, that rounding to the nearest 26 bits is a very smart idea. But it is the natural way. It is just easier and more efficient (saves 1 integer instruction) to lop off the low bits, so lopping should be used if it works (I need to try it for float and ld80 precisions where it should work because MANT_DIG is even). e_hypot.c has hi+lo decomposition that is not as delicate as the above: % /* medium size a and b */ % w = a-b; % if (w>b) { % t1 = 0; % SET_HIGH_WORD(t1,ha); % t2 = a-t1; % w = sqrt(t1*t1-(b*(-b)-t2*(a+t1))); % } else { % a = a+a; % y1 = 0; % SET_HIGH_WORD(y1,hb); This lops off the low 32 bits (y1 = b lopped). The previous 2 lines can be written as INSERT_WORDS(y1, hb, 0), but old fdlibm code likes to write it as above. Both methods have serious efficiency problems on modern CPUs unless the compiler optimizes them, since they asks for partial object accesses (2 * 32-bit words instead of 1 64-bit access). Old fdlibm code accidentally reduces these problems by asking for a full initialization of y. % y2 = b - y1; Remaining 32 bits of b. % t1 = 0; % SET_HIGH_WORD(t1,ha+0x00100000); % t2 = a - t1; Similarly for a, except a was doubled just above, so ha is out of date and we fix it up by adding 1 to its exponent bit-field. This is a good example of a context-sensitive splitting. (However, better code maintain ha next to a and let the compiler optimize away the change to ha in paths where the change is not used.) % w = sqrt(t1*y1-(w*(-w)-(t1*y2+t2*b))); % } The arg for sqrt() must be a*a + b*b rearranged for accurcy. Since w*w is a 32x32 -> 53 bit operation, this is obviously not a full doubled precision operation. I noticed a full doubled precision multiplication most recently in das's s_fma.c: % /* % * Compute a*b exactly, returning the exact result in a struct dd. We assume % * that both a and b are normalized, so no underflow or overflow will occur. % * The current rounding mode must be round-to-nearest. % */ I think the requirement on the rounding mode is only so that the splitting rounds to nearest. % static inline struct dd % dd_mul(double a, double b) % { % static const double split = 0x1p27 + 1.0; % struct dd ret; % double ha, hb, la, lb, p, q; % % p = a * split; % ha = a - p; % ha += p; I prefer splitting by addition. This method seems to be just slower (1 extra multiplication and giving 1 extra dependence than for ha = ha + redux - redux). % la = a - ha; This and the additive method are fragile, and the above is broken on i386 with rounding precision FP_PE. clogl() uses the additive method, but clog() and clogf() use more direct methods to avoid problems with extra precision. The assignments to ha are also missing STRICT_ASSIGN(). Corresponding float precision code would be broken on i386 with the default rounding precision FP_PD. We currently assume than the rounding precision on i386 is never changed from its default of FP_PD, except when calling long double functions it should always be changed to FP_PE, so that code like above code works. We are more careful in float functions. My tests can be configured to test non-default precisions. This only shows a few double functions that don't work with extra precision. But many float functions don't work in only float precision, because they have been optimized to depend on double precision. % % p = b * split; % hb = b - p; % hb += p; % lb = b - hb; % % p = ha * hb; % q = ha * lb + la * hb; I don't see why this is always exact. For squaring, it is exact since it is 2*ha*la. % % ret.hi = p + q; % ret.lo = p - ret.hi + q + la * lb; I don't see why the final addition is exact. Except for that, (ret.hi, ret.lo) is norm(p, q), which is exact by Knuth Theorem C. Note that Theorem C applies because |p| >= |q|. norm() in the above depends on this too. norm() in clog*() uses a the larger code justified by a nearby Theorem (B?). s_fma.c also has a function dd_add() which uses the larger code. I copied the larger code in clog*() from there. This is an example of why the smaller norm() is a good primitive -- the larger one does more so it is less primitive, and things should be organized so that the larger one is rarely needed. % return (ret); % } Doubled precision addition is harder than multiplication, and I couldn't find a general enough version in s_fma.c to copy. There is a related problem with naming all of these functions. norm() isn't a very descriptive name, but it is the best that I could think of. spadd() is closer to sesquied precision than singled or doubled (it gives doubled + singled -> doubled). The above dd_mul() is also only sesquied and closer to singled than is spadd() (it gives singled x singled -> doubled). dd_add() is just the larger norm(), so it gives singled + singled -> doubled. s_fma.c gets by without anything like spadd() mainly because it only returns a singled result. After a dp_add() or 2, it adds the 2 components and then has to do a delicate rounding step, but no addition of doubled precision results. There are some more delicate and less routine roundings (not for splitting) at a critical bit in the middle. Some of these deserve and have comments. See s_cbrt.c (my verbose comment), s_exp2.c (no comment for redux magic; s_exp2l.c might say more) and e_rem_pio2.c (redux magic obtained from s_exp2.c). After seeing a few of these 0x1.MpN redux constants, using them becomes routine except for the problems with extra precision. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120805175106.X3574>