From owner-freebsd-numerics@FreeBSD.ORG Sat Oct 6 01:47:33 2012 Return-Path: Delivered-To: freebsd-numerics@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 1667F106566B for ; Sat, 6 Oct 2012 01:47:33 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id 5F54D8FC12 for ; Sat, 6 Oct 2012 01:47:32 +0000 (UTC) Received: from c122-106-157-84.carlnfd1.nsw.optusnet.com.au (c122-106-157-84.carlnfd1.nsw.optusnet.com.au [122.106.157.84]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q961lMhd002428 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 6 Oct 2012 11:47:23 +1000 Date: Sat, 6 Oct 2012 11:47:22 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Peter Jeremy In-Reply-To: <20121005213217.GA90440@vps.rulingia.com> Message-ID: <20121006091156.D1385@besplex.bde.org> References: <20121005213217.GA90440@vps.rulingia.com> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-961849994-1349488042=:1385" Cc: freebsd-numerics@FreeBSD.org Subject: Re: Implementing cpow(3) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 06 Oct 2012 01:47:33 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --0-961849994-1349488042=:1385 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Sat, 6 Oct 2012, Peter Jeremy wrote: > I have been meditating on cpow(3) for some time without achieving > enlightenment. The following is somewhat of a braindump in the > hope that someone will advise whether I am on the right track or > have gone off the rails. > ... > - Because exp(x) pushes fraction bits into the exponent, achieving 53 bit= s > of precision in the result requires about 63 bits of input fraction. Look at the real pow() to see how it gets enough (?) extra precision, and also for how it handles exceptional and other special cases. Look at both the fdlibm version and the old BSD libm version (in the Attic, unless svn lost it). I've barely looked at them, but noticed that they do the following for extra precision: - fdlibm duplicates most of exp() and log() manually inline in exp(), so as to get at their internals. It is hard to see how this provides enough extra precision, since the internals are only accurate to half an ulp. The power series are accurate to 1/32 ulps, but it is hard to evaluate them that accurately. Well, maybe pow() does the usual extra-precision things for the highest term of the power series only. - old BSD libm uses the extra-precision functions __exp__D() and __log__D(). These have a chance of providing an extra 25 bits of precision. It is much easier to understand, so I will include parts of it here: % * Required kernel functions: % *=09exp__D(a,c)=09=09=09exp(a + c) for |a| << |c| % *=09struct d_double dlog(x)=09=09r.a + r.b, |r.b| < |r.a| % ... % double pow(x,y)=20 % double x,y; % { % =09double t; % =09if (y=3D=3Dzero) % =09=09return (one); % =09else if (y=3D=3Done || (_IEEE && x !=3D x)) % =09=09return (x);=09=09/* if x is NaN or y=3D1 */ % =09else if (_IEEE && y!=3Dy)=09=09/* if y is NaN */ % =09=09return (y); % =09else if (!finite(y))=09=09/* if y is INF */ % =09=09if ((t=3Dfabs(x))=3D=3Done)=09/* +-1 ** +-INF is NaN */ % =09=09=09return (y - y); % =09=09else if (t>one) % =09=09=09return ((y<0)? zero : ((x0)? zero : ((x<0)? y-y : -y)); % =09else if (y=3D=3Dtwo) % =09=09return (x*x); % =09else if (y=3D=3Dnegone) % =09=09return (one/x); % /* x > 0, x =3D=3D +0 */ % =09else if (copysign(one, x) =3D=3D one) % =09=09return (pow_P(x, y)); %=20 % /* sign(x)=3D -1 */ % =09/* if y is an even integer */ % =09else if ( (t=3Ddrem(y,two)) =3D=3D zero) % =09=09return (pow_P(-x, y)); %=20 % =09/* if y is an odd integer */ % =09else if (copysign(t,one) =3D=3D one) % =09=09return (-pow_P(-x, y)); Apart from exceptional cases, it only has these special cases for integers. This is not much different from fdlibm pow(). %=20 % =09/* Henceforth y is not an integer */ % =09else if (x=3D=3Dzero)=09/* x is -0 */ % =09=09return ((y>zero)? -x : one/(-x)); % =09else if (_IEEE) % =09=09return (zero/zero); % =09else % =09=09return (infnan(EDOM)); % } % /* kernel function for x >=3D 0 */ % static double % #ifdef _ANSI_SOURCE % pow_P(double x, double y) % #else % pow_P(x, y) double x, y; % #endif % { % =09struct Double s, t, __log__D(); % =09double __exp__D(), huge =3D 1e300, tiny =3D 1e-300; %=20 % =09if (x =3D=3D zero) % =09=09if (y > zero) % =09=09=09return (zero); % =09=09else if (_IEEE) % =09=09=09return (huge*huge); % =09=09else % =09=09=09return (infnan(ERANGE)); % =09if (x =3D=3D one) % =09=09return (one); % =09if (!finite(x)) % =09=09if (y < zero) % =09=09=09return (zero); % =09=09else if (_IEEE) % =09=09=09return (huge*huge); % =09=09else % =09=09=09return (infnan(ERANGE)); % =09if (y >=3D 7e18)=09=09/* infinity */ % =09=09if (x < 1) % =09=09=09return(tiny*tiny); % =09=09else if (_IEEE) % =09=09=09return (huge*huge); % =09=09else % =09=09=09return (infnan(ERANGE)); Don't know why it handles some exceptional cases here instead of all in the main function. %=20 % =09/* Return exp(y*log(x)), using simulated extended */ % =09/* precision for the log and the multiply.=09 */ %=20 % =09s =3D __log__D(x); % =09t.a =3D y; % =09TRUNC(t.a); % =09t.b =3D y - t.a; % =09t.b =3D s.b*y + t.b*s.a; % =09t.a *=3D s.a; % =09s.a =3D t.a + t.b; % =09s.b =3D (t.a - s.a) + t.b; % =09return (__exp__D(s.a, s.b)); % } It handles all non-special cases here, using the formula with extra precision. Note that __log__D() starts with standard precision and produces extra precision, while __exp__D() starts with extra precision and produces standard precision. This is exactly what is needed here. These functions are still in libm, since I brought them back to use with old BSD tgamma(). But they are not very good. My log*() works similarly to __log__D(). It currently only provides a static inline interface for exporting the extra precision that it produces. It produces only about 8 extra bits. __log__D() has a chance of producing 26 extra, but I expect it barely produces 8. It is fairly easy to change tables and polynomial degrees to produce 26 if required. Current expl() has similar extra precision internally to __log__D(), but it produces it instead of starting with it, so it would be not so easy to convert it to __exp__D()'s interface. However, some versions of exp2*() are a good base for such a conversion, since exp2(x) =3D=3D pow(2, x) so it has similar requirements. It would be generally useful to have versions of exp() and log() that both take and return extra precision. My log1pl() almost does this internally. > - Since both sin() and cos() have a range of [-1,1], rrad can potentially > exceed DBL_MAX even though the return value is finite. This is a solved problem. See how cexp() and ccosh() handle it. > ... > Moving back a line, cabs(z) can avoid spurious overflow & provide a > split result via: > > struct split { > =09int n; > =09double f; > } cabs(double complex z) > { > =09double rz, iz; > =09int nrz, niz; > =09double frz, fiz; > =09double y; > =09int ny; > > =09rz =3D creal(z); > =09nrz =3D ilogb(rz); > =09frz =3D scalbn(rz, -nrz); > =09iz =3D cimag(z); > =09niz =3D ilogb(iz); > =09/* > =09 * sqrt(a*a + b*b) =3D abs(a) * sqrt(1 + (b/a)**2) > =09 * For x << 1, sqrt(1 + x*x) is approximately (1 + x*x/2) > =09 * therefore the "sqrt(1 + (b/a)**2)" term can be ignored > =09 * if b is more than half the size of the mantissa smaller > =09 * than a. > =09 */ > =09if (nrz > niz + DBL_MANT_DIG/2) > =09=09return ({nrz, fabs(frz)}); > =09if (niz > nrz + DBL_MANT_DIG/2) > =09=09return ({niz, fabs(scalbn(iz, -niz))}); > > =09fiz =3D scalbn(iz, -nrz); > =09/* No underflow/overflow is possible due to restricted range */ > =09y =3D sqrt(frz*frz + fiz*fiz); > =09ny =3D ilogb(y); > =09return ({nrz + ny, scalbn(y, -ny)}); > } Here you are reinventing a bit too much of hypot(), or rather clog(). Scaling is certainly necessary to avoid spurious overflow, so we can't just use these functions. There are also large problems with accuracy, which the above doesn't solve and clog() probably doesn't solve well enough for use here. I snipped too much above so I quote the pseudo-code again: > double complex cpow(double complex z, double complex w) > { > =09double rw, iw; > =09double rad, lrad, th; > =09double rrad, rth; >=20 > =09rw =3D creal(w); > =09iw =3D cimag(w); > =09rad =3D cabs(z); > =09lrad =3D log(rad); log(rad) cancels up to ~156 bits (3 * DBL_MAX less 3) when |z| is near 1, so this won't work without tripled or quadrupled double precision. clog() has to work hard to produce 52 bits of precision using only doubled double precision. > =09th =3D carg(z); >=20 > =09rrad =3D exp(rw * lrad - iw * th); There will be additional cancelations here. It seems to hard to avoid them completely, but it would be good to start with at least full precision in lrad and th. I know how to get full precision for lrad: in clog(), where it evaluates log1p(hi + lo) in the critical case, use an extra-precision log1p() on (hi, lo). This would also give an extra- precision result if hi+lo has extra precision. The doubled double precision caclulations usually give lots of extra precision in hi+lo, but I forget if they give much extra when |z| is near 1. > =09rth =3D rw * th + iw * lrad; Can cancel too. >=20 > =09return (cpack(rrad * cos(rth), rrad * sin(rth))); > } No cancelations here. This is "easy", since it is just like the solved problem for ccosh(). We just have to scale the exp() factor for both. > A potential enhancement would be to merge log(cabs(z)), noting that > log(sqrt(x)) =3D=3D log(x)/2 > and > log(sqrt(x*x + y*y)) =3D=3D log(abs(x) * sqrt(1 + (y*y)/(x*x))) > =3D=3D log(abs(x)) + log(1 + (y*y)/(x*x))/2 > =3D=3D log(abs(x)) + log1p((y*y)/(x*x))/2 > (Note that the use of the log1p() approach would imply an additional > lrad2 term that is not handled in the following) clog() does stuff like this, especially in discarded versions. |y/x| must be fairly small, else this just gives an extra ulp or 2 of error. If |y/x| is very small, then the above can be evaluated efficiently using log1p(ratio) ~=3D ratio. I tried this, but it didn't improve the average speed, although the test emphasized unusual cases with small or large ratios. So I only use this when |y/x| is so small that it would underflow, so the log1p() term goes away as a side effect of avoiding underflow. cpow() could only do better than clog() by inlining clog() and adding complications. Seems too hard. I now see that the overflow avoidance in clog() is enough for calculating lrad for use here, but then the expressions for rrad and rth may overflow. These expressions just give ordinary complex multiplication, which may overflow, and one reason we are writing them as real expressions is to possibly avoid overflow. It can be avoided by scaling |lrad| and |th| to <=3D 0.5, with complications when one of them would underflow. > Looking at "rth =3D rw * th + iw * lrad;": This is used solely as an > argument for sin() or cos() - both of which have a period of 2=CF=80, > therefore rth can be evaluated modulo 2=CF=80, as can each term in the > expression. Assuming lrad is split as (lrad0 + lrad1) and mod_2pi() > performs a high-precision modulo 2=CF=80: > rth =3D mod_2pi(rw) * th + mod_2pi(iw) * (mod_2pi(lrad0) + mod_2pi(lrad1= )); > This should minimise the catastrophic cancellation. I think Stephen said that this only works for special args. The mod_2pi operation is difficult, but is supported by __kernel_rem_pio2(). But this point shows that the cancelation problem is larger than usual -- it is not just at 0, but at all multiples of 2*Pi (or Pi/2). But I think it is much larger at 0, since other multiples are hard to get close to. Easier with 2 variables than 1 though. > This leaves "rrad =3D exp(rw * lrad - iw * th);" or, given the lrad split= : > x =3D rw * log(2) * nrad + rw * lrad1 - iw * th; > rrad =3D exp(x); I snipped the part about the rad split. I think it wouldn't work too well. You would have to evaluate the above long expression for x mostly in exitra precision. It's easier to have an extra-precision log(). In the above, the full extra precision expression is: x =3D rw * (ln2_hi + ln2_lo) * nrad + rw * (lrad1_hi + lrad1_lo - iw * (th_hi + th_lo);=09/* not literal; also need hi+lo for rw... */ We can let the extra-precision log() do some of the additions, and "only" have to evaluate: x =3D rw * (lrad_hi + lrad_lo) - iw * (th_hi + th_lo);=09/* not literal.= =2E. */ > The ranges of the various terms are: > rw, iw: Any double > [... unreadable part with binary characters] > ... > In order to achieve the required range/precision, x needs to be expressed= as: > x =3D k * log(2) + r, |r| =E2=89=A4 log(2) (ideally =E2=89=A4 log(2)/2) > Giving > rrad =3D 2**k * exp(r) > or > double t =3D exp(r); > return (cpack(scalbn(t * cos(rth), k), scalbn(t * sin(rth), k))); > > Unfortunately, it's not immediately obvious to me how to get from > x =3D rw * log(2) * nrad + rw * lrad1 - iw * th; > to > x =3D k * log(2) + r, |r| =E2=89=A4 log(2) (ideally =E2=89=A4 log(2)/2) > with the range/precision requirements. In particular, all > multiplications need to be able to return values exceeding DBL_MAX and > have sufficient precision to provide a 53-bit final result in the face > of catastrophic cancellation. See cexp(). It calls __ldexp_cexp() to handle all the scaling details. Oops, __ldexp_cexp() cannot handle extra precision that we may have in x, we need significant extra precision in x since exp() will expand the error. So we need an __ldexp_cexp() corresponding to __exp__D(). But that is easi= er than getting enough precision in x. I would try evaluating it all in merel= y doubled double precision. clog() shows how to do this for the simpler expression ax*ax + ay*ay - 1. Bruce --0-961849994-1349488042=:1385--