From owner-freebsd-numerics@FreeBSD.ORG Fri Oct 5 21:32:28 2012 Return-Path: Delivered-To: freebsd-numerics@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 738D8106566B for ; Fri, 5 Oct 2012 21:32:28 +0000 (UTC) (envelope-from peter@vps.rulingia.com) Received: from vps.rulingia.com (host-122-100-2-194.octopus.com.au [122.100.2.194]) by mx1.freebsd.org (Postfix) with ESMTP id B40E18FC0A for ; Fri, 5 Oct 2012 21:32:26 +0000 (UTC) Received: from vps.rulingia.com (localhost [127.0.0.1]) by vps.rulingia.com (8.14.5/8.14.5) with ESMTP id q95LWIRD090469 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Sat, 6 Oct 2012 07:32:18 +1000 (EST) (envelope-from peter@vps.rulingia.com) Received: (from peter@localhost) by vps.rulingia.com (8.14.5/8.14.5/Submit) id q95LWHc3090468 for freebsd-numerics@freebsd.org; Sat, 6 Oct 2012 07:32:17 +1000 (EST) (envelope-from peter) Date: Sat, 6 Oct 2012 07:32:17 +1000 From: Peter Jeremy To: freebsd-numerics@freebsd.org Message-ID: <20121005213217.GA90440@vps.rulingia.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.5.21 (2010-09-15) Subject: Implementing cpow(3) X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 05 Oct 2012 21:32:28 -0000 I have been meditating on cpow(3) for some time without achieving enlightenment. The following is somewhat of a braindump in the hope that someone will advise whether I am on the right track or have gone off the rails. A naive solution to cpow(3) looks like: double complex cpow(double complex z, double complex w) { return (cexp(w * clog(z)); } but this suffers from both spurious exceptions and precision loss. A somewhat expanded approach makes this more obvious: double complex cpow(double complex z, double complex w) { double rw, iw; double rad, lrad, th; double rrad, rth; rw = creal(w); iw = cimag(w); rad = cabs(z); lrad = log(rad); th = carg(z); rrad = exp(rw * lrad - iw * th); rth = rw * th + iw * lrad; return (cpack(rrad * cos(rth), rrad * sin(rth))); } This approach has the advantage of not requiring any complex arithmetic and it's also easier to explicitly handle exception cases. Working through the spurious exceptions and precision loss issues: - cabs(z) can overflow when z has large but finite real & imaginary parts. - log(x) pushes exponent bits into the fraction, thus a double precision log(x) loses about 10 bits of precision. - The '-' & '+' expressions can both suffer catastrophic cancellation. - Because exp(x) pushes fraction bits into the exponent, achieving 53 bits of precision in the result requires about 63 bits of input fraction. - Since both sin() and cos() have a range of [-1,1], rrad can potentially exceed DBL_MAX even though the return value is finite. Note that from here on, the code is intended to demonstrate the general approach, rather than as committable code, and assumes that 0, Inf and NaN are special cased externally. Note that denormals need special handling to do ilogb() and scalbn() via bit-manipulation. The precision loss in log(rad) can be avoided by separately handling the exponent bits - ie split rad into int nrad and double frad such that: rad = 2**nrad * frad, 1 <= frad < 2 Then: lrad = log(rad) = log(2**nrad * frad) = log(2) * nrad + log(frad) The limited range of frad means that log(frad) should not lose any precision and the two sides of the addition can be stored separately. For later use, assume: double lrad0 = log(2.) * (double)nrad; double lrad1 = log(frad); Note that for x << 1, log(1 + x) =~ x. Therefore lrad1 is either 0 or 1p-53 ≤ lrad1 < log(2) ~= 0.693. Moving back a line, cabs(z) can avoid spurious overflow & provide a split result via: struct split { int n; double f; } cabs(double complex z) { double rz, iz; int nrz, niz; double frz, fiz; double y; int ny; rz = creal(z); nrz = ilogb(rz); frz = scalbn(rz, -nrz); iz = cimag(z); niz = ilogb(iz); /* * sqrt(a*a + b*b) = abs(a) * sqrt(1 + (b/a)**2) * For x << 1, sqrt(1 + x*x) is approximately (1 + x*x/2) * therefore the "sqrt(1 + (b/a)**2)" term can be ignored * if b is more than half the size of the mantissa smaller * than a. */ if (nrz > niz + DBL_MANT_DIG/2) return ({nrz, fabs(frz)}); if (niz > nrz + DBL_MANT_DIG/2) return ({niz, fabs(scalbn(iz, -niz))}); fiz = scalbn(iz, -nrz); /* No underflow/overflow is possible due to restricted range */ y = sqrt(frz*frz + fiz*fiz); ny = ilogb(y); return ({nrz + ny, scalbn(y, -ny)}); } A potential enhancement would be to merge log(cabs(z)), noting that log(sqrt(x)) == log(x)/2 and log(sqrt(x*x + y*y)) == log(abs(x) * sqrt(1 + (y*y)/(x*x))) == log(abs(x)) + log(1 + (y*y)/(x*x))/2 == log(abs(x)) + log1p((y*y)/(x*x))/2 (Note that the use of the log1p() approach would imply an additional lrad2 term that is not handled in the following) Looking at "rth = rw * th + iw * lrad;": This is used solely as an argument for sin() or cos() - both of which have a period of 2π, therefore rth can be evaluated modulo 2π, as can each term in the expression. Assuming lrad is split as (lrad0 + lrad1) and mod_2pi() performs a high-precision modulo 2π: rth = mod_2pi(rw) * th + mod_2pi(iw) * (mod_2pi(lrad0) + mod_2pi(lrad1)); This should minimise the catastrophic cancellation. This leaves "rrad = exp(rw * lrad - iw * th);" or, given the lrad split: x = rw * log(2) * nrad + rw * lrad1 - iw * th; rrad = exp(x); The ranges of the various terms are: rw, iw: Any double -π ≤ th ≤ π -1074 ≤ nrad ≤ 1024 (allowing for denormals & z = DBL_MAX * (1 + i)) lrad1 = 0 or 1p-53 ≤ lrad1 < log(2) ~= 0.693. Since the final result is return (rrad * cis(rth)); and the range of cis(t) is |cis(t)| ≤ 1 or, alternatively ignoring 0 values: -1074 ≤ log₂(|cis(t)|) ≤ 0 this implies rrad needs to have a range -1074 ≤ log₂(|rrad|) ≤ 2097 (the latter value allows for rrad * DBL_MIN = DBL_MAX) and provide 53 bits of precision over -1023 < log₂(|rrad|) < 2047 In order to achieve the required range/precision, x needs to be expressed as: x = k * log(2) + r, |r| ≤ log(2) (ideally ≤ log(2)/2) Giving rrad = 2**k * exp(r) or double t = exp(r); return (cpack(scalbn(t * cos(rth), k), scalbn(t * sin(rth), k))); Unfortunately, it's not immediately obvious to me how to get from x = rw * log(2) * nrad + rw * lrad1 - iw * th; to x = k * log(2) + r, |r| ≤ log(2) (ideally ≤ log(2)/2) with the range/precision requirements. In particular, all multiplications need to be able to return values exceeding DBL_MAX and have sufficient precision to provide a 53-bit final result in the face of catastrophic cancellation. Does anyone have any suggestions as to how to proceed? -- Peter Jeremy