From owner-freebsd-standards@FreeBSD.ORG Mon Oct 29 11:07:16 2007 Return-Path: Delivered-To: freebsd-standards@hub.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 710F216A46E for ; Mon, 29 Oct 2007 11:07:13 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (freefall.freebsd.org [IPv6:2001:4f8:fff6::28]) by mx1.freebsd.org (Postfix) with ESMTP id 3C6BE13C48A for ; Mon, 29 Oct 2007 11:07:13 +0000 (UTC) (envelope-from owner-bugmaster@FreeBSD.org) Received: from freefall.freebsd.org (localhost [127.0.0.1]) by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l9TB7DBQ090266 for ; Mon, 29 Oct 2007 11:07:13 GMT (envelope-from owner-bugmaster@FreeBSD.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l9TB7CN4090262 for freebsd-standards@FreeBSD.org; Mon, 29 Oct 2007 11:07:12 GMT (envelope-from owner-bugmaster@FreeBSD.org) Date: Mon, 29 Oct 2007 11:07:12 GMT Message-Id: <200710291107.l9TB7CN4090262@freefall.freebsd.org> X-Authentication-Warning: freefall.freebsd.org: gnats set sender to owner-bugmaster@FreeBSD.org using -f From: FreeBSD bugmaster To: freebsd-standards@FreeBSD.org Cc: Subject: Current problem reports assigned to freebsd-standards@FreeBSD.org X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 29 Oct 2007 11:07:17 -0000 Current FreeBSD problem reports Critical problems Serious problems S Tracker Resp. Description -------------------------------------------------------------------------------- o bin/25542 standards /bin/sh: null char in quoted string o kern/46239 standards posix semaphore implementation errors o stand/54410 standards one-true-awk not POSIX compliant (no extended REs) o stand/82654 standards C99 long double math functions are missing o stand/94729 standards [libc] fcntl() throws undocumented ENOTTY 5 problems total. Non-critical problems S Tracker Resp. Description -------------------------------------------------------------------------------- o bin/21519 standards sys/dir.h should be deprecated some more o bin/24390 standards Replacing old dir-symlinks when using /bin/ln s stand/24590 standards timezone function not compatible witn Single Unix Spec s kern/28260 standards UIO_MAXIOV needs to be made public s stand/36076 standards Implementation of POSIX fuser command o stand/39256 standards snprintf/vsnprintf aren't POSIX-conformant for strings p stand/41576 standards POSIX compliance of ln(1) o stand/44425 standards getcwd() succeeds even if current dir has perm 000. o stand/46119 standards Priority problems for SCHED_OTHER using pthreads o stand/54833 standards [pcvt] more pcvt deficits o stand/54839 standards [pcvt] pcvt deficits p stand/55112 standards glob.h, glob_t's gl_pathc should be "size_t", not "int o stand/56476 standards cd9660 unicode support simple hack o stand/58676 standards grantpt(3) alters storage used by ptsname(3) s stand/62858 standards malloc(0) not C99 compliant s kern/64875 standards [libc] [patch] [feature request] add a system call: fd o stand/66357 standards make POSIX conformance problem ('sh -e' & '+' command- o stand/66531 standards _gettemp uses a far smaller set of filenames than docu o stand/70813 standards [PATCH] ls(1) not Posix compliant o stand/72006 standards floating point formating in non-C locales o stand/79056 standards regex(3) regression tests a stand/80293 standards sysconf() does not support well-defined unistd values o stand/81287 standards [PATCH]: fingerd(8) might send a line not ending in CR o stand/83845 standards [libm] [patch] add log2() and log2f() support for libm o stand/85080 standards output of long double subnormals (with printf) is wron o stand/92360 standards [headers] [patch] Missing TAB3 in kernel headers o stand/92362 standards [headers] [patch] Missing SIGPOLL in kernel headers o kern/93705 standards [headers] [patch] ENODATA and EGREGIOUS (for glibc com o stand/96016 standards [headers] clock_getres et al should be in o stand/96236 standards [PATCH] [POSIX] sed.1 incorrectly describes a function p stand/99517 standards Missing SIGRTMIN and SIGRTMAX signals o stand/99960 standards [Patch] make(1): Add -p flag o stand/100017 standards [Patch] Add fuser(1) functionality to fstat(1) o stand/104743 standards [headers] [patch] Wrong values for _POSIX_ minimal lim o stand/104841 standards [libm] [patch] C99 long double square root. o stand/107561 standards [patch] Missing SUS function tcgetsid() o kern/114578 standards [libc] wide character printing using swprintf(dst, n, o stand/114633 standards /etc/rc.subr: line 511: omits a quotation mark: "force o stand/116081 standards make does not work with the directive sinclude o stand/116221 standards SUS issue -- FreeBSD has not flag WNOWAIT for wait*() o stand/116826 standards [PATCH] sh support for POSIX character classes 41 problems total. From owner-freebsd-standards@FreeBSD.ORG Sat Nov 3 00:13:53 2007 Return-Path: Delivered-To: freebsd-standards@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id CDC4916A421 for ; Sat, 3 Nov 2007 00:13:53 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.208.78.105]) by mx1.freebsd.org (Postfix) with ESMTP id 9AB9813C491 for ; Sat, 3 Nov 2007 00:13:53 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu [127.0.0.1]) by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id lA30D6io083003 for ; Fri, 2 Nov 2007 17:13:06 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id lA30D5Ht083002 for freebsd-standards@freebsd.org; Fri, 2 Nov 2007 17:13:05 -0700 (PDT) (envelope-from sgk) Date: Fri, 2 Nov 2007 17:13:05 -0700 From: Steve Kargl To: freebsd-standards@freebsd.org Message-ID: <20071103001305.GA82775@troutmask.apl.washington.edu> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="RnlQjJ0d97Da+TV1" Content-Disposition: inline User-Agent: Mutt/1.4.2.3i X-Content-Filtered-By: Mailman/MimeDel 2.1.5 Subject: Implementation of expl() X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Nov 2007 00:13:53 -0000 --RnlQjJ0d97Da+TV1 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline With all the success of developing several other missing C99 math routines, I decided to tackle expl() (which I need to tackle tanhl()). As with the other functions, this implementation is known to work on amd64 and should be valid on other 64-bit architectures. As with the previous function, I can only do a limited amount of statistical testing of expl(). The reference values in testing are computing using GMP/MPFR with 265 bits of precision. The reference value is then converted to a long double for comparison to my expl(). Testing shows: Special cases: exp(0) = 1e+00 exp(inf) = inf exp(-inf)= 0e+00 exp(nan) = nan With x in the non-exceptional range of [ln(LDBL_MIN):ln(LDBL_MAX)], I computed 22.7 million values of expl() that span the range. The maximum observed ULP was 1 where only 7378 values (0.0325%) had a ULP exceeding 0.5. In the exceptional range leading to gradual underflow, testing was much more difficult. Computing a ULP seemed troublesome because of the loss of precision as the subnormal numbers approach zero. So, I broke the range into 4 segments and compute the relative maximum error in expl() compared to a value computed by GMP/MPFR. x in [-11366.00:-11355.14] RME > 1e-18 --> (73524/1086300) = 6.77% RME > 1e-10 --> (69314/1086300) = 6.38% RME > 1e-8 --> (69314/1086300) = 6.38% RME > 1e-4 --> (69314/1086300) = 6.38% x in [-11377.00:-11366.00] RME > 1e-18 --> (2/1100000) = 0.00% x in [-11388.00:-11377.00] RME > 1e-18 --> (0/1100000) = 0.00% x in [-11399.50:-11388.00] RME > 1e-18 --> (156152/1149852) = 13.58% RME > 1e-10 --> (156152/1149852) = 13.58% RME > 1e-8 --> (156152/1149852) = 13.58% RME > 1e-4 --> (136328/1149852) = 11.86% The above suggests that in the middle on exceptional range, my expl() gives an accurate anwer. In the last segment, the RME significantly increases and is attributed to the loss of precision. Although I'm not completely satisfied with the gradual underflow behavior, I'm providing the code for comments and possible inclusion into src/msun. Note, a diff against cvs is not provided because of the entanglement of my other patches to math.h, Symbol.map, and Makefile. Enjoy. -- Steve --RnlQjJ0d97Da+TV1-- From owner-freebsd-standards@FreeBSD.ORG Sat Nov 3 05:15:46 2007 Return-Path: Delivered-To: freebsd-standards@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 4992A16A418 for ; Sat, 3 Nov 2007 05:15:46 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au [211.29.133.51]) by mx1.freebsd.org (Postfix) with ESMTP id E1DF413C491 for ; Sat, 3 Nov 2007 05:15:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id lA35FRMG008246 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 3 Nov 2007 16:15:28 +1100 Date: Sat, 3 Nov 2007 16:15:27 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl In-Reply-To: <20071012180959.GA36345@troutmask.apl.washington.edu> Message-ID: <20071103155721.K671@besplex.bde.org> References: <20071012180959.GA36345@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-standards@FreeBSD.org Subject: Re: [PATCH] hypotl, cabsl, and code removal in cabs X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Nov 2007 05:15:46 -0000 On Fri, 12 Oct 2007, Steve Kargl wrote: > The attached patch implements hypotl(), cabsl(), and removes > z_abs() from w_cabs.c. z_abs() is undocumented namespace > pollution in libm, and no header file under /usr/include > provides a prototype. The documentation has been updated, > and in particular an incorrect paragraph in cabs.3 has been > removed. I fear that that paragraph is correct again, since the long double version only seems to take enough care with the overflow cases. % Index: man/hypot.3 % ... % @@ -82,11 +90,6 @@ Consequently % exactly; % in general, hypot and cabs return an integer whenever an % integer might be expected. % -.Pp % -The same cannot be said for the shorter and faster version of hypot % -and cabs that is provided in the comments in cabs.c; its error can % -exceed 1.2 % -.Em ulps . It's silly that on i386's, naive double precision code would work if the default precision were extended, but the double precision code isn't naive; OTOH, long double precision code can never depend on extra precision, but is naive. % Index: src/e_hypotl.c % ... % + if (n >= m) { % + a.e *= a.e; % + m -= n; % + if (m > - PREC) { % + b.bits.exp += m; % + a.e += b.e * b.e; % + } % + a.e = sqrtl(a.e); % + a.bits.exp += n; % + return (a.e); The usual case reduces to just sqrtl(x*x + y*y). fdlibm is more careful in this case (but similar in others): % * So, compute sqrt(x*x+y*y) with some care as % * follows to get the error below 1 ulp: % * % * Assume x>y>0; % * (if possible, set rounding to round-to-nearest) % * 1. if x > 2y use % * x1*x1+(y*y+(x2*(x+x1))) for x*x+y*y % * where x1 = x with lower 32 bits cleared, x2 = x-x1; else % * 2. if x <= 2y use % * t1*y1+((x-y)*(x-y)+(t1*y2+t2*y)) % * where t1 = 2x with lower 32 bits cleared, t2 = 2x-t1, % * y1= y with lower 32 bits chopped, y2 = y-y1. I don't understand the details of this, and couldn't find any cases where this makes a difference. Testing of 2-arg functions is hard since exhaustive testing is impossible even for floats, and the cases where this matters for hypot() seem to be sparse. Bruce From owner-freebsd-standards@FreeBSD.ORG Sat Nov 3 20:14:09 2007 Return-Path: Delivered-To: freebsd-standards@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 3631616A418 for ; Sat, 3 Nov 2007 20:14:09 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.208.78.105]) by mx1.freebsd.org (Postfix) with ESMTP id 164E813C480 for ; Sat, 3 Nov 2007 20:14:09 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu [127.0.0.1]) by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id lA3JxqCl091988; Sat, 3 Nov 2007 12:59:52 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id lA3Jxqve091987; Sat, 3 Nov 2007 12:59:52 -0700 (PDT) (envelope-from sgk) Date: Sat, 3 Nov 2007 12:59:52 -0700 From: Steve Kargl To: Bruce Evans Message-ID: <20071103195952.GA91682@troutmask.apl.washington.edu> References: <20071012180959.GA36345@troutmask.apl.washington.edu> <20071103155721.K671@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20071103155721.K671@besplex.bde.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-standards@FreeBSD.org Subject: Re: [PATCH] hypotl, cabsl, and code removal in cabs X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Nov 2007 20:14:09 -0000 On Sat, Nov 03, 2007 at 04:15:27PM +1100, Bruce Evans wrote: > > On Fri, 12 Oct 2007, Steve Kargl wrote: > > >The attached patch implements hypotl(), cabsl(), and removes > >z_abs() from w_cabs.c. z_abs() is undocumented namespace > >pollution in libm, and no header file under /usr/include > >provides a prototype. The documentation has been updated, > >and in particular an incorrect paragraph in cabs.3 has been > >removed. > > I fear that that paragraph is correct again, since the long double > version only seems to take enough care with the overflow cases. Ah, no. Underflow is prevented by the if() statement. a.e *= a.e; m -= n; if (m > - PREC) { b.bits.exp += m; a.e += b.e * b.e; } a.e = sqrtl(a.e); a.bits.exp += n; return (a.e); If m < -PREC, then |b|/|a| < 2**(-32), so it skips the "a.e += b.e * b.e;" "a.e. *= a.e" can't underflow because a.e is in [0.5,1), and "a.e = sqrtl(a.e);" can't underflow because, here, a.e is in [0.25,2). > % Index: man/hypot.3 > % ... > % @@ -82,11 +90,6 @@ Consequently > % exactly; > % in general, hypot and cabs return an integer whenever an > % integer might be expected. > % -.Pp > % -The same cannot be said for the shorter and faster version of hypot > % -and cabs that is provided in the comments in cabs.c; its error can > % -exceed 1.2 > % -.Em ulps . This paragraph is useless. There is no cabs.c in the source tree, and in particular, there is no cabs.c in an installed system. Surely, this paragraph isn't referring to src/contrib/libf2c/libF77/cabs.c. It has one comment with a total of one word. > It's silly that on i386's, naive double precision code would work if > the default precision were extended, but the double precision code > isn't naive; OTOH, long double precision code can never depend on extra > precision, but is naive. I don't understand to what the above comment is directed. > % Index: src/e_hypotl.c > % ... > % + if (n >= m) { > % + a.e *= a.e; > % + m -= n; > % + if (m > - PREC) { > % + b.bits.exp += m; > % + a.e += b.e * b.e; > % + } > % + a.e = sqrtl(a.e); > % + a.bits.exp += n; > % + return (a.e); > > The usual case reduces to just sqrtl(x*x + y*y). See my comment above. The usual case isn't the important case. It's the unusual case that is of concern. The above code prevents the unusual, yet provides acceptable performance in the usual case. > fdlibm is more careful in this case (but similar in others): > > % * So, compute sqrt(x*x+y*y) with some care as > % * follows to get the error below 1 ulp: > % * > % * Assume x>y>0; > % * (if possible, set rounding to round-to-nearest) > % * 1. if x > 2y use > % * x1*x1+(y*y+(x2*(x+x1))) for x*x+y*y > % * where x1 = x with lower 32 bits cleared, x2 = x-x1; else > % * 2. if x <= 2y use > % * t1*y1+((x-y)*(x-y)+(t1*y2+t2*y)) > % * where t1 = 2x with lower 32 bits cleared, t2 = 2x-t1, > % * y1= y with lower 32 bits chopped, y2 = y-y1. > > I don't understand the details of this, and couldn't find any cases > where this makes a difference. Testing of 2-arg functions is hard > since exhaustive testing is impossible even for floats, and the > cases where this matters for hypot() seem to be sparse. The above seems more complicated than needed when one considers that x and y can be easily split into |x|=a*2**n and |y|=b*2**m where f,g are in [0.5,1). One then has the comment that is found in the code: * Compute hypotl(x, y) = sqrt(x**2 + y**2) such that underflow and * overflow are avoided. To accomplish the task, write x = a * 2**n * and y = b * 2**m, one then has * * hypotl(x, y) = 2**n * sqrt(a**2 + b**(2*(m - n))) for n >= m * or * hypotl(x, y) = 2**m * sqrt(a**(2*(n - m)) + b**2) for n < m * * where a and b are in [0.5, 1). If (m - n) is less than -32 (for * a 64-bit significand), then b**(2*(m - n)) can be ignored. -- Steve From owner-freebsd-standards@FreeBSD.ORG Sat Nov 3 21:31:45 2007 Return-Path: Delivered-To: freebsd-standards@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id E2F8D16A418 for ; Sat, 3 Nov 2007 21:31:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu [128.208.78.105]) by mx1.freebsd.org (Postfix) with ESMTP id BD44513C4A5 for ; Sat, 3 Nov 2007 21:31:45 +0000 (UTC) (envelope-from sgk@troutmask.apl.washington.edu) Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu [127.0.0.1]) by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id lA3LU02s092922; Sat, 3 Nov 2007 14:30:00 -0700 (PDT) (envelope-from sgk@troutmask.apl.washington.edu) Received: (from sgk@localhost) by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id lA3LU035092921; Sat, 3 Nov 2007 14:30:00 -0700 (PDT) (envelope-from sgk) Date: Sat, 3 Nov 2007 14:30:00 -0700 From: Steve Kargl To: Bruce Evans Message-ID: <20071103213000.GA92030@troutmask.apl.washington.edu> References: <20070928152227.GA39233@troutmask.apl.washington.edu> <20071001173736.U1985@besplex.bde.org> <20071002001154.GA3782@troutmask.apl.washington.edu> <20071002172317.GA95181@VARK.MIT.EDU> <20071002173237.GA12586@troutmask.apl.washington.edu> <20071003103519.X14175@delplex.bde.org> <20071010204249.GA7446@troutmask.apl.washington.edu> <20071103144151.U671@besplex.bde.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20071103144151.U671@besplex.bde.org> User-Agent: Mutt/1.4.2.3i Cc: freebsd-standards@freebsd.org Subject: Re: long double broken on i386? X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Nov 2007 21:31:46 -0000 On Sat, Nov 03, 2007 at 03:53:19PM +1100, Bruce Evans wrote: > On Wed, 10 Oct 2007, Steve Kargl wrote: > >I've given up on i386. All of my attempts to make my C implementations > >work on both amd64 and i386 have failed. The code when run on i386 > >results in ULP's that exceed 10^5. Yes, that is 1E5. Results for amd64 > >are reported below. As far as assembly, someone else will need to write > >those routines because I do not know assembly. > > Bad errors are expressed in GULPs (Giga-ULPs) :-). Maybe you are hitting > the compiler bugs that break double-precision arithmetic on i386's when > the rounding precision is changed to long double. The rounding precision > must be changed for long doubles to work at all. Then the compiler bugs > probably affect even long doubles, since you use k_rem_pio2.c and it uses > long doubles and is believed to be affected by the bugs. k_rem_pio2f.c > _was_ affected by the corresponding bugs from float precision being smaller > than the default double precision, and has some fixes, but the fixes may > be incomplete and have not been merged to k_rem_pio2.c. I did some limiting testing of the output from k_rem_pio2.c against argument reduction using MPFR. I found no long double values that gave results which did not agree. I fairly certain that the GULPs are do to improper polynomial approximations for the 53-bit precision long doubles on i386. > >>(4): These functions are useless unless the default is changed. Users > >>have always been able to get equivalent _in_accuracy on i386 by just > >>using the double precision versions. Hacks like #define sinl sin > >>almost work for just getting things to compile. > > > >AFAIK, C99 says the prototype in math.h is "long double sinl(long > >double);". > >The #define hack simply won't cut it. > > I only said that it will almost work. It will give almost > "long double sinl(double);", since "double sin(double);" gives almost > "long double sin(double);". Of course, if a programmer adds the #define to her code, then I can agree with you. My point is that the #define is unacceptable for /usr/include/math.h. C99 is explicit about the prototype. > >>In fact, on i386, due > >>to the bugfeatures outlined in my previous mail, users can even get > >>full long double accuracy using hacks like that, even if the rounding > >>precision is 53 bits: > > > >(code removed) > > > >>On i386, this stores the full long double result for sinl(1) in x, due > >>to the bugfeatures -- sin(1) returns extra precision in a register, and > >>gcc doesn't understand the extra precision so it just stores all bits > >>of the register in x. The behaviour would be similar if sin(1) were used > >>in an expression. > > > >Given GCC's history, relying on the accidental full long double > >accuracy because FP registers are 80 bits seems to be folly. It > > Isn't 20 years of history of doing it like this (slightly wrong) enough? Probably. Unfortunately, I'm also aware of the last 15 years of GCC history where things can suddenly change :) > I only said it is a hack. However, the library can depend on the > current compiler behaviour since it is tied to the compiler. > > >also isn't clear to me what sin(x) will do with x > DBL_MAX. Will > >1e400L be interpreted as +Inf or the long double value 1e400? Yes, > >I'm fully aware that such a large argument may not be the greatest > >idea due to argument reduction. > > sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double > like the prototype says. Since the i386 ABI requires passing doubles > on the stack, any extra precision and argument range will be lost. > In particular, values > DBL_MAX will be converted to +Inf (except I'm > not sure if rounding down of a value just slightly larger than DBL_MAX > can give DBL_MAX). > > >>>PS: There is also the minor possibility of giving bde either > >>>a heartache or death via laughter when he finally sees what I > >>>have done. :) > >> > >>I would be unhappy with any versions that aren't either: > >>- as identical as possible in algorithm and style to the double precision > >> versions, in the same way that the float versions are as identical as > >> possible, or > >>- much better in algorithm and style and efficiency and accuracy than the > >> double precision versions. I don't want to maintain different versions. > > > >I'll go with your second option see comments below and attached > >code. BTW, if I add > > So I'm unhappy that the first point hasn't been addressed at all. This > gives quality < the current float and double precision versions. You're moving the goal. You gave me an "either/or" option. I took the "or", so the first point is irrelevant. Well, your observation on quality is subjective because I'd claim my (almost) KNF commented code is much better than the fdlibm code. > >#ifdef sinl > >#undef sinl > >#endif > >#define sinl sin > > > >to the top of my sin_test.c code, I get ULPs that exceed 10^9. > > I guess this is from loss of exponent range for args < DBL_EPSILON. > Gradual underflow (denormals) only reduces this loss a little. After > underflow to 0, I think the loss is about 2^ > ~= 2^(16384-1024) ULPs = many GULPs. Between DBL_EPSILON and DBL_MAX, > the loss is about 2^ = 2^11. Between DBL_MAX > and LDBL_MAX, the loss is infinity ULPs. I admit that I haven't studied the issue too carefully. > >Back to sinl, cosl, tanl. The attached diff contains everything > >that one needs for these functions on an architecture with 64-bit > >significand long double. Note that portions of the diff for Makefile > >and math.h contain irrelevant changes that are only relevant for > >my ccosh, sqrtl, and cbrtl changes. Also note that the Makefile only > >includes my long double routines when LDBL_PREC == 64. My testing on > >FreeBSD-amd64 shows > > > >troutmask:sgk[219] ./sin_test 4000000 10 400000000 > > Number of test loops to run: 10 > >Number of PRN in [0,4.000000e+08] per run: 4000000 > > ULP > >1.0000 (5) > >1.0000 (4) > >1.0000 (6) > >1.0000 (4) > >1.0000 (3) > >1.0000 (7) > >1.0000 (3) > >1.0000 (7) > >1.0000 (5) > >1.0000 (5) > > --- > > 49 > > > >The number in parenthesis is the count of values that give a sinl(x) > >that exceeds 0.5 ULP. On the reduced range of [0,pi/4] where > >__kernel_cosl and __kernel_sinl are used, in all my testing no value > >of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP. Thus, I > >suspect (but haven't tried to confirm), the 1.0 ULP observed above is > >due to the argument reduction. > > The argument reduction seems to begin correctly, but then the extra > precision which is carefully provided by k_rem_pio2.c is thrown away, > unlike in the float and double versions which pass it to the __kernel > functions. I would expect a loss of up to a full ULP from this, giving > a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg > reduction). I had __kernel_cosl and __kernel_sinl that retained both a high and low portion of the reduced arguments, but in the range tested I did not find any advantage to doing the extra precision arithmetic. These may have been discarded during my i386 GULP experience, so I revisted the issue on amd64. > >My timings on the basic routines using gprof are: > > > > % cumulative self self total > >time seconds seconds calls ms/call ms/call name > > 0.6 804.42 5.67 40000000 0.00 0.00 sinl [18] > > 0.1 934.41 0.74 19998751 0.00 0.00 __kernel_sinl [95] > > 0.1 935.82 0.66 20001249 0.00 0.00 __kernel_cosl [96] > > > >where the system contains a dual-cpu opteron tyan motherboard running > >at 2.2 GHz. > > That seems to be about 175 nS/call. Not too bad. (My current times on > a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for > the C double version, 43 nS for the asm double version, and 15 nS for > the C float version). Well, I've been more concerned about correctness over speed. I'd need to read the grof output again, but I recall that argument reduction was the major time consumer. > >... > >The testing of tanl() is a little more interesting. > > > >troutmask:sgk[227] ./tan_test 4000000 10 400000000 > > Number of test loops to run: 10 > >Number of PRN in [0,4.000000e+08] per run: 4000000 > > ULP > >1.0000 (24788) > >1.0000 (24633) > >1.5000 (24626) > >1.0000 (24820) > >1.0000 (24652) > >1.5000 (24526) > >1.0000 (24749) > >1.0000 (24873) > >1.0000 (24656) > >1.0000 (24701) > > ------- > > 247024 > > > > 0.5 911.88 5.80 40000000 0.00 0.00 tanl [16] > > 0.5 933.18 5.01 40000000 0.00 0.00 __kernel_tanl [53] > > > >At first glance, the 1.5 ULP and the total count of values of x > >that generate a result with a ULP that exceeds 0.5 seems disconcerting. > >However, testing of __kernel_tanl() shows that for all tested values of x > >in [0,pi/4] the maximum ULP did exceed 0.5. This suggests that tanl() is > >more sensitive to errors in the argument reduction than sinl() and cosl(). > > This is because tan() has a larger derivative than sin() and cos(). > E.g., tan'(Pi/4) = 2; this doubles any errors in arg reduction near > Pi/2. The fdlibm __kernels take extra care near Pi/4 to reduce these > errors. Also, between Pi/2 and Pi/2, the division in tan(x) = -1 / > tan(Pi/2) gives an extra error of up to 0.5 ULPs. The fdlibm __kernels > take extra care to reduce this error. With 3 or 4 sources of errors > in the final steps, with each error of up to 0.5 ULPs, it takes lots > of extra care to keep the final error below 1 ULP. OK. I'll revisit fdlibm's algorithms to see if I can adapt it to my code. > >Do with the code as you see fit. I'll be moving on to hypothl and > >other functions as I have time. > > I wouldn't use it directly since it doesn't copy fdlibm to get the final > steps right. Its earlier steps are better, but I would prefer those to > copy fdlibm too, so that there aren't many different algorithms to maintain. > Note that the full long double precision versions require about the same > amount of code as the float and double precision versions combined, since > there are 2 very different long double precision formats on supported > machines (80 active bits for most, but 128 bits for sparc64). > > % Index: msun/src/k_cosl.c > % =================================================================== > % RCS file: msun/src/k_cosl.c > % diff -N msun/src/k_cosl.c > % --- /dev/null 1 Jan 1970 00:00:00 -0000 > % +++ msun/src/k_cosl.c 9 Oct 2007 21:52:32 -0000 > % @@ -0,0 +1,62 @@ > % +/* > % + * Compute cos(x) on the interval [-1:1] with a polynomial approximation. > > Interval is larger than necessary. This might result in unnecessary > coefficients. 10 of them seem to be too many for 64 bits of mantissa > but not enough for 113 bits (sparc64). Yes, I know. My original Chebyshev interpolation code assumed a function f(x) is in [-1,1], which of course makes things easy. The code has been updated to allow f(x) in [a,b]. I plan to update the coefficients for the restricted range. > > % + * The soefficients are determined from a Chebyshev interpolation scheme, > > Spelling. > > % + * In practice, this routine is called only for x in the interval > [0:pi/4]. > % + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows > % + * an accuracy of <= 0.5 ULP. > % + */ > > I think testing is neither necessary nor sufficient. One just finds a > polynomial that has enough accuracy in infinite precision (the poly > should have about 8 more bits of precision than the number of mantissa > bits), then checks that there is no magic in coefficients than causes > problems (a large second coefficient will cause problems, especially > if it is not exactly representable. Here 0.5 is quite large, but is > exactly representable, and for sin*(), 0.333 is just small enough not > to cause problems despite it not being exactly representable). > > % +#define C0 (long double) 0x1.000000000000000p0L > % +#define C1 (long double) -0x8.000000000000000p-4L > > Various style bugs. The casts have no effect. Whoops, sorry about that. This is a holdover from my attempts at double-double precision arithmetic. > > % Index: msun/src/k_tanl.c > % =================================================================== > % RCS file: msun/src/k_tanl.c > % diff -N msun/src/k_tanl.c > % ... > % +#define C33 (long double) 0x1.8f939ed1883f595p-32L > > This has far too many terms for a 64-bit mantissa, but not enough for a > 113-bit mantissa. IIRC, the former requires about 25 terms. This is partly due to the range being larger than required. > % +long double > % +__kernel_tanl(long double x) > % +{ > % + long double t, xs; > % + > % + xs = x * x; > % + > % + t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs * > % + (C5 + xs * (C6 + xs * (C7 + xs * (C8 + xs * (C9 + xs * > % + (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs * > % + (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs * > % + (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs * > % + (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs * > % + (C30 + xs * (C31 + xs * > % + (C32 + xs * C33)))))))))))))))))))))))))))))))); > > With so many terms, better scheduling is more needed. > > Much more extra care is needed for an accuracy of < 1 ULP for tan*() than > for sin*() and cos*(). Well, I did code both steps 3 and 4 from k_tan.c, but I saw no statistical improvement in my testing. > % Index: msun/src/s_cosl.c > % ... > % +/* > % + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi. > % + * The table was taken from e_rem_pio2.c. > % + */ > % +static const int32_t two_over_pi[] = { > % +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, > % +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A, > > This gives far too many copies of this table (instead of just too many). > The float and double versions avoid having a copy for each of cos/sin/tan > by doing arg reduction in e_rem_pio2[f].c. The extra layer for this file > is a pessimization but seems to be worth it to avoid the duplication. > > Are 396 hex digits enough? It can't be right for all of the float, double > and long double cases. I hope it is enough for 128-bit long doubles, and > the float case is just sloppy in using the same table as the double case. > If 396 hex digits is enough for all cases, then the table should be in > k_rem_pio2.c. In my testing, 396 hex digits appear to be sufficient for 64-bit long doubles. Without access to 128-bit long double, I won't be investigating the issue. > % +long double > % +cosl(long double x) > % +{ > % + union IEEEl2bits z; > > I want to avoid using this union... > > % + z.e = x; > % + z.bits.sign = 0; > > ... since, while code like this is easy to write, it hard-codes the > assumption that things are layed out in bit-fields. Access macros > are more flexible and don't require so much compiler intelligence > to optimize away the memory accesses that are logically required for > bit-fields. gcc does a mediocre jobs with this optimization, > especially for 80-bit long doubles on amd64 and i386 since the bits > don't fit evenly in registers. math_private.h doesn't have GET_* or SET_* macros for use with 64-bit long double (not to mention 128-bit long double), and the macros in math_private.h use similar unions. AFAICT, all supported archs have an _fpmath.h header. > I see parametrizing the bit-field sizes and other magic numbers (not > quoted) for various formats of long doubles as the main difficulty > here (trig functions are easy except for this and k_rem_pio2.c). Which explains why GET_* and SET_* don't exist. Although these might be present in bdeBSD. :) > % Index: msun/src/s_tanl.c > > Similarly. > > % ... > + switch (e0 & 3) { > + ... > + case 1: > + case 3: > + z.e = - 1.L / __kernel_tanl(z.e); > + break; > > The fdlibm has to do this division in the __kernel since doing it carefully > takes too much code to inline here (there are two cases, one for z.e near > Pi/4 where everything is done differently, and the general case where > things are simple except for the division). > > Note that the fdlibm __kernel code for trig functions is generic except for: > > - my optimizations in the float case > - the magic number for the threshold near Pi/4 for tan() > - the number and values of terms in the polynomial, and the scheduling > (best scheduling depends on the number of terms, and we don't have this > automated). > > Thus it is easier to clone the correct code than to rewrite it incorrectly. It may be easier to clone fdlibm, but I'm not sure I'd call it correct code due to - magic numbers - questionable source formatting (it some of the worse code I've read) - lack of comments -- Steve From owner-freebsd-standards@FreeBSD.ORG Sat Nov 3 22:55:21 2007 Return-Path: Delivered-To: freebsd-standards@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BB01116A418 for ; Sat, 3 Nov 2007 22:55:21 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx02.syd.optusnet.com.au (fallbackmx02.syd.optusnet.com.au [211.29.133.72]) by mx1.freebsd.org (Postfix) with ESMTP id 170BE13C4AA for ; Sat, 3 Nov 2007 22:55:20 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au [211.29.132.191]) by fallbackmx02.syd.optusnet.com.au (8.12.11.20060308/8.12.11) with ESMTP id lA34rdNC017032 for ; Sat, 3 Nov 2007 15:53:39 +1100 Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au [211.30.219.213]) by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id lA34rJJS017331 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 3 Nov 2007 15:53:22 +1100 Date: Sat, 3 Nov 2007 15:53:19 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Steve Kargl In-Reply-To: <20071010204249.GA7446@troutmask.apl.washington.edu> Message-ID: <20071103144151.U671@besplex.bde.org> References: <20070928152227.GA39233@troutmask.apl.washington.edu> <20071001173736.U1985@besplex.bde.org> <20071002001154.GA3782@troutmask.apl.washington.edu> <20071002172317.GA95181@VARK.MIT.EDU> <20071002173237.GA12586@troutmask.apl.washington.edu> <20071003103519.X14175@delplex.bde.org> <20071010204249.GA7446@troutmask.apl.washington.edu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-standards@FreeBSD.org Subject: Re: long double broken on i386? X-BeenThere: freebsd-standards@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Standards compliance List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 03 Nov 2007 22:55:21 -0000 This thread is old, so I quoted too much. On Wed, 10 Oct 2007, Steve Kargl wrote: > On Wed, Oct 03, 2007 at 11:13:38AM +1000, Bruce Evans wrote: >> On Tue, 2 Oct 2007, Steve Kargl wrote: >> >>> On Tue, Oct 02, 2007 at 01:23:18PM -0400, David Schultz wrote: >>>> ... >>>> Anyway, my point is that if you have something that works >>>> reasonably well and doesn't have egregious errors, my suggestion >>>> is to just commit it and not kill yourself over whether the >>>> argument reduction is correct in the last ulp. >> >>> There are a few problems: 1) I don't have commit access. 2) I >>> need to do style(9) clean-up pass. 3) I've only tested these >>> on i386 and amd64, and I know these fail for sparc64. 4) Most >>> importantly, I don't know how the project wants to handle the >>> 53 vs 64 bit default fpu setting on i386. >> >> (3) is most important. The C versions will never even be used for >> i386 or amd64, since these arches have sin and cos in hardware and (I >> believe) it is impossible to beat the hardware on these arches. The >> asm versions have always been trivial to implement (change 2 bytes in >> s_sin.S (compare e_sqrt.S with e_sqrtf.S; the long double changes go >> the other way). amd64 actually uses the C versions for double precision >> becase the C versions are competitive with the hardware for speed and >> beat it for accuracy on large args and on small args aear a multiple >> of pi/2. > > I've given up on i386. All of my attempts to make my C implementations > work on both amd64 and i386 have failed. The code when run on i386 > results in ULP's that exceed 10^5. Yes, that is 1E5. Results for amd64 > are reported below. As far as assembly, someone else will need to write > those routines because I do not know assembly. Bad errors are expressed in GULPs (Giga-ULPs) :-). Maybe you are hitting the compiler bugs that break double-precision arithmetic on i386's when the rounding precision is changed to long double. The rounding precision must be changed for long doubles to work at all. Then the compiler bugs probably affect even long doubles, since you use k_rem_pio2.c and it uses long doubles and is believed to be affected by the bugs. k_rem_pio2f.c _was_ affected by the corresponding bugs from float precision being smaller than the default double precision, and has some fixes, but the fixes may be incomplete and have not been merged to k_rem_pio2.c. >> (4): These functions are useless unless the default is changed. Users >> have always been able to get equivalent _in_accuracy on i386 by just >> using the double precision versions. Hacks like #define sinl sin >> almost work for just getting things to compile. > > AFAIK, C99 says the prototype in math.h is "long double sinl(long double);". > The #define hack simply won't cut it. I only said that it will almost work. It will give almost "long double sinl(double);", since "double sin(double);" gives almost "long double sin(double);". >> In fact, on i386, due >> to the bugfeatures outlined in my previous mail, users can even get >> full long double accuracy using hacks like that, even if the rounding >> precision is 53 bits: > > (code removed) > >> On i386, this stores the full long double result for sinl(1) in x, due >> to the bugfeatures -- sin(1) returns extra precision in a register, and >> gcc doesn't understand the extra precision so it just stores all bits >> of the register in x. The behaviour would be similar if sin(1) were used >> in an expression. > > Given GCC's history, relying on the accidental full long double > accuracy because FP registers are 80 bits seems to be folly. It Isn't 20 years of history of doing it like this (slightly wrong) enough? I only said it is a hack. However, the library can depend on the current compiler behaviour since it is tied to the compiler. > also isn't clear to me what sin(x) will do with x > DBL_MAX. Will > 1e400L be interpreted as +Inf or the long double value 1e400? Yes, > I'm fully aware that such a large argument may not be the greatest > idea due to argument reduction. sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double like the prototype says. Since the i386 ABI requires passing doubles on the stack, any extra precision and argument range will be lost. In particular, values > DBL_MAX will be converted to +Inf (except I'm not sure if rounding down of a value just slightly larger than DBL_MAX can give DBL_MAX). >>> PS: There is also the minor possibility of giving bde either >>> a heartache or death via laughter when he finally sees what I >>> have done. :) >> >> I would be unhappy with any versions that aren't either: >> - as identical as possible in algorithm and style to the double precision >> versions, in the same way that the float versions are as identical as >> possible, or >> - much better in algorithm and style and efficiency and accuracy than the >> double precision versions. I don't want to maintain different versions. > > I'll go with your second option see comments below and attached > code. BTW, if I add So I'm unhappy that the first point hasn't been addressed at all. This gives quality < the current float and double precision versions. > #ifdef sinl > #undef sinl > #endif > #define sinl sin > > to the top of my sin_test.c code, I get ULPs that exceed 10^9. I guess this is from loss of exponent range for args < DBL_EPSILON. Gradual underflow (denormals) only reduces this loss a little. After underflow to 0, I think the loss is about 2^ ~= 2^(16384-1024) ULPs = many GULPs. Between DBL_EPSILON and DBL_MAX, the loss is about 2^ = 2^11. Between DBL_MAX and LDBL_MAX, the loss is infinity ULPs. > Back to sinl, cosl, tanl. The attached diff contains everything > that one needs for these functions on an architecture with 64-bit > significand long double. Note that portions of the diff for Makefile > and math.h contain irrelevant changes that are only relevant for > my ccosh, sqrtl, and cbrtl changes. Also note that the Makefile only > includes my long double routines when LDBL_PREC == 64. My testing on > FreeBSD-amd64 shows > > troutmask:sgk[219] ./sin_test 4000000 10 400000000 > Number of test loops to run: 10 > Number of PRN in [0,4.000000e+08] per run: 4000000 > ULP > 1.0000 (5) > 1.0000 (4) > 1.0000 (6) > 1.0000 (4) > 1.0000 (3) > 1.0000 (7) > 1.0000 (3) > 1.0000 (7) > 1.0000 (5) > 1.0000 (5) > --- > 49 > > The number in parenthesis is the count of values that give a sinl(x) > that exceeds 0.5 ULP. On the reduced range of [0,pi/4] where > __kernel_cosl and __kernel_sinl are used, in all my testing no value > of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP. Thus, I > suspect (but haven't tried to confirm), the 1.0 ULP observed above is > due to the argument reduction. The argument reduction seems to begin correctly, but then the extra precision which is carefully provided by k_rem_pio2.c is thrown away, unlike in the float and double versions which pass it to the __kernel functions. I would expect a loss of up to a full ULP from this, giving a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg reduction). > My timings on the basic routines using gprof are: > > % cumulative self self total > time seconds seconds calls ms/call ms/call name > 0.6 804.42 5.67 40000000 0.00 0.00 sinl [18] > 0.1 934.41 0.74 19998751 0.00 0.00 __kernel_sinl [95] > 0.1 935.82 0.66 20001249 0.00 0.00 __kernel_cosl [96] > > where the system contains a dual-cpu opteron tyan motherboard running > at 2.2 GHz. That seems to be about 175 nS/call. Not too bad. (My current times on a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for the C double version, 43 nS for the asm double version, and 15 nS for the C float version). > ... > The testing of tanl() is a little more interesting. > > troutmask:sgk[227] ./tan_test 4000000 10 400000000 > Number of test loops to run: 10 > Number of PRN in [0,4.000000e+08] per run: 4000000 > ULP > 1.0000 (24788) > 1.0000 (24633) > 1.5000 (24626) > 1.0000 (24820) > 1.0000 (24652) > 1.5000 (24526) > 1.0000 (24749) > 1.0000 (24873) > 1.0000 (24656) > 1.0000 (24701) > ------- > 247024 > > 0.5 911.88 5.80 40000000 0.00 0.00 tanl [16] > 0.5 933.18 5.01 40000000 0.00 0.00 __kernel_tanl [53] > > At first glance, the 1.5 ULP and the total count of values of x > that generate a result with a ULP that exceeds 0.5 seems disconcerting. > However, testing of __kernel_tanl() shows that for all tested values of x > in [0,pi/4] the maximum ULP did exceed 0.5. This suggests that tanl() is > more sensitive to errors in the argument reduction than sinl() and cosl(). This is because tan() has a larger derivative than sin() and cos(). E.g., tan'(Pi/4) = 2; this doubles any errors in arg reduction near Pi/2. The fdlibm __kernels take extra care near Pi/4 to reduce these errors. Also, between Pi/2 and Pi/2, the division in tan(x) = -1 / tan(Pi/2) gives an extra error of up to 0.5 ULPs. The fdlibm __kernels take extra care to reduce this error. With 3 or 4 sources of errors in the final steps, with each error of up to 0.5 ULPs, it takes lots of extra care to keep the final error below 1 ULP. > Do with the code as you see fit. I'll be moving on to hypothl and > other functions as I have time. I wouldn't use it directly since it doesn't copy fdlibm to get the final steps right. Its earlier steps are better, but I would prefer those to copy fdlibm too, so that there aren't many different algorithms to maintain. Note that the full long double precision versions require about the same amount of code as the float and double precision versions combined, since there are 2 very different long double precision formats on supported machines (80 active bits for most, but 128 bits for sparc64). % Index: msun/src/k_cosl.c % =================================================================== % RCS file: msun/src/k_cosl.c % diff -N msun/src/k_cosl.c % --- /dev/null 1 Jan 1970 00:00:00 -0000 % +++ msun/src/k_cosl.c 9 Oct 2007 21:52:32 -0000 % @@ -0,0 +1,62 @@ % +/* % + * Compute cos(x) on the interval [-1:1] with a polynomial approximation. Interval is larger than necessary. This might result in unnecessary coefficients. 10 of them seem to be too many for 64 bits of mantissa but not enough for 113 bits (sparc64). % + * The soefficients are determined from a Chebyshev interpolation scheme, Spelling. % + * In practice, this routine is called only for x in the interval [0:pi/4]. % + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows % + * an accuracy of <= 0.5 ULP. % + */ I think testing is neither necessary nor sufficient. One just finds a polynomial that has enough accuracy in infinite precision (the poly should have about 8 more bits of precision than the number of mantissa bits), then checks that there is no magic in coefficients than causes problems (a large second coefficient will cause problems, especially if it is not exactly representable. Here 0.5 is quite large, but is exactly representable, and for sin*(), 0.333 is just small enough not to cause problems despite it not being exactly representable). % +#define C0 (long double) 0x1.000000000000000p0L % +#define C1 (long double) -0x8.000000000000000p-4L Various style bugs. The casts have no effect. % +long double % +__kernel_cosl(long double x) % +{ % + long double ys, fs; % + % + ys = x * x; % + % + fs = C0 + ys * (C1 + ys * (C2 + ys * (C3 + ys * (C4 + % + ys * (C5 + ys * (C6 + ys * (C7 + ys * (C8 + ys * C9)))))))); Better scheduling is possible. The C1 term does in fact cause problems due to it being large. I think writing the expression as (C0 + ys * C1 + ...) gives more accuracy by taking advantage of C1 being exact, but that is not enough. fdlibm __kernel cos takes extra care with the C1 term. % Index: msun/src/k_sinl.c Similarly. % Index: msun/src/k_tanl.c % =================================================================== % RCS file: msun/src/k_tanl.c % diff -N msun/src/k_tanl.c % ... % +#define C33 (long double) 0x1.8f939ed1883f595p-32L This has far too many terms for a 64-bit mantissa, but not enough for a 113-bit mantissa. IIRC, the former requires about 25 terms. % +long double % +__kernel_tanl(long double x) % +{ % + long double t, xs; % + % + xs = x * x; % + % + t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs * % + (C5 + xs * (C6 + xs * (C7 + xs * (C8 + xs * (C9 + xs * % + (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs * % + (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs * % + (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs * % + (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs * % + (C30 + xs * (C31 + xs * % + (C32 + xs * C33)))))))))))))))))))))))))))))))); With so many terms, better scheduling is more needed. Much more extra care is needed for an accuracy of < 1 ULP for tan*() than for sin*() and cos*(). % Index: msun/src/s_cosl.c % ... % +/* % + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi. % + * The table was taken from e_rem_pio2.c. % + */ % +static const int32_t two_over_pi[] = { % +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, % +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A, This gives far too many copies of this table (instead of just too many). The float and double versions avoid having a copy for each of cos/sin/tan by doing arg reduction in e_rem_pio2[f].c. The extra layer for this file is a pessimization but seems to be worth it to avoid the duplication. Are 396 hex digits enough? It can't be right for all of the float, double and long double cases. I hope it is enough for 128-bit long doubles, and the float case is just sloppy in using the same table as the double case. If 396 hex digits is enough for all cases, then the table should be in k_rem_pio2.c. % +long double % +cosl(long double x) % +{ % + union IEEEl2bits z; I want to avoid using this union... % ... % + z.e = x; % + z.bits.sign = 0; ... since, while code like this is easy to write, it hard-codes the assumption that things are layed out in bit-fields. Access macros are more flexible and don't require so much compiler intelligence to optimize away the memory accesses that are logically required for bit-fields. gcc does a mediocre jobs with this optimization, especially for 80-bit long doubles on amd64 and i386 since the bits don't fit evenly in registers. I see parametrizing the bit-field sizes and other magic numbers (not quoted) for various formats of long doubles as the main difficulty here (trig functions are easy except for this and k_rem_pio2.c). % ... % + for (z.e = 0, i = nx - 1; i >= 0; i--) % + z.e += (long double) yd[i]; Here the extra precision that is provided by k_rem_pio2.c is thrown away. The cast has no effect. Casts are not followed by a space in KNF. % + % + switch (e0 & 3) { % + case 0: % + z.e = __kernel_cosl(z.e); % + break; fdlibm __kernels pass an extra arg here to avoid loss of precision (except I changed the float kernels to pass the extra precision in a double). % Index: msun/src/s_sinl.c Similarly. % Index: msun/src/s_tanl.c Similarly. % ... + switch (e0 & 3) { + ... + case 1: + case 3: + z.e = - 1.L / __kernel_tanl(z.e); + break; The fdlibm has to do this division in the __kernel since doing it carefully takes too much code to inline here (there are two cases, one for z.e near Pi/4 where everything is done differently, and the general case where things are simple except for the division). Note that the fdlibm __kernel code for trig functions is generic except for: - my optimizations in the float case - the magic number for the threshold near Pi/4 for tan() - the number and values of terms in the polynomial, and the scheduling (best scheduling depends on the number of terms, and we don't have this automated). Thus it is easier to clone the correct code than to rewrite it incorrectly. We should investigate whether bit fiddling for the threshold and other things in __kernel tan are needed. I think the threshold calculation would be faster without bit fiddling, but bit fiddling is used for it because the bits are needed anyway. Bruce