From owner-freebsd-standards@FreeBSD.ORG  Mon Oct 29 11:07:16 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@hub.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 710F216A46E
	for <freebsd-standards@hub.freebsd.org>;
	Mon, 29 Oct 2007 11:07:13 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 3C6BE13C48A
	for <freebsd-standards@hub.freebsd.org>;
	Mon, 29 Oct 2007 11:07:13 +0000 (UTC)
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: from freefall.freebsd.org (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.1/8.14.1) with ESMTP id l9TB7DBQ090266
	for <freebsd-standards@FreeBSD.org>; Mon, 29 Oct 2007 11:07:13 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.14.1/8.14.1/Submit) id l9TB7CN4090262
	for freebsd-standards@FreeBSD.org; Mon, 29 Oct 2007 11:07:12 GMT
	(envelope-from owner-bugmaster@FreeBSD.org)
Date: Mon, 29 Oct 2007 11:07:12 GMT
Message-Id: <200710291107.l9TB7CN4090262@freefall.freebsd.org>
X-Authentication-Warning: freefall.freebsd.org: gnats set sender to
	owner-bugmaster@FreeBSD.org using -f
From: FreeBSD bugmaster <bugmaster@FreeBSD.org>
To: freebsd-standards@FreeBSD.org
Cc: 
Subject: Current problem reports assigned to freebsd-standards@FreeBSD.org
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Oct 2007 11:07:17 -0000

Current FreeBSD problem reports
Critical problems
Serious problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o bin/25542    standards  /bin/sh: null char in quoted string
o kern/46239   standards  posix semaphore implementation errors
o stand/54410  standards  one-true-awk not POSIX compliant (no extended REs)
o stand/82654  standards  C99 long double math functions are missing
o stand/94729  standards  [libc] fcntl() throws undocumented ENOTTY

5 problems total.

Non-critical problems

S Tracker      Resp.      Description
--------------------------------------------------------------------------------
o bin/21519    standards  sys/dir.h should be deprecated some more
o bin/24390    standards  Replacing old dir-symlinks when using /bin/ln
s stand/24590  standards  timezone function not compatible witn Single Unix Spec
s kern/28260   standards  UIO_MAXIOV needs to be made public
s stand/36076  standards  Implementation of POSIX fuser command
o stand/39256  standards  snprintf/vsnprintf aren't POSIX-conformant for strings
p stand/41576  standards  POSIX compliance of ln(1)
o stand/44425  standards  getcwd() succeeds even if current dir has perm 000.
o stand/46119  standards  Priority problems for SCHED_OTHER using pthreads
o stand/54833  standards  [pcvt] more pcvt deficits
o stand/54839  standards  [pcvt] pcvt deficits
p stand/55112  standards  glob.h, glob_t's gl_pathc should be "size_t", not "int
o stand/56476  standards  cd9660 unicode support simple hack
o stand/58676  standards  grantpt(3) alters storage used by ptsname(3)
s stand/62858  standards  malloc(0) not C99 compliant
s kern/64875   standards  [libc] [patch] [feature request] add a system call: fd
o stand/66357  standards  make POSIX conformance problem ('sh -e' & '+' command-
o stand/66531  standards  _gettemp uses a far smaller set of filenames than docu
o stand/70813  standards  [PATCH] ls(1) not Posix compliant
o stand/72006  standards  floating point formating in non-C locales
o stand/79056  standards  regex(3) regression tests
a stand/80293  standards  sysconf() does not support well-defined unistd values
o stand/81287  standards  [PATCH]: fingerd(8) might send a line not ending in CR
o stand/83845  standards  [libm] [patch] add log2() and log2f() support for libm
o stand/85080  standards  output of long double subnormals (with printf) is wron
o stand/92360  standards  [headers] [patch] Missing TAB3 in kernel headers
o stand/92362  standards  [headers] [patch] Missing SIGPOLL in kernel headers
o kern/93705   standards  [headers] [patch] ENODATA and EGREGIOUS (for glibc com
o stand/96016  standards  [headers] clock_getres et al should be in <time.h>
o stand/96236  standards  [PATCH] [POSIX] sed.1 incorrectly describes a function
p stand/99517  standards  Missing SIGRTMIN and SIGRTMAX signals
o stand/99960  standards  [Patch] make(1): Add -p flag
o stand/100017 standards  [Patch] Add fuser(1) functionality to fstat(1)
o stand/104743 standards  [headers] [patch] Wrong values for _POSIX_ minimal lim
o stand/104841 standards  [libm] [patch] C99 long double square root.
o stand/107561 standards  [patch] Missing SUS function tcgetsid()
o kern/114578  standards  [libc] wide character printing using swprintf(dst, n, 
o stand/114633 standards  /etc/rc.subr: line 511: omits a quotation mark: "force
o stand/116081 standards  make does not work with the directive sinclude
o stand/116221 standards  SUS issue -- FreeBSD has not flag WNOWAIT for wait*() 
o stand/116826 standards  [PATCH] sh support for POSIX character classes

41 problems total.


From owner-freebsd-standards@FreeBSD.ORG  Sat Nov  3 00:13:53 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id CDC4916A421
	for <freebsd-standards@freebsd.org>;
	Sat,  3 Nov 2007 00:13:53 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
	[128.208.78.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 9AB9813C491
	for <freebsd-standards@freebsd.org>;
	Sat,  3 Nov 2007 00:13:53 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu
	[127.0.0.1])
	by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id
	lA30D6io083003 for <freebsd-standards@freebsd.org>;
	Fri, 2 Nov 2007 17:13:06 -0700 (PDT)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
	by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id
	lA30D5Ht083002
	for freebsd-standards@freebsd.org; Fri, 2 Nov 2007 17:13:05 -0700 (PDT)
	(envelope-from sgk)
Date: Fri, 2 Nov 2007 17:13:05 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: freebsd-standards@freebsd.org
Message-ID: <20071103001305.GA82775@troutmask.apl.washington.edu>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="RnlQjJ0d97Da+TV1"
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Subject: Implementation of expl()
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Nov 2007 00:13:53 -0000


--RnlQjJ0d97Da+TV1
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

With all the success of developing several other missing
C99 math routines, I decided to tackle expl() (which I
need to tackle tanhl()).  As with the other functions, 
this implementation is known to work on amd64 and should
be valid on other 64-bit architectures.

As with the previous function, I can only do a limited
amount of statistical testing of expl().  The reference
values in testing are computing using GMP/MPFR with
265 bits of precision.  The reference value is then 
converted to a long double for comparison to my expl().

Testing shows:
Special cases:
  exp(0)   = 1e+00
  exp(inf) = inf
  exp(-inf)= 0e+00
  exp(nan) = nan

With x in the non-exceptional range of [ln(LDBL_MIN):ln(LDBL_MAX)],
I computed 22.7 million values of expl() that span the range.  The
maximum observed ULP was 1 where only 7378 values (0.0325%) had a
ULP exceeding 0.5.

In the exceptional range leading to gradual underflow, testing was
much more difficult.  Computing a ULP seemed troublesome because of
the loss of precision as the subnormal numbers approach zero.  So,
I broke the range into 4 segments and compute the relative maximum
error in expl() compared to a value computed by GMP/MPFR. 

x in [-11366.00:-11355.14]
  RME > 1e-18 --> (73524/1086300) = 6.77%   
  RME > 1e-10 --> (69314/1086300) = 6.38%
  RME > 1e-8  --> (69314/1086300) = 6.38%
  RME > 1e-4  --> (69314/1086300) = 6.38%

x in [-11377.00:-11366.00]
  RME > 1e-18 --> (2/1100000) = 0.00%

x in [-11388.00:-11377.00]
  RME > 1e-18 --> (0/1100000) = 0.00%

x in [-11399.50:-11388.00]
  RME > 1e-18 --> (156152/1149852) = 13.58%
  RME > 1e-10 --> (156152/1149852) = 13.58%
  RME > 1e-8  --> (156152/1149852) = 13.58%
  RME > 1e-4  --> (136328/1149852) = 11.86%

The above suggests that in the middle on exceptional range, my
expl() gives an accurate anwer.  In the last segment, the RME
significantly increases and is attributed to the loss of precision.

Although I'm not completely satisfied with the gradual underflow
behavior, I'm providing the code for comments and possible inclusion
into src/msun.  Note, a diff against cvs is not provided because of
the entanglement of my other patches to math.h, Symbol.map, and 
Makefile.

Enjoy.

-- 
Steve

--RnlQjJ0d97Da+TV1--

From owner-freebsd-standards@FreeBSD.ORG  Sat Nov  3 05:15:46 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4992A16A418
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 05:15:46 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail35.syd.optusnet.com.au (mail35.syd.optusnet.com.au
	[211.29.133.51])
	by mx1.freebsd.org (Postfix) with ESMTP id E1DF413C491
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 05:15:45 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au
	[211.30.219.213])
	by mail35.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	lA35FRMG008246
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 3 Nov 2007 16:15:28 +1100
Date: Sat, 3 Nov 2007 16:15:27 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
In-Reply-To: <20071012180959.GA36345@troutmask.apl.washington.edu>
Message-ID: <20071103155721.K671@besplex.bde.org>
References: <20071012180959.GA36345@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-standards@FreeBSD.org
Subject: Re: [PATCH] hypotl, cabsl, and code removal in cabs
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Nov 2007 05:15:46 -0000


On Fri, 12 Oct 2007, Steve Kargl wrote:

> The attached patch implements hypotl(), cabsl(), and removes
> z_abs() from w_cabs.c.  z_abs() is undocumented namespace
> pollution in libm, and no header file under /usr/include
> provides a prototype.  The documentation has been updated,
> and in particular an incorrect paragraph in cabs.3 has been
> removed.

I fear that that paragraph is correct again, since the long double
version only seems to take enough care with the overflow cases.

% Index: man/hypot.3
% ...
% @@ -82,11 +90,6 @@ Consequently
%  exactly;
%  in general, hypot and cabs return an integer whenever an
%  integer might be expected.
% -.Pp
% -The same cannot be said for the shorter and faster version of hypot
% -and cabs that is provided in the comments in cabs.c; its error can
% -exceed 1.2
% -.Em ulps .

It's silly that on i386's, naive double precision code would work if
the default precision were extended, but the double precision code
isn't naive; OTOH, long double precision code can never depend on extra
precision, but is naive.

% Index: src/e_hypotl.c
% ...
% +	if (n >= m) {
% +		a.e *= a.e;
% +		m -= n;
% +		if (m > - PREC) {
% +			b.bits.exp += m;
% +			a.e += b.e * b.e;
% +		}
% +		a.e = sqrtl(a.e);
% +		a.bits.exp += n;
% +		return (a.e);

The usual case reduces to just sqrtl(x*x + y*y).

fdlibm is more careful in this case (but similar in others):

%  *	So, compute sqrt(x*x+y*y) with some care as 
%  *	follows to get the error below 1 ulp:
%  *
%  *	Assume x>y>0;
%  *	(if possible, set rounding to round-to-nearest)
%  *	1. if x > 2y  use
%  *		x1*x1+(y*y+(x2*(x+x1))) for x*x+y*y
%  *	where x1 = x with lower 32 bits cleared, x2 = x-x1; else
%  *	2. if x <= 2y use
%  *		t1*y1+((x-y)*(x-y)+(t1*y2+t2*y))
%  *	where t1 = 2x with lower 32 bits cleared, t2 = 2x-t1, 
%  *	y1= y with lower 32 bits chopped, y2 = y-y1.

I don't understand the details of this, and couldn't find any cases
where this makes a difference.  Testing of 2-arg functions is hard
since exhaustive testing is impossible even for floats, and the
cases where this matters for hypot() seem to be sparse.

Bruce

From owner-freebsd-standards@FreeBSD.ORG  Sat Nov  3 20:14:09 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 3631616A418
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 20:14:09 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
	[128.208.78.105])
	by mx1.freebsd.org (Postfix) with ESMTP id 164E813C480
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 20:14:09 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu
	[127.0.0.1])
	by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id
	lA3JxqCl091988; Sat, 3 Nov 2007 12:59:52 -0700 (PDT)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
	by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id
	lA3Jxqve091987; Sat, 3 Nov 2007 12:59:52 -0700 (PDT)
	(envelope-from sgk)
Date: Sat, 3 Nov 2007 12:59:52 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Message-ID: <20071103195952.GA91682@troutmask.apl.washington.edu>
References: <20071012180959.GA36345@troutmask.apl.washington.edu>
	<20071103155721.K671@besplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20071103155721.K671@besplex.bde.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-standards@FreeBSD.org
Subject: Re: [PATCH] hypotl, cabsl, and code removal in cabs
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Nov 2007 20:14:09 -0000

On Sat, Nov 03, 2007 at 04:15:27PM +1100, Bruce Evans wrote:
> 
> On Fri, 12 Oct 2007, Steve Kargl wrote:
> 
> >The attached patch implements hypotl(), cabsl(), and removes
> >z_abs() from w_cabs.c.  z_abs() is undocumented namespace
> >pollution in libm, and no header file under /usr/include
> >provides a prototype.  The documentation has been updated,
> >and in particular an incorrect paragraph in cabs.3 has been
> >removed.
> 
> I fear that that paragraph is correct again, since the long double
> version only seems to take enough care with the overflow cases.

Ah, no.  Underflow is prevented by the if() statement. 

		a.e *= a.e;
		m -= n;
		if (m > - PREC) {
			b.bits.exp += m;
			a.e += b.e * b.e;
		}
		a.e = sqrtl(a.e);
		a.bits.exp += n;
		return (a.e);

If m < -PREC, then |b|/|a| < 2**(-32), so it skips the "a.e += b.e * b.e;"
"a.e. *= a.e" can't underflow because a.e is in [0.5,1), and 
"a.e = sqrtl(a.e);" can't underflow because, here, a.e is in [0.25,2).

> % Index: man/hypot.3
> % ...
> % @@ -82,11 +90,6 @@ Consequently
> %  exactly;
> %  in general, hypot and cabs return an integer whenever an
> %  integer might be expected.
> % -.Pp
> % -The same cannot be said for the shorter and faster version of hypot
> % -and cabs that is provided in the comments in cabs.c; its error can
> % -exceed 1.2
> % -.Em ulps .

This paragraph is useless.  There is no cabs.c in the source tree,
and in particular, there is no cabs.c in an installed system.  Surely,
this paragraph isn't referring to src/contrib/libf2c/libF77/cabs.c.
It has one comment with a total of one word.

> It's silly that on i386's, naive double precision code would work if
> the default precision were extended, but the double precision code
> isn't naive; OTOH, long double precision code can never depend on extra
> precision, but is naive.

I don't understand to what the above comment is directed.

> % Index: src/e_hypotl.c
> % ...
> % +	if (n >= m) {
> % +		a.e *= a.e;
> % +		m -= n;
> % +		if (m > - PREC) {
> % +			b.bits.exp += m;
> % +			a.e += b.e * b.e;
> % +		}
> % +		a.e = sqrtl(a.e);
> % +		a.bits.exp += n;
> % +		return (a.e);
> 
> The usual case reduces to just sqrtl(x*x + y*y).

See my comment above.  The usual case isn't the important case.
It's the unusual case that is of concern.  The above code prevents
the unusual, yet provides acceptable performance in the usual case.

> fdlibm is more careful in this case (but similar in others):
> 
> %  *	So, compute sqrt(x*x+y*y) with some care as 
> %  *	follows to get the error below 1 ulp:
> %  *
> %  *	Assume x>y>0;
> %  *	(if possible, set rounding to round-to-nearest)
> %  *	1. if x > 2y  use
> %  *		x1*x1+(y*y+(x2*(x+x1))) for x*x+y*y
> %  *	where x1 = x with lower 32 bits cleared, x2 = x-x1; else
> %  *	2. if x <= 2y use
> %  *		t1*y1+((x-y)*(x-y)+(t1*y2+t2*y))
> %  *	where t1 = 2x with lower 32 bits cleared, t2 = 2x-t1, 
> %  *	y1= y with lower 32 bits chopped, y2 = y-y1.
> 
> I don't understand the details of this, and couldn't find any cases
> where this makes a difference.  Testing of 2-arg functions is hard
> since exhaustive testing is impossible even for floats, and the
> cases where this matters for hypot() seem to be sparse.

The above seems more complicated than needed when one considers
that x and y can be easily split into |x|=a*2**n and |y|=b*2**m
where f,g are in [0.5,1).   One then has the comment that is
found in the code:

 * Compute hypotl(x, y) = sqrt(x**2 + y**2) such that underflow and 
 * overflow are avoided.  To accomplish the task, write x = a * 2**n
 * and y = b * 2**m, one then has
 *
 *   hypotl(x, y) = 2**n * sqrt(a**2 + b**(2*(m - n))) for n >= m
 * or
 *   hypotl(x, y) = 2**m * sqrt(a**(2*(n - m)) + b**2) for n < m
 * 
 * where a and b are in [0.5, 1).  If (m - n) is less than -32 (for
 * a 64-bit significand), then b**(2*(m - n)) can be ignored.

-- 
Steve

From owner-freebsd-standards@FreeBSD.ORG  Sat Nov  3 21:31:45 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id E2F8D16A418
	for <freebsd-standards@freebsd.org>;
	Sat,  3 Nov 2007 21:31:45 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (troutmask.apl.washington.edu
	[128.208.78.105])
	by mx1.freebsd.org (Postfix) with ESMTP id BD44513C4A5
	for <freebsd-standards@freebsd.org>;
	Sat,  3 Nov 2007 21:31:45 +0000 (UTC)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: from troutmask.apl.washington.edu (localhost.apl.washington.edu
	[127.0.0.1])
	by troutmask.apl.washington.edu (8.14.1/8.14.1) with ESMTP id
	lA3LU02s092922; Sat, 3 Nov 2007 14:30:00 -0700 (PDT)
	(envelope-from sgk@troutmask.apl.washington.edu)
Received: (from sgk@localhost)
	by troutmask.apl.washington.edu (8.14.1/8.14.1/Submit) id
	lA3LU035092921; Sat, 3 Nov 2007 14:30:00 -0700 (PDT)
	(envelope-from sgk)
Date: Sat, 3 Nov 2007 14:30:00 -0700
From: Steve Kargl <sgk@troutmask.apl.washington.edu>
To: Bruce Evans <brde@optusnet.com.au>
Message-ID: <20071103213000.GA92030@troutmask.apl.washington.edu>
References: <20070928152227.GA39233@troutmask.apl.washington.edu>
	<20071001173736.U1985@besplex.bde.org>
	<20071002001154.GA3782@troutmask.apl.washington.edu>
	<20071002172317.GA95181@VARK.MIT.EDU>
	<20071002173237.GA12586@troutmask.apl.washington.edu>
	<20071003103519.X14175@delplex.bde.org>
	<20071010204249.GA7446@troutmask.apl.washington.edu>
	<20071103144151.U671@besplex.bde.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20071103144151.U671@besplex.bde.org>
User-Agent: Mutt/1.4.2.3i
Cc: freebsd-standards@freebsd.org
Subject: Re: long double broken on i386?
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Nov 2007 21:31:46 -0000

On Sat, Nov 03, 2007 at 03:53:19PM +1100, Bruce Evans wrote:
> On Wed, 10 Oct 2007, Steve Kargl wrote:
> >I've given up on i386.  All of my attempts to make my C implementations
> >work on both amd64 and i386 have failed.  The code when run on i386
> >results in ULP's that exceed 10^5.  Yes, that is 1E5.  Results for amd64
> >are reported below.  As far as assembly, someone else will need to write
> >those routines because I do not know assembly.
> 
> Bad errors are expressed in GULPs (Giga-ULPs) :-).  Maybe you are hitting
> the compiler bugs that break double-precision arithmetic on i386's when
> the rounding precision is changed to long double.  The rounding precision
> must be changed for long doubles to work at all.  Then the compiler bugs
> probably affect even long doubles, since you use k_rem_pio2.c and it uses
> long doubles and is believed to be affected by the bugs.  k_rem_pio2f.c
> _was_ affected by the corresponding bugs from float precision being smaller
> than the default double precision, and has some fixes, but the fixes may
> be incomplete and have not been merged to k_rem_pio2.c.

I did some limiting testing of the output from k_rem_pio2.c
against argument reduction using MPFR.  I found no long
double values that gave results which did not agree.  I fairly
certain that the GULPs are do to improper polynomial approximations
for the 53-bit precision long doubles on i386.

> >>(4): These functions are useless unless the default is changed.  Users
> >>have always been able to get equivalent _in_accuracy on i386 by just
> >>using the double precision versions.  Hacks like #define sinl sin
> >>almost work for just getting things to compile.
> >
> >AFAIK, C99 says the prototype in math.h is "long double sinl(long 
> >double);".
> >The #define hack simply won't cut it.
> 
> I only said that it will almost work.  It will give almost
> "long double sinl(double);", since "double sin(double);" gives almost
> "long double sin(double);".

Of course, if a programmer adds the #define to her code, then I
can agree with you.  My point is that the #define is unacceptable
for /usr/include/math.h.  C99 is explicit about the prototype.

> >>In fact, on i386, due
> >>to the bugfeatures outlined in my previous mail, users can even get
> >>full long double accuracy using hacks like that, even if the rounding
> >>precision is 53 bits:
> >
> >(code removed)
> >
> >>On i386, this stores the full long double result for sinl(1) in x, due
> >>to the bugfeatures -- sin(1) returns extra precision in a register, and
> >>gcc doesn't understand the extra precision so it just stores all bits
> >>of the register in x.  The behaviour would be similar if sin(1) were used
> >>in an expression.
> >
> >Given GCC's history, relying on the accidental full long double
> >accuracy because FP registers are 80 bits seems to be folly.  It
> 
> Isn't 20 years of history of doing it like this (slightly wrong) enough?

Probably.  Unfortunately, I'm also aware of the last 15 years of 
GCC history where things can suddenly change :)

> I only said it is a hack.  However, the library can depend on the
> current compiler behaviour since it is tied to the compiler.
> 
> >also isn't clear to me what sin(x) will do with x > DBL_MAX.  Will
> >1e400L be interpreted as +Inf or the long double value 1e400?  Yes,
> >I'm fully aware that such a large argument may not be the greatest
> >idea due to argument reduction.
> 
> sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double
> like the prototype says.  Since the i386 ABI requires passing doubles
> on the stack, any extra precision and argument range will be lost.
> In particular, values > DBL_MAX will be converted to +Inf (except I'm
> not sure if rounding down of a value just slightly larger than DBL_MAX
> can give DBL_MAX).
> 
> >>>PS: There is also the minor possibility of giving bde either
> >>>a heartache or death via laughter when he finally sees what I
> >>>have done. :)
> >>
> >>I would be unhappy with any versions that aren't either:
> >>- as identical as possible in algorithm and style to the double precision
> >>  versions, in the same way that the float versions are as identical as
> >>  possible, or
> >>- much better in algorithm and style and efficiency and accuracy than the
> >>  double precision versions.  I don't want to maintain different versions.
> >
> >I'll go with your second option see comments below and attached
> >code.  BTW, if I add
> 
> So I'm unhappy that the first point hasn't been addressed at all.  This
> gives quality < the current float and double precision versions.

You're moving the goal.  You gave me an "either/or" option.  I took
the "or", so the first point is irrelevant.  Well, your observation
on quality is subjective because I'd claim my (almost) KNF commented
code is much better than the fdlibm code.

> >#ifdef sinl
> >#undef sinl
> >#endif
> >#define sinl sin
> >
> >to the top of my sin_test.c code, I get ULPs that exceed 10^9.
> 
> I guess this is from loss of exponent range for args < DBL_EPSILON.
> Gradual underflow (denormals) only reduces this loss a little.  After
> underflow to 0, I think the loss is about 2^<max difference in exponents>
> ~= 2^(16384-1024) ULPs = many GULPs.  Between DBL_EPSILON and DBL_MAX,
> the loss is about 2^<difference in precision> = 2^11.  Between DBL_MAX
> and LDBL_MAX, the loss is infinity ULPs.

I admit that I haven't studied the issue too carefully.

> >Back to sinl, cosl, tanl.  The attached diff contains everything
> >that one needs for these functions on an architecture with 64-bit
> >significand long double.  Note that portions of the diff for Makefile
> >and math.h contain irrelevant changes that are only relevant for
> >my ccosh, sqrtl, and cbrtl changes.  Also note that the Makefile only
> >includes my long double routines when LDBL_PREC == 64.  My testing on
> >FreeBSD-amd64 shows
> >
> >troutmask:sgk[219] ./sin_test 4000000 10 400000000
> >     Number of test loops to run: 10
> >Number of PRN in [0,4.000000e+08] per run: 4000000
> > ULP
> >1.0000 (5)
> >1.0000 (4)
> >1.0000 (6)
> >1.0000 (4)
> >1.0000 (3)
> >1.0000 (7)
> >1.0000 (3)
> >1.0000 (7)
> >1.0000 (5)
> >1.0000 (5)
> >      ---
> >       49
> >
> >The number in parenthesis is the count of values that give a sinl(x)
> >that exceeds 0.5 ULP.  On the reduced range of [0,pi/4] where
> >__kernel_cosl and __kernel_sinl are used, in all my testing no value
> >of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP.  Thus, I
> >suspect (but haven't tried to confirm), the 1.0 ULP observed above is
> >due to the argument reduction.
> 
> The argument reduction seems to begin correctly, but then the extra
> precision which is carefully provided by k_rem_pio2.c is thrown away,
> unlike in the float and double versions which pass it to the __kernel
> functions.  I would expect a loss of up to a full ULP from this, giving
> a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg
> reduction).

I had __kernel_cosl and __kernel_sinl that retained both a high
and low portion of the reduced arguments, but in the range tested
I did not find any advantage to doing the extra precision arithmetic.
These may have been discarded during my i386 GULP experience, so
I revisted the issue on amd64.

> >My timings on the basic routines using gprof are:
> >
> > %   cumulative   self              self     total
> >time   seconds   seconds    calls  ms/call  ms/call  name
> > 0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
> > 0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
> > 0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]
> >
> >where the system contains a dual-cpu opteron tyan motherboard running
> >at 2.2 GHz.
> 
> That seems to be about 175 nS/call.  Not too bad.  (My current times on
> a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for
> the C double version, 43 nS for the asm double version, and 15 nS for
> the C float version).

Well, I've been more concerned about correctness over speed.  I'd
need to read the grof output again, but I recall that argument
reduction was the major time consumer.

> >...
> >The testing of tanl() is a little more interesting.
> >
> >troutmask:sgk[227] ./tan_test 4000000 10 400000000
> >     Number of test loops to run: 10
> >Number of PRN in [0,4.000000e+08] per run: 4000000
> > ULP
> >1.0000 (24788)
> >1.0000 (24633)
> >1.5000 (24626)
> >1.0000 (24820)
> >1.0000 (24652)
> >1.5000 (24526)
> >1.0000 (24749)
> >1.0000 (24873)
> >1.0000 (24656)
> >1.0000 (24701)
> >      -------
> >       247024
> >
> > 0.5     911.88     5.80 40000000     0.00     0.00  tanl [16]
> > 0.5     933.18     5.01 40000000     0.00     0.00  __kernel_tanl [53]
> >
> >At first glance, the 1.5 ULP and the total count of values of x
> >that generate a result with a ULP that exceeds 0.5 seems disconcerting.
> >However, testing of __kernel_tanl() shows that for all tested values of x
> >in [0,pi/4] the maximum ULP did exceed 0.5.  This suggests that tanl() is
> >more sensitive to errors in the argument reduction than sinl() and cosl().
> 
> This is because tan() has a larger derivative than sin() and cos().
> E.g., tan'(Pi/4) = 2; this doubles any errors in arg reduction near
> Pi/2.  The fdlibm __kernels take extra care near Pi/4 to reduce these
> errors.  Also, between Pi/2 and Pi/2, the division in tan(x) = -1 /
> tan(Pi/2) gives an extra error of up to 0.5 ULPs.  The fdlibm __kernels
> take extra care to reduce this error.  With 3 or 4 sources of errors
> in the final steps, with each error of up to 0.5 ULPs, it takes lots
> of extra care to keep the final error below 1 ULP.

OK.  I'll revisit fdlibm's algorithms to see if I can adapt it
to my code.

> >Do with the code as you see fit.  I'll be moving on to hypothl and
> >other functions as I have time.
> 
> I wouldn't use it directly since it doesn't copy fdlibm to get the final
> steps right.  Its earlier steps are better, but I would prefer those to
> copy fdlibm too, so that there aren't many different algorithms to maintain.
> Note that the full long double precision versions require about the same
> amount of code as the float and double precision versions combined, since
> there are 2 very different long double precision formats on supported
> machines (80 active bits for most, but 128 bits for sparc64).
> 
> % Index: msun/src/k_cosl.c
> % ===================================================================
> % RCS file: msun/src/k_cosl.c
> % diff -N msun/src/k_cosl.c
> % --- /dev/null	1 Jan 1970 00:00:00 -0000
> % +++ msun/src/k_cosl.c	9 Oct 2007 21:52:32 -0000
> % @@ -0,0 +1,62 @@
> % +/*
> % + * Compute cos(x) on the interval [-1:1] with a polynomial approximation.
> 
> Interval is larger than necessary.  This might result in unnecessary
> coefficients.  10 of them seem to be too many for 64 bits of mantissa
> but not enough for 113 bits (sparc64).

Yes, I know.  My original Chebyshev interpolation code assumed a function
f(x) is in [-1,1], which of course makes things easy.  The code has been
updated to allow f(x) in [a,b].  I plan to update the coefficients for
the restricted range.
> 
> % + * The soefficients are determined from a Chebyshev interpolation scheme,
> 
> Spelling.
> 
> % + * In practice, this routine is called only for x in the interval 
> [0:pi/4].
> % + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows
> % + * an accuracy of <= 0.5 ULP.
> % + */
> 
> I think testing is neither necessary nor sufficient.  One just finds a
> polynomial that has enough accuracy in infinite precision (the poly
> should have about 8 more bits of precision than the number of mantissa
> bits), then checks that there is no magic in coefficients than causes
> problems (a large second coefficient will cause problems, especially
> if it is not exactly representable.  Here 0.5 is quite large, but is
> exactly representable, and for sin*(), 0.333 is just small enough not
> to cause problems despite it not being exactly representable).
> 
> % +#define C0  (long double)  0x1.000000000000000p0L
> % +#define C1  (long double) -0x8.000000000000000p-4L
> 
> Various style bugs.  The casts have no effect.

Whoops, sorry about that.  This is a holdover from my attempts
at double-double precision arithmetic.  

> 
> % Index: msun/src/k_tanl.c
> % ===================================================================
> % RCS file: msun/src/k_tanl.c
> % diff -N msun/src/k_tanl.c
> % ...
> % +#define C33  (long double)  0x1.8f939ed1883f595p-32L
> 
> This has far too many terms for a 64-bit mantissa, but not enough for a
> 113-bit mantissa.  IIRC, the former requires about 25 terms.

This is partly due to the range being larger than required.

> % +long double
> % +__kernel_tanl(long double x)
> % +{
> % +	long double t, xs;
> % + 
> % +	xs = x * x;
> % +
> % +	t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs *
> % +	    (C5  + xs * (C6  + xs * (C7  + xs * (C8  + xs * (C9 + xs *
> % +	    (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs *
> % +	    (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs *
> % +	    (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs *
> % +	    (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs *
> % +	    (C30 + xs * (C31 + xs *
> % +	    (C32 + xs * C33))))))))))))))))))))))))))))))));
> 
> With so many terms, better scheduling is more needed.
> 
> Much more extra care is needed for an accuracy of < 1 ULP for tan*() than
> for sin*() and cos*().

Well, I did code both steps 3 and 4 from k_tan.c, but I saw no
statistical improvement in my testing.

> % Index: msun/src/s_cosl.c
> % ...
> % +/*
> % + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi.
> % + * The table was taken from e_rem_pio2.c.
> % + */
> % +static const int32_t two_over_pi[] = {
> % +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, 
> % +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A,
> 
> This gives far too many copies of this table (instead of just too many).
> The float and double versions avoid having a copy for each of cos/sin/tan
> by doing arg reduction in e_rem_pio2[f].c.  The extra layer for this file
> is a pessimization but seems to be worth it to avoid the duplication.
> 
> Are 396 hex digits enough?  It can't be right for all of the float, double
> and long double cases.  I hope it is enough for 128-bit long doubles, and
> the float case is just sloppy in using the same table as the double case.
> If 396 hex digits is enough for all cases, then the table should be in
> k_rem_pio2.c.

In my testing, 396 hex digits appear to be sufficient for 64-bit 
long doubles.  Without access to 128-bit long double, I won't be
investigating the issue.

> % +long double
> % +cosl(long double x)
> % +{
> % +	union IEEEl2bits z;
> 
> I want to avoid using this union...
>
> % +	z.e = x;
> % +	z.bits.sign = 0;
> 
> ... since, while code like this is easy to write, it hard-codes the
> assumption that things are layed out in bit-fields.  Access macros
> are more flexible and don't require so much compiler intelligence
> to optimize away the memory accesses that are logically required for
> bit-fields.  gcc does a mediocre jobs with this optimization,
> especially for 80-bit long doubles on amd64 and i386 since the bits
> don't fit evenly in registers.

math_private.h doesn't have GET_* or SET_* macros for use with
64-bit long double (not to mention 128-bit long double), and
the macros in math_private.h use similar unions.  AFAICT, all
supported archs have an _fpmath.h header.

> I see parametrizing the bit-field sizes and other magic numbers (not
> quoted) for various formats of long doubles as the main difficulty
> here (trig functions are easy except for this and k_rem_pio2.c).

Which explains why GET_* and SET_* don't exist.  Although these
might be present in bdeBSD. :)

> % Index: msun/src/s_tanl.c
> 
> Similarly.
> 
> % ...
> +	switch (e0 & 3) {
> + ...
> +	case 1:
> +	case 3:
> +	    z.e = - 1.L / __kernel_tanl(z.e);
> +	    break;
> 
> The fdlibm has to do this division in the __kernel since doing it carefully
> takes too much code to inline here (there are two cases, one for z.e near
> Pi/4 where everything is done differently, and the general case where
> things are simple except for the division).
> 
> Note that the fdlibm __kernel code for trig functions is generic except for:
> 
> - my optimizations in the float case
> - the magic number for the threshold near Pi/4 for tan()
> - the number and values of terms in the polynomial, and the scheduling
>   (best scheduling depends on the number of terms, and we don't have this
>   automated).
> 
> Thus it is easier to clone the correct code than to rewrite it incorrectly.

It may be easier to clone fdlibm, but I'm not sure I'd call it correct
code due to
  - magic numbers
  - questionable source formatting (it some of the worse code I've read)
  - lack of comments
-- 
Steve

From owner-freebsd-standards@FreeBSD.ORG  Sat Nov  3 22:55:21 2007
Return-Path: <owner-freebsd-standards@FreeBSD.ORG>
Delivered-To: freebsd-standards@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id BB01116A418
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 22:55:21 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from fallbackmx02.syd.optusnet.com.au
	(fallbackmx02.syd.optusnet.com.au [211.29.133.72])
	by mx1.freebsd.org (Postfix) with ESMTP id 170BE13C4AA
	for <freebsd-standards@FreeBSD.org>;
	Sat,  3 Nov 2007 22:55:20 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au
	[211.29.132.191])
	by fallbackmx02.syd.optusnet.com.au (8.12.11.20060308/8.12.11) with
	ESMTP id lA34rdNC017032
	for <freebsd-standards@freebsd.org>; Sat, 3 Nov 2007 15:53:39 +1100
Received: from besplex.bde.org (c211-30-219-213.carlnfd3.nsw.optusnet.com.au
	[211.30.219.213])
	by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	lA34rJJS017331
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sat, 3 Nov 2007 15:53:22 +1100
Date: Sat, 3 Nov 2007 15:53:19 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Steve Kargl <sgk@troutmask.apl.washington.edu>
In-Reply-To: <20071010204249.GA7446@troutmask.apl.washington.edu>
Message-ID: <20071103144151.U671@besplex.bde.org>
References: <20070928152227.GA39233@troutmask.apl.washington.edu>
	<20071001173736.U1985@besplex.bde.org>
	<20071002001154.GA3782@troutmask.apl.washington.edu>
	<20071002172317.GA95181@VARK.MIT.EDU>
	<20071002173237.GA12586@troutmask.apl.washington.edu>
	<20071003103519.X14175@delplex.bde.org>
	<20071010204249.GA7446@troutmask.apl.washington.edu>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-standards@FreeBSD.org
Subject: Re: long double broken on i386?
X-BeenThere: freebsd-standards@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Standards compliance <freebsd-standards.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-standards>
List-Post: <mailto:freebsd-standards@freebsd.org>
List-Help: <mailto:freebsd-standards-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-standards>, 
	<mailto:freebsd-standards-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 03 Nov 2007 22:55:21 -0000

This thread is old, so I quoted too much.

On Wed, 10 Oct 2007, Steve Kargl wrote:

> On Wed, Oct 03, 2007 at 11:13:38AM +1000, Bruce Evans wrote:
>> On Tue, 2 Oct 2007, Steve Kargl wrote:
>>
>>> On Tue, Oct 02, 2007 at 01:23:18PM -0400, David Schultz wrote:
>>>> ...
>>>> Anyway, my point is that if you have something that works
>>>> reasonably well and doesn't have egregious errors, my suggestion
>>>> is to just commit it and not kill yourself over whether the
>>>> argument reduction is correct in the last ulp.
>>
>>> There are a few problems: 1) I don't have commit access.  2) I
>>> need to do style(9) clean-up pass.  3) I've only tested these
>>> on i386 and amd64, and I know these fail for sparc64. 4) Most
>>> importantly, I don't know how the project wants to handle the
>>> 53 vs 64 bit default fpu setting on i386.
>>
>> (3) is most important.  The C versions will never even be used for
>> i386 or amd64, since these arches have sin and cos in hardware and (I
>> believe) it is impossible to beat the hardware on these arches.  The
>> asm versions have always been trivial to implement (change 2 bytes in
>> s_sin.S (compare e_sqrt.S with e_sqrtf.S; the long double changes go
>> the other way).  amd64 actually uses the C versions for double precision
>> becase the C versions are competitive with the hardware for speed and
>> beat it for accuracy on large args and on small args aear a multiple
>> of pi/2.
>
> I've given up on i386.  All of my attempts to make my C implementations
> work on both amd64 and i386 have failed.  The code when run on i386
> results in ULP's that exceed 10^5.  Yes, that is 1E5.  Results for amd64
> are reported below.  As far as assembly, someone else will need to write
> those routines because I do not know assembly.

Bad errors are expressed in GULPs (Giga-ULPs) :-).  Maybe you are hitting
the compiler bugs that break double-precision arithmetic on i386's when
the rounding precision is changed to long double.  The rounding precision
must be changed for long doubles to work at all.  Then the compiler bugs
probably affect even long doubles, since you use k_rem_pio2.c and it uses
long doubles and is believed to be affected by the bugs.  k_rem_pio2f.c
_was_ affected by the corresponding bugs from float precision being smaller
than the default double precision, and has some fixes, but the fixes may
be incomplete and have not been merged to k_rem_pio2.c.

>> (4): These functions are useless unless the default is changed.  Users
>> have always been able to get equivalent _in_accuracy on i386 by just
>> using the double precision versions.  Hacks like #define sinl sin
>> almost work for just getting things to compile.
>
> AFAIK, C99 says the prototype in math.h is "long double sinl(long double);".
> The #define hack simply won't cut it.

I only said that it will almost work.  It will give almost
"long double sinl(double);", since "double sin(double);" gives almost
"long double sin(double);".

>> In fact, on i386, due
>> to the bugfeatures outlined in my previous mail, users can even get
>> full long double accuracy using hacks like that, even if the rounding
>> precision is 53 bits:
>
> (code removed)
>
>> On i386, this stores the full long double result for sinl(1) in x, due
>> to the bugfeatures -- sin(1) returns extra precision in a register, and
>> gcc doesn't understand the extra precision so it just stores all bits
>> of the register in x.  The behaviour would be similar if sin(1) were used
>> in an expression.
>
> Given GCC's history, relying on the accidental full long double
> accuracy because FP registers are 80 bits seems to be folly.  It

Isn't 20 years of history of doing it like this (slightly wrong) enough?
I only said it is a hack.  However, the library can depend on the
current compiler behaviour since it is tied to the compiler.

> also isn't clear to me what sin(x) will do with x > DBL_MAX.  Will
> 1e400L be interpreted as +Inf or the long double value 1e400?  Yes,
> I'm fully aware that such a large argument may not be the greatest
> idea due to argument reduction.

sin(x) (and sinl(x) if sinl is #defined as sin) must pass x as a double
like the prototype says.  Since the i386 ABI requires passing doubles
on the stack, any extra precision and argument range will be lost.
In particular, values > DBL_MAX will be converted to +Inf (except I'm
not sure if rounding down of a value just slightly larger than DBL_MAX
can give DBL_MAX).

>>> PS: There is also the minor possibility of giving bde either
>>> a heartache or death via laughter when he finally sees what I
>>> have done. :)
>>
>> I would be unhappy with any versions that aren't either:
>> - as identical as possible in algorithm and style to the double precision
>>   versions, in the same way that the float versions are as identical as
>>   possible, or
>> - much better in algorithm and style and efficiency and accuracy than the
>>   double precision versions.  I don't want to maintain different versions.
>
> I'll go with your second option see comments below and attached
> code.  BTW, if I add

So I'm unhappy that the first point hasn't been addressed at all.  This
gives quality < the current float and double precision versions.

> #ifdef sinl
> #undef sinl
> #endif
> #define sinl sin
>
> to the top of my sin_test.c code, I get ULPs that exceed 10^9.

I guess this is from loss of exponent range for args < DBL_EPSILON.
Gradual underflow (denormals) only reduces this loss a little.  After
underflow to 0, I think the loss is about 2^<max difference in exponents>
~= 2^(16384-1024) ULPs = many GULPs.  Between DBL_EPSILON and DBL_MAX,
the loss is about 2^<difference in precision> = 2^11.  Between DBL_MAX
and LDBL_MAX, the loss is infinity ULPs.

> Back to sinl, cosl, tanl.  The attached diff contains everything
> that one needs for these functions on an architecture with 64-bit
> significand long double.  Note that portions of the diff for Makefile
> and math.h contain irrelevant changes that are only relevant for
> my ccosh, sqrtl, and cbrtl changes.  Also note that the Makefile only
> includes my long double routines when LDBL_PREC == 64.  My testing on
> FreeBSD-amd64 shows
>
> troutmask:sgk[219] ./sin_test 4000000 10 400000000
>      Number of test loops to run: 10
> Number of PRN in [0,4.000000e+08] per run: 4000000
>  ULP
> 1.0000 (5)
> 1.0000 (4)
> 1.0000 (6)
> 1.0000 (4)
> 1.0000 (3)
> 1.0000 (7)
> 1.0000 (3)
> 1.0000 (7)
> 1.0000 (5)
> 1.0000 (5)
>       ---
>        49
>
> The number in parenthesis is the count of values that give a sinl(x)
> that exceeds 0.5 ULP.  On the reduced range of [0,pi/4] where
> __kernel_cosl and __kernel_sinl are used, in all my testing no value
> of x in [0,pi/4] generated a sinl(x) that exceeded 0.5 ULP.  Thus, I
> suspect (but haven't tried to confirm), the 1.0 ULP observed above is
> due to the argument reduction.

The argument reduction seems to begin correctly, but then the extra
precision which is carefully provided by k_rem_pio2.c is thrown away,
unlike in the float and double versions which pass it to the __kernel
functions.  I would expect a loss of up to a full ULP from this, giving
a maximum error of 1.5 ULPs (0.5 from the __kernel and 1 from the arg
reduction).

> My timings on the basic routines using gprof are:
>
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
>  0.6     804.42     5.67 40000000     0.00     0.00  sinl [18]
>  0.1     934.41     0.74 19998751     0.00     0.00  __kernel_sinl [95]
>  0.1     935.82     0.66 20001249     0.00     0.00  __kernel_cosl [96]
>
> where the system contains a dual-cpu opteron tyan motherboard running
> at 2.2 GHz.

That seems to be about 175 nS/call.  Not too bad.  (My current times on
a 2.2 GHz A64*1 in i386 mode on small args ([-2Pi,2Pi]) are 70 nS for
the C double version, 43 nS for the asm double version, and 15 nS for
the C float version).

> ...
> The testing of tanl() is a little more interesting.
>
> troutmask:sgk[227] ./tan_test 4000000 10 400000000
>      Number of test loops to run: 10
> Number of PRN in [0,4.000000e+08] per run: 4000000
>  ULP
> 1.0000 (24788)
> 1.0000 (24633)
> 1.5000 (24626)
> 1.0000 (24820)
> 1.0000 (24652)
> 1.5000 (24526)
> 1.0000 (24749)
> 1.0000 (24873)
> 1.0000 (24656)
> 1.0000 (24701)
>       -------
>        247024
>
>  0.5     911.88     5.80 40000000     0.00     0.00  tanl [16]
>  0.5     933.18     5.01 40000000     0.00     0.00  __kernel_tanl [53]
>
> At first glance, the 1.5 ULP and the total count of values of x
> that generate a result with a ULP that exceeds 0.5 seems disconcerting.
> However, testing of __kernel_tanl() shows that for all tested values of x
> in [0,pi/4] the maximum ULP did exceed 0.5.  This suggests that tanl() is
> more sensitive to errors in the argument reduction than sinl() and cosl().

This is because tan() has a larger derivative than sin() and cos().
E.g., tan'(Pi/4) = 2; this doubles any errors in arg reduction near
Pi/2.  The fdlibm __kernels take extra care near Pi/4 to reduce these
errors.  Also, between Pi/2 and Pi/2, the division in tan(x) = -1 /
tan(Pi/2) gives an extra error of up to 0.5 ULPs.  The fdlibm __kernels
take extra care to reduce this error.  With 3 or 4 sources of errors
in the final steps, with each error of up to 0.5 ULPs, it takes lots
of extra care to keep the final error below 1 ULP.

> Do with the code as you see fit.  I'll be moving on to hypothl and
> other functions as I have time.

I wouldn't use it directly since it doesn't copy fdlibm to get the final
steps right.  Its earlier steps are better, but I would prefer those to
copy fdlibm too, so that there aren't many different algorithms to maintain.
Note that the full long double precision versions require about the same
amount of code as the float and double precision versions combined, since
there are 2 very different long double precision formats on supported
machines (80 active bits for most, but 128 bits for sparc64).

% Index: msun/src/k_cosl.c
% ===================================================================
% RCS file: msun/src/k_cosl.c
% diff -N msun/src/k_cosl.c
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ msun/src/k_cosl.c	9 Oct 2007 21:52:32 -0000
% @@ -0,0 +1,62 @@
% +/*
% + * Compute cos(x) on the interval [-1:1] with a polynomial approximation.

Interval is larger than necessary.  This might result in unnecessary
coefficients.  10 of them seem to be too many for 64 bits of mantissa
but not enough for 113 bits (sparc64).

% + * The soefficients are determined from a Chebyshev interpolation scheme,

Spelling.

% + * In practice, this routine is called only for x in the interval [0:pi/4].
% + * Limited testing on pseudorandom numbers drawn within [0:pi/4] shows
% + * an accuracy of <= 0.5 ULP.
% + */

I think testing is neither necessary nor sufficient.  One just finds a
polynomial that has enough accuracy in infinite precision (the poly
should have about 8 more bits of precision than the number of mantissa
bits), then checks that there is no magic in coefficients than causes
problems (a large second coefficient will cause problems, especially
if it is not exactly representable.  Here 0.5 is quite large, but is
exactly representable, and for sin*(), 0.333 is just small enough not
to cause problems despite it not being exactly representable).

% +#define C0  (long double)  0x1.000000000000000p0L
% +#define C1  (long double) -0x8.000000000000000p-4L

Various style bugs.  The casts have no effect.

% +long double
% +__kernel_cosl(long double x)
% +{
% +	long double ys, fs;
% +
% +	ys = x * x;
% +
% +	fs = C0 + ys * (C1 + ys * (C2 + ys * (C3 + ys * (C4 +
% +	    ys * (C5 + ys * (C6 + ys * (C7 + ys * (C8 + ys * C9))))))));

Better scheduling is possible.

The C1 term does in fact cause problems due to it being large.  I
think writing the expression as (C0 + ys * C1 + ...) gives more
accuracy by taking advantage of C1 being exact, but that is not enough.
fdlibm __kernel cos takes extra care with the C1 term.

% Index: msun/src/k_sinl.c

Similarly.

% Index: msun/src/k_tanl.c
% ===================================================================
% RCS file: msun/src/k_tanl.c
% diff -N msun/src/k_tanl.c
% ...
% +#define C33  (long double)  0x1.8f939ed1883f595p-32L

This has far too many terms for a 64-bit mantissa, but not enough for a
113-bit mantissa.  IIRC, the former requires about 25 terms.

% +long double
% +__kernel_tanl(long double x)
% +{
% +	long double t, xs;
% + 
% +	xs = x * x;
% +
% +	t = C0 + xs * (C1 + xs * (C2 + xs * (C3 + xs * (C4 + xs *
% +	    (C5  + xs * (C6  + xs * (C7  + xs * (C8  + xs * (C9 + xs *
% +	    (C10 + xs * (C11 + xs * (C12 + xs * (C13 + xs * (C14 + xs *
% +	    (C15 + xs * (C16 + xs * (C17 + xs * (C18 + xs * (C19 + xs *
% +	    (C20 + xs * (C21 + xs * (C22 + xs * (C23 + xs * (C24 + xs *
% +	    (C25 + xs * (C26 + xs * (C27 + xs * (C28 + xs * (C29 + xs *
% +	    (C30 + xs * (C31 + xs *
% +	    (C32 + xs * C33))))))))))))))))))))))))))))))));

With so many terms, better scheduling is more needed.

Much more extra care is needed for an accuracy of < 1 ULP for tan*() than
for sin*() and cos*().

% Index: msun/src/s_cosl.c
% ...
% +/*
% + * Table of constants for 2/pi, 396 Hex digits (476 decimal) of 2/pi.
% + * The table was taken from e_rem_pio2.c.
% + */
% +static const int32_t two_over_pi[] = {
% +0xA2F983, 0x6E4E44, 0x1529FC, 0x2757D1, 0xF534DD, 0xC0DB62, 
% +0x95993C, 0x439041, 0xFE5163, 0xABDEBB, 0xC561B7, 0x246E3A,

This gives far too many copies of this table (instead of just too many).
The float and double versions avoid having a copy for each of cos/sin/tan
by doing arg reduction in e_rem_pio2[f].c.  The extra layer for this file
is a pessimization but seems to be worth it to avoid the duplication.

Are 396 hex digits enough?  It can't be right for all of the float, double
and long double cases.  I hope it is enough for 128-bit long doubles, and
the float case is just sloppy in using the same table as the double case.
If 396 hex digits is enough for all cases, then the table should be in
k_rem_pio2.c.

% +long double
% +cosl(long double x)
% +{
% +	union IEEEl2bits z;

I want to avoid using this union...

% ...
% +	z.e = x;
% +	z.bits.sign = 0;

... since, while code like this is easy to write, it hard-codes the
assumption that things are layed out in bit-fields.  Access macros
are more flexible and don't require so much compiler intelligence
to optimize away the memory accesses that are logically required for
bit-fields.  gcc does a mediocre jobs with this optimization,
especially for 80-bit long doubles on amd64 and i386 since the bits
don't fit evenly in registers.

I see parametrizing the bit-field sizes and other magic numbers (not
quoted) for various formats of long doubles as the main difficulty
here (trig functions are easy except for this and k_rem_pio2.c).

% ...
% +	for (z.e = 0, i = nx - 1; i >= 0; i--)
% +		z.e += (long double) yd[i];

Here the extra precision that is provided by k_rem_pio2.c is thrown away.

The cast has no effect.

Casts are not followed by a space in KNF.

% +
% +	switch (e0 & 3) {
% +	case 0:
% +	    z.e = __kernel_cosl(z.e);
% +	    break;

fdlibm __kernels pass an extra arg here to avoid loss of precision (except
I changed the float kernels to pass the extra precision in a double).

% Index: msun/src/s_sinl.c

Similarly.

% Index: msun/src/s_tanl.c

Similarly.

% ...
+	switch (e0 & 3) {
+ ...
+	case 1:
+	case 3:
+	    z.e = - 1.L / __kernel_tanl(z.e);
+	    break;

The fdlibm has to do this division in the __kernel since doing it carefully
takes too much code to inline here (there are two cases, one for z.e near
Pi/4 where everything is done differently, and the general case where
things are simple except for the division).

Note that the fdlibm __kernel code for trig functions is generic except for:

- my optimizations in the float case
- the magic number for the threshold near Pi/4 for tan()
- the number and values of terms in the polynomial, and the scheduling
   (best scheduling depends on the number of terms, and we don't have this
   automated).

Thus it is easier to clone the correct code than to rewrite it incorrectly.

We should investigate whether bit fiddling for the threshold and other
things in __kernel tan are needed.  I think the threshold calculation
would be faster without bit fiddling, but bit fiddling is used for it
because the bits are needed anyway.

Bruce