From owner-freebsd-i386@FreeBSD.ORG  Sun Feb 20 11:10:21 2005
Return-Path: <owner-freebsd-i386@FreeBSD.ORG>
Delivered-To: freebsd-i386@hub.freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 22E2116A4CE
	for <freebsd-i386@hub.freebsd.org>;
	Sun, 20 Feb 2005 11:10:21 +0000 (GMT)
Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21])
	by mx1.FreeBSD.org (Postfix) with ESMTP id B32F343D1D
	for <freebsd-i386@hub.freebsd.org>;
	Sun, 20 Feb 2005 11:10:20 +0000 (GMT)
	(envelope-from gnats@FreeBSD.org)
Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1])
	by freefall.freebsd.org (8.13.1/8.13.1) with ESMTP id j1KBAKkE016744
	for <freebsd-i386@freefall.freebsd.org>; Sun, 20 Feb 2005 11:10:20 GMT
	(envelope-from gnats@freefall.freebsd.org)
Received: (from gnats@localhost)
	by freefall.freebsd.org (8.13.1/8.13.1/Submit) id j1KBAKNj016743;
	Sun, 20 Feb 2005 11:10:20 GMT
	(envelope-from gnats)
Date: Sun, 20 Feb 2005 11:10:20 GMT
Message-Id: <200502201110.j1KBAKNj016743@freefall.freebsd.org>
To: freebsd-i386@FreeBSD.org
From: Bruce Evans <bde@zeta.org.au>
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
X-BeenThere: freebsd-i386@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: Bruce Evans <bde@zeta.org.au>
List-Id: I386-specific issues for FreeBSD <freebsd-i386.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i386>,
	<mailto:freebsd-i386-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-i386>
List-Post: <mailto:freebsd-i386@freebsd.org>
List-Help: <mailto:freebsd-i386-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-i386>,
	<mailto:freebsd-i386-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 20 Feb 2005 11:10:21 -0000

The following reply was made to PR i386/67469; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: David Schultz <das@freebsd.org>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results
 for large inputs
Date: Sun, 20 Feb 2005 22:02:50 +1100 (EST)

 On Mon, 14 Feb 2005, Bruce Evans wrote:
 
 > It seems that the hardware trig functions aren't worth using.  I want
 > to test them on a 486 and consider the ranges more before discarding
 > them.  This may take a while.
 
 I did this as threatened.
 
 My test program now covers most cases of interest for i386's and
 amd64's.  It divides the the range being tested into a number of
 regions.  I mostly used 16, which is good for dividing the range
 [-2pi, 2pi] into quadrants.  The output clearly shows that both
 the hardware and fdlibm do arg reduction related to pi/4, with
 different tradeoffs.  E.g., for sin(), fdlibm does very well for
 the range [-pi/4, pi/4], but not so well for args outside this
 range, while hardware sin does poorly for [-pi/4, pi/4] but very
 well for [pi/4, 3pi/4], and similarly for all translations of these
 ranges by pi.  OTOH, the hardware cos's best range is [-pi/4, pi/4]
 but fdlibm cos's best range is the same as for fdlibm sin.
 
 Some results:
 - the hardware trig functions were by far the best on a 486DX2-66 for
   all ranges except near 0 where fdlibm is competetive or a little
   faster.  The relative advantage of the hardware functions decreases
   with each CPU generation.
 - the hardware inverse trig functions (all based on fpatan) were by far
   the worst for all ranges, except on old machines where they are
   competitive or a little faster.
 - fdlibm is quite good for exp and log too.
 - an Athlon64 (cpuid says 3400 but not that fast in marketed version)
   running at 1994 MHz with slow memory has almost exactly the same
   speed for the fdlibm part of the benchmark as an AthlonXP Barton
   2600 overclocked to 2293 MHz with not-so-slow memory.  Using SSE2
   apparently makes just enough difference to compensate for the
   1994/2223 clock speed ratio.  There is apparently no similar
   difference for using the A64 FPU so using it is further away from
   being an optimization on A64's.
 
 Test control script:
 
 %%%
 set -e
 arch=`uname -p`
 if [ $arch = i386 ]; then arch=i387; fi
 msun=/usr/src/lib/msun
 CFLAGS="-march=athlon-xp -fomit-frame-pointer -O2 -g -I$msun/src"
 LDFLAGS="-lm -static"
 niter=100000000.0
 stdfiles="$msun/src/e_rem_pio2.c $msun/src/k_cos.c $msun/src/k_sin.c \
     $msun/src/k_tan.c"
 
 while :
 do
 	read functype func llimit limit nregion basename ename
 	if [ -z $func ]; then exit 0; fi
 	cfile=$msun/src/$basename.c
 	sfile=$msun/$arch/$basename.S
 	if [ -f $sfile -o $func = cos -o $func = cosf \
 	    -o $func = sin -o $func = sinf ]
 	then
 		myfunc=asm$func
 		COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
 		    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
 		    -DNITER=$niter"
 		if [ -f $sfile ]
 		then
 			sed -e "s/^ENTRY($func)/ENTRY($myfunc)/" \
 			    -e "s/^ENTRY($ename)/ENTRY($myfunc)/" <$sfile >x.S
 			cc $CFLAGS $COPTS -o z z.c x.S $stdfiles $LDFLAGS
 		else
 			cc $CFLAGS $COPTS -o z z.c $LDFLAGS
 		fi
 		time ./z
 	fi
 	myfunc=fdl$func
 	COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \
 	    -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \
 	    -DNITER=$niter"
 	sed -e s/^$ename/$myfunc/ \
 	    -e s/__generic_$ename/$myfunc/ <$cfile >x.c
 	cc $CFLAGS $COPTS -o z z.c x.c $stdfiles $LDFLAGS
 	time ./z
 done
 %%%
 
 Run this using something like "sh t <tdata >to.machine".  (First edit
 the CFLAGS line to change -march to something appropriate, and the
 niter line to make the test run fast enough -- I used 1e8 on Athlons
 down to 1e6 on 486DX2-66 so that each function takes about 10 seconds.)
 
 Test data:
 
 %%%
 double acos -1.0 1.0 16 e_acos __ieee754_acos
 double asin -1.0 1.0 16 e_asin __ieee754_asin
 double atan -8.0 8.0 16 s_atan atan
 double cos -6.28 6.28 16 s_cos cos
 double exp -8.0 8.0 16 e_exp __ieee754_exp
 double log 0.0 1e32 16 e_log __ieee754_log
 double logb 0.0 1e32 16 s_logb logb
 double log10 0.0 1e32 16 e_log10 __ieee754_log10
 double sin -6.28 6.28 16 s_sin sin
 double tan -6.28 6.28 16 s_tan tan
 float cosf -6.28 6.28 16 s_cosf cosf
 float logf 0.0 1e32 16 e_logf __ieee754_logf
 float logbf 0.0 1e32 16 s_logbf logbf
 float log10f 0.0 1e32 16 e_log10f __ieee754_log10f
 float sinf -6.28 6.28 16 s_sinf sinf
 float tanf -6.28 6.28 16 s_tanf tanf
 
 # The following aren't actually implemented in asm (but e_atan2f is; it is
 # probably just as bad as e_atanf would be, but is harder to test).
 float acosf -1.0 1.0 16 e_acosf __ieee754_acosf
 float asinf -1.0 1.0 16 e_asinf __ieee754_asinf
 float atanf -8.0 8.0 16 s_atanf atanf
 float expf -8.0 8.0 16 e_expf __ieee754_expf
 %%%
 
 Test program:
 
 %%%
 #include <sys/types.h>
 #include <sys/time.h>
 #include <sys/resource.h>
 
 #ifdef HAVE_RDTSC
 #include <machine/cpufunc.h>
 #endif
 
 #include <math.h>
 #include <stdio.h>
 
 #ifndef FUNC
 #define	FUNC	sin
 #endif
 #define	FUNCNAME __XSTRING(FUNC)
 #ifndef FUNCTYPE
 #define	FUNCTYPE double
 #endif
 #ifndef LIMIT
 #define	LIMIT	(3.14159 / 4)
 #endif
 #ifndef LLIMIT
 #define	LLIMIT	0.0
 #endif
 #ifndef NITER
 #define NITER	10000000.0
 #endif
 #ifndef NREGION
 #define NREGION	16
 #endif
 
 #ifdef __amd64__
 /* Generate some asm versions since there are none in libm. */
 
 double asmcos(double);
 double asmsin(double);
 float asmcosf(float);
 float asmsinf(float);
 
 asm("asmcos: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fcos; "
     "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
 asm("asmsin: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fsin; "
     "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret");
 asm("asmcosf: movss %xmm0, -4(%rsp); flds -4(%rsp); fcos; "
     "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
 asm("asmsinf: movss %xmm0, -4(%rsp); flds -4(%rsp); fsin; "
     "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret");
 #endif /* __amd64__ */
 
 FUNCTYPE FUNC(FUNCTYPE);
 
 int
 main(void)
 {
 	struct rusage finish, start;
 	double d, limit, llimit, step;
 	long long tot[NREGION], usec[NREGION], x, y;
 	int i;
 
 	step = (LIMIT - LLIMIT) / NITER;
 	for (i = 0; i < NREGION; i++) {
 		llimit = LLIMIT + i * (LIMIT - LLIMIT) / NREGION;
 		limit = LLIMIT + (i + 1) * (LIMIT - LLIMIT) / NREGION;
 		tot[i] = 0;
 		getrusage(RUSAGE_SELF, &start);
 #ifdef HAVE_RDTSC
 		x = rdtsc();
 #endif
 		for (d = llimit; d < limit; d += step)
 			FUNC(d);
 #ifdef HAVE_RDTSC
 		y = rdtsc();
 #endif
 		getrusage(RUSAGE_SELF, &finish);
 		usec[i] = 1000000 *
 		    (finish.ru_utime.tv_sec - start.ru_utime.tv_sec +
 		    finish.ru_stime.tv_sec - start.ru_stime.tv_sec) +
 		    finish.ru_utime.tv_usec - start.ru_utime.tv_usec +
 		    finish.ru_stime.tv_usec - start.ru_stime.tv_usec;
 		tot[i] = y - x;
 	}
 	printf("%s:", FUNCNAME);
 #ifdef HAVE_RDTSC
 	printf(" cycles:nsec per call:");
 	for (i = 0; i < NREGION; i++)
 		printf(" %lld:%lld",
 		    tot[i] / (long long)(NITER / NREGION),
 		    1000 * usec[i] / (long long)(NITER / NREGION));
 #else
 	printf(" nsec per call: ");
 	for (i = 0; i < NREGION; i++)
 		printf(" %lld", 1000 * usec[i] / (long long)(NITER / NREGION));
 #endif
 	printf("\n");
 	return (0);
 }
 %%%
 
 Sample output:
 
 %%%
 to.486dx2-66:
 asmasin: nsec per call:  7569 7650 7742 7752 7665 7662 7584 7721 7409 7573 7645 7651 7723 7775 7686 7603
 fdlasin: nsec per call:  13452 13910 13967 13979 8000 7997 7995 8178 7872 7873 7994 7995 13853 13844 13808 13323
 asmsin: nsec per call:  5697 5788 5821 5642 5684 5774 5534 5557 5417 5788 5763 5558 5650 5828 5782 5701
 fdlsin: nsec per call:  11888 11906 11902 11736 11866 9405 9414 5377 5372 9250 9247 11320 11303 11435 11417 11326
 
 to.k6-233:
 asmasin: nsec per call:  1051 1076 1073 1072 1054 1041 1042 1042 1038 1037 1037 1050 1067 1069 1072 1046
 fdlasin: nsec per call:  1518 1594 1595 1594 819 819 818 818 818 819 819 819 1591 1590 1593 1514
 asmsin: nsec per call:  513 553 543 527 518 540 521 464 464 521 540 518 527 543 553 513
 fdlsin: nsec per call:  1466 1495 1493 1475 1474 1034 1034 647 647 1017 1017 1418 1419 1459 1459 1410
 
 to.axpb-2223:
 asmasin: nsec per call:  96 96 94 93 93 93 93 93 93 93 93 93 93 94 96 96
 fdlasin: nsec per call:  77 81 81 81 36 35 34 34 35 35 35 35 81 81 81 77
 asmsin: nsec per call:  60 32 33 59 59 32 32 57 56 32 32 58 58 33 32 59
 fdlsin: nsec per call:  84 91 91 85 85 69 69 32 32 68 67 79 79 86 86 79
 
 to.a64-1994:
 asmsin: nsec per call:  61 37 37 60 61 36 37 58 58 37 36 61 60 37 37 61
 fdlsin: nsec per call:  73 75 75 75 75 52 52 21 21 51 51 71 71 70 70 67
 %%%
 
 I would adjust the following due to these results:
 - delete all trig i387 float functions.  Think about deleting the exp and
   log i387 float functions.  Think about optimizing the fdlibm versions.
   They could use double or extended precision, and then they might not
   need range reduction for a much larger range (hopefully [-2pi, 2pi])
   and/or might not need a correction term for a much larger range.
 - delete all inverse trig i387 functions.
 - think about optimizing the trig fdlibm double functions until they are
   faster than the trig i387 double functions on a larger range than
   [-pi/4, pi/4].  They could use extended precision, but only only on
   some arches so there would be a negative reduction in complications
   for replacing the MD functions by "MI" ones optimized in this way.
   Does the polynomial approximation for sin start to fail between pi/4
   and pi/2, or are there other technical rasons to reduce to the current
   ranges?
 
 Bruce