From owner-freebsd-i386@FreeBSD.ORG Sun Feb 20 11:10:21 2005 Return-Path: Delivered-To: freebsd-i386@hub.freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 22E2116A4CE for ; Sun, 20 Feb 2005 11:10:21 +0000 (GMT) Received: from freefall.freebsd.org (freefall.freebsd.org [216.136.204.21]) by mx1.FreeBSD.org (Postfix) with ESMTP id B32F343D1D for ; Sun, 20 Feb 2005 11:10:20 +0000 (GMT) (envelope-from gnats@FreeBSD.org) Received: from freefall.freebsd.org (gnats@localhost [127.0.0.1]) by freefall.freebsd.org (8.13.1/8.13.1) with ESMTP id j1KBAKkE016744 for ; Sun, 20 Feb 2005 11:10:20 GMT (envelope-from gnats@freefall.freebsd.org) Received: (from gnats@localhost) by freefall.freebsd.org (8.13.1/8.13.1/Submit) id j1KBAKNj016743; Sun, 20 Feb 2005 11:10:20 GMT (envelope-from gnats) Date: Sun, 20 Feb 2005 11:10:20 GMT Message-Id: <200502201110.j1KBAKNj016743@freefall.freebsd.org> To: freebsd-i386@FreeBSD.org From: Bruce Evans Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs X-BeenThere: freebsd-i386@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list Reply-To: Bruce Evans List-Id: I386-specific issues for FreeBSD List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 20 Feb 2005 11:10:21 -0000 The following reply was made to PR i386/67469; it has been noted by GNATS. From: Bruce Evans To: David Schultz Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs Date: Sun, 20 Feb 2005 22:02:50 +1100 (EST) On Mon, 14 Feb 2005, Bruce Evans wrote: > It seems that the hardware trig functions aren't worth using. I want > to test them on a 486 and consider the ranges more before discarding > them. This may take a while. I did this as threatened. My test program now covers most cases of interest for i386's and amd64's. It divides the the range being tested into a number of regions. I mostly used 16, which is good for dividing the range [-2pi, 2pi] into quadrants. The output clearly shows that both the hardware and fdlibm do arg reduction related to pi/4, with different tradeoffs. E.g., for sin(), fdlibm does very well for the range [-pi/4, pi/4], but not so well for args outside this range, while hardware sin does poorly for [-pi/4, pi/4] but very well for [pi/4, 3pi/4], and similarly for all translations of these ranges by pi. OTOH, the hardware cos's best range is [-pi/4, pi/4] but fdlibm cos's best range is the same as for fdlibm sin. Some results: - the hardware trig functions were by far the best on a 486DX2-66 for all ranges except near 0 where fdlibm is competetive or a little faster. The relative advantage of the hardware functions decreases with each CPU generation. - the hardware inverse trig functions (all based on fpatan) were by far the worst for all ranges, except on old machines where they are competitive or a little faster. - fdlibm is quite good for exp and log too. - an Athlon64 (cpuid says 3400 but not that fast in marketed version) running at 1994 MHz with slow memory has almost exactly the same speed for the fdlibm part of the benchmark as an AthlonXP Barton 2600 overclocked to 2293 MHz with not-so-slow memory. Using SSE2 apparently makes just enough difference to compensate for the 1994/2223 clock speed ratio. There is apparently no similar difference for using the A64 FPU so using it is further away from being an optimization on A64's. Test control script: %%% set -e arch=`uname -p` if [ $arch = i386 ]; then arch=i387; fi msun=/usr/src/lib/msun CFLAGS="-march=athlon-xp -fomit-frame-pointer -O2 -g -I$msun/src" LDFLAGS="-lm -static" niter=100000000.0 stdfiles="$msun/src/e_rem_pio2.c $msun/src/k_cos.c $msun/src/k_sin.c \ $msun/src/k_tan.c" while : do read functype func llimit limit nregion basename ename if [ -z $func ]; then exit 0; fi cfile=$msun/src/$basename.c sfile=$msun/$arch/$basename.S if [ -f $sfile -o $func = cos -o $func = cosf \ -o $func = sin -o $func = sinf ] then myfunc=asm$func COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \ -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \ -DNITER=$niter" if [ -f $sfile ] then sed -e "s/^ENTRY($func)/ENTRY($myfunc)/" \ -e "s/^ENTRY($ename)/ENTRY($myfunc)/" <$sfile >x.S cc $CFLAGS $COPTS -o z z.c x.S $stdfiles $LDFLAGS else cc $CFLAGS $COPTS -o z z.c $LDFLAGS fi time ./z fi myfunc=fdl$func COPTS="-DFUNCTYPE=$functype -DFUNC=$myfunc \ -DLLIMIT=$llimit -DLIMIT=$limit -DNREGION=$nregion \ -DNITER=$niter" sed -e s/^$ename/$myfunc/ \ -e s/__generic_$ename/$myfunc/ <$cfile >x.c cc $CFLAGS $COPTS -o z z.c x.c $stdfiles $LDFLAGS time ./z done %%% Run this using something like "sh t to.machine". (First edit the CFLAGS line to change -march to something appropriate, and the niter line to make the test run fast enough -- I used 1e8 on Athlons down to 1e6 on 486DX2-66 so that each function takes about 10 seconds.) Test data: %%% double acos -1.0 1.0 16 e_acos __ieee754_acos double asin -1.0 1.0 16 e_asin __ieee754_asin double atan -8.0 8.0 16 s_atan atan double cos -6.28 6.28 16 s_cos cos double exp -8.0 8.0 16 e_exp __ieee754_exp double log 0.0 1e32 16 e_log __ieee754_log double logb 0.0 1e32 16 s_logb logb double log10 0.0 1e32 16 e_log10 __ieee754_log10 double sin -6.28 6.28 16 s_sin sin double tan -6.28 6.28 16 s_tan tan float cosf -6.28 6.28 16 s_cosf cosf float logf 0.0 1e32 16 e_logf __ieee754_logf float logbf 0.0 1e32 16 s_logbf logbf float log10f 0.0 1e32 16 e_log10f __ieee754_log10f float sinf -6.28 6.28 16 s_sinf sinf float tanf -6.28 6.28 16 s_tanf tanf # The following aren't actually implemented in asm (but e_atan2f is; it is # probably just as bad as e_atanf would be, but is harder to test). float acosf -1.0 1.0 16 e_acosf __ieee754_acosf float asinf -1.0 1.0 16 e_asinf __ieee754_asinf float atanf -8.0 8.0 16 s_atanf atanf float expf -8.0 8.0 16 e_expf __ieee754_expf %%% Test program: %%% #include #include #include #ifdef HAVE_RDTSC #include #endif #include #include #ifndef FUNC #define FUNC sin #endif #define FUNCNAME __XSTRING(FUNC) #ifndef FUNCTYPE #define FUNCTYPE double #endif #ifndef LIMIT #define LIMIT (3.14159 / 4) #endif #ifndef LLIMIT #define LLIMIT 0.0 #endif #ifndef NITER #define NITER 10000000.0 #endif #ifndef NREGION #define NREGION 16 #endif #ifdef __amd64__ /* Generate some asm versions since there are none in libm. */ double asmcos(double); double asmsin(double); float asmcosf(float); float asmsinf(float); asm("asmcos: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fcos; " "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret"); asm("asmsin: movsd %xmm0, -8(%rsp); fldl -8(%rsp); fsin; " "fstpl -8(%rsp); movsd -8(%rsp),%xmm0; ret"); asm("asmcosf: movss %xmm0, -4(%rsp); flds -4(%rsp); fcos; " "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret"); asm("asmsinf: movss %xmm0, -4(%rsp); flds -4(%rsp); fsin; " "fstps -4(%rsp); movss -4(%rsp),%xmm0; ret"); #endif /* __amd64__ */ FUNCTYPE FUNC(FUNCTYPE); int main(void) { struct rusage finish, start; double d, limit, llimit, step; long long tot[NREGION], usec[NREGION], x, y; int i; step = (LIMIT - LLIMIT) / NITER; for (i = 0; i < NREGION; i++) { llimit = LLIMIT + i * (LIMIT - LLIMIT) / NREGION; limit = LLIMIT + (i + 1) * (LIMIT - LLIMIT) / NREGION; tot[i] = 0; getrusage(RUSAGE_SELF, &start); #ifdef HAVE_RDTSC x = rdtsc(); #endif for (d = llimit; d < limit; d += step) FUNC(d); #ifdef HAVE_RDTSC y = rdtsc(); #endif getrusage(RUSAGE_SELF, &finish); usec[i] = 1000000 * (finish.ru_utime.tv_sec - start.ru_utime.tv_sec + finish.ru_stime.tv_sec - start.ru_stime.tv_sec) + finish.ru_utime.tv_usec - start.ru_utime.tv_usec + finish.ru_stime.tv_usec - start.ru_stime.tv_usec; tot[i] = y - x; } printf("%s:", FUNCNAME); #ifdef HAVE_RDTSC printf(" cycles:nsec per call:"); for (i = 0; i < NREGION; i++) printf(" %lld:%lld", tot[i] / (long long)(NITER / NREGION), 1000 * usec[i] / (long long)(NITER / NREGION)); #else printf(" nsec per call: "); for (i = 0; i < NREGION; i++) printf(" %lld", 1000 * usec[i] / (long long)(NITER / NREGION)); #endif printf("\n"); return (0); } %%% Sample output: %%% to.486dx2-66: asmasin: nsec per call: 7569 7650 7742 7752 7665 7662 7584 7721 7409 7573 7645 7651 7723 7775 7686 7603 fdlasin: nsec per call: 13452 13910 13967 13979 8000 7997 7995 8178 7872 7873 7994 7995 13853 13844 13808 13323 asmsin: nsec per call: 5697 5788 5821 5642 5684 5774 5534 5557 5417 5788 5763 5558 5650 5828 5782 5701 fdlsin: nsec per call: 11888 11906 11902 11736 11866 9405 9414 5377 5372 9250 9247 11320 11303 11435 11417 11326 to.k6-233: asmasin: nsec per call: 1051 1076 1073 1072 1054 1041 1042 1042 1038 1037 1037 1050 1067 1069 1072 1046 fdlasin: nsec per call: 1518 1594 1595 1594 819 819 818 818 818 819 819 819 1591 1590 1593 1514 asmsin: nsec per call: 513 553 543 527 518 540 521 464 464 521 540 518 527 543 553 513 fdlsin: nsec per call: 1466 1495 1493 1475 1474 1034 1034 647 647 1017 1017 1418 1419 1459 1459 1410 to.axpb-2223: asmasin: nsec per call: 96 96 94 93 93 93 93 93 93 93 93 93 93 94 96 96 fdlasin: nsec per call: 77 81 81 81 36 35 34 34 35 35 35 35 81 81 81 77 asmsin: nsec per call: 60 32 33 59 59 32 32 57 56 32 32 58 58 33 32 59 fdlsin: nsec per call: 84 91 91 85 85 69 69 32 32 68 67 79 79 86 86 79 to.a64-1994: asmsin: nsec per call: 61 37 37 60 61 36 37 58 58 37 36 61 60 37 37 61 fdlsin: nsec per call: 73 75 75 75 75 52 52 21 21 51 51 71 71 70 70 67 %%% I would adjust the following due to these results: - delete all trig i387 float functions. Think about deleting the exp and log i387 float functions. Think about optimizing the fdlibm versions. They could use double or extended precision, and then they might not need range reduction for a much larger range (hopefully [-2pi, 2pi]) and/or might not need a correction term for a much larger range. - delete all inverse trig i387 functions. - think about optimizing the trig fdlibm double functions until they are faster than the trig i387 double functions on a larger range than [-pi/4, pi/4]. They could use extended precision, but only only on some arches so there would be a negative reduction in complications for replacing the MD functions by "MI" ones optimized in this way. Does the polynomial approximation for sin start to fail between pi/4 and pi/2, or are there other technical rasons to reduce to the current ranges? Bruce