Date: Thu, 10 Feb 2005 07:30:18 GMT
From: David Schultz <das@freebsd.org>
To: freebsd-i386@FreeBSD.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Message-ID: <200502100730.j1A7UI7v063663@freefall.freebsd.org>
The following reply was made to PR i386/67469; it has been noted by GNATS.

From: David Schultz <das@freebsd.org>
To: Bruce Evans <bde@zeta.org.au>
Cc: FreeBSD-gnats-submit@freebsd.org, freebsd-i386@freebsd.org,
    bde@freebsd.org
Subject: Re: i386/67469: src/lib/msun/i387/s_tan.S gives incorrect results for large inputs
Date: Thu, 10 Feb 2005 02:23:14 -0500
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050209232758.F3249@epsplex.bde.org>

On Thu, Feb 10, 2005, Bruce Evans wrote:
> > I used the following sets
> > of inputs:
> >
> > tbl1: small numbers
> > ...
> > tbl2: numbers on [-8pi,8pi] greater in magnitude than 2^-18
> > ...
> > tbl3: large numbers
> > ...
> > tbl4: special cases
>
> This data may be too unusual.  Maybe the NaNs are slower.  Denormals
> would probably be slower.

The data in tbl2 are pretty usual, I think, and I measured all of
the data points independently.  But yes, NaNs are slower, as the
results for tbl4 indicate.

Looking back, though, I did notice that very few of my inputs in
tbl2 require argument reduction.  In your tests on [0..10], on the
other hand, 92% of the inputs require argument reduction in
fdlibm.  It would be interesting to see for which of your tests
fdlibm is faster, and for which it is slower.  One possibility is
that fdlibm is slower most of the time; another is that it is far
slower for the close-to-pi/2 cases that the i387 gets wrong, and
that messes up the averages.

> The synchronising cpuid here is responsible for a factor of 3 difference
> for me.  Moving the rdtsc out of the loop gives the following changes
> in cycle counts:
>
> 2000 -> [944..1420]
> 1000 -> 431
> 400 -> 132
>
> Each rdtsc() in the loop costs 75 cycles for tbl1, and actually using
> the results costs another 120 cycles.
>
> I think the cpuid is disturbing the timings too much.

I don't care so much about the rdtsc overhead since I'm only
measuring relative performance.  A null function is measured as
taking 388 cycles on my Pentium 4, but some of that is due to gcc
getting confused by the volatile variable and generating extra
code at -O0.

However, it is true that I am basically measuring latency and not
throughput.  Ordinarily, it is possible to execute FPU and CPU
instructions simultaneously, and the FPU may even have more than
one FU available for executing fptan.  The cpuid instructions
clear out the pipeline and destroy any parallelism that might have
been possible.  Your version does a better job of measuring
throughput.  You're also right that fdlibm tan() blows out about
512 bytes of instruction cache.

Anyway, I unfortunately don't have time for all this.  Do you want
the assembly versions of these to stay or not?  If so, it would be
great if you could fix them and make sure that the result isn't
obviously slower than fdlibm.  If not, I'll be happy to spend two
minutes making all those pesky bugs in them go away.  ;-)