From owner-freebsd-numerics@FreeBSD.ORG Fri Mar 13 11:14:05 2015 Return-Path: Delivered-To: freebsd-numerics@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTPS id 660A47F9 for ; Fri, 13 Mar 2015 11:14:05 +0000 (UTC) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 2665A88A for ; Fri, 13 Mar 2015 11:14:04 +0000 (UTC) Received: from c211-30-166-197.carlnfd1.nsw.optusnet.com.au (c211-30-166-197.carlnfd1.nsw.optusnet.com.au [211.30.166.197]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 55FD13C6020; Fri, 13 Mar 2015 21:46:17 +1100 (AEDT) Date: Fri, 13 Mar 2015 21:46:14 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: enh Subject: Re: libm functions sensitive to current rounding mode In-Reply-To: Message-ID: <20150313192625.B885@besplex.bde.org> References: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=ZuzUdbLG c=1 sm=1 tr=0 a=KA6XNC2GZCFrdESI5ZmdjQ==:117 a=PO7r1zJSAAAA:8 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=Oh2cFVv5AAAA:8 a=CCpqsmhAAAAA:8 a=UWV0wBGDR1em69PW5csA:9 a=CjuIK1q_8ugA:10 Cc: freebsd-numerics@freebsd.org X-BeenThere: freebsd-numerics@freebsd.org X-Mailman-Version: 2.1.18-1 Precedence: list List-Id: "Discussions of high quality implementation of libm functions." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 13 Mar 2015 11:14:05 -0000 On Thu, 12 Mar 2015, enh via freebsd-numerics wrote: > various SoC vendors have been sending us (Android) assembler > implementations of various libm functions, and we noticed when looking > at one change that it was slower on certain inputs than the FreeBSD C > implementation we were already using. the response was that the cost > was coping with other rounding modes, and they gave this example: This seems to be due to bad fixes. Supporting rounding in the current rounding mode is only very useful if the rounding is perfect, but perfect rounding is difficult for complicated functions. Switching the rounding mode is incredibly slow on x86 (it takes as long or longer than sin()), and doing it usually gives worse results. The result _should_ depend on the rounding mode. It is specified to do so for sqrt* and probably for many other simple functions. This has FreeBSD has only been tested to do perfect rounding in all rounding modes for the following functions: ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc* ceil*, fabs*, floor*, logb*, rint*, round*, significan*, sqrt*, trunc* Implementing this for the software sqrt* makes sqrt* unusably slow, but most arches use hardware sqrt*. > #include > #include > #include > > int main(){ > double x = -0xb.6365e22ee46fp4; > fesetround(FE_UPWARD); > printf("sin(%f) = %.16e\n", x, sin(x)); > fesetround(FE_DOWNWARD); > printf("sin(%f) = %.16e\n", x, sin(x)); > > x = 0x1.921fb54442d16p0; > fesetround(FE_UPWARD); > printf("sin(%f) = %.16e\n", x, sin(x)); > fesetround(FE_DOWNWARD); > printf("sin(%f) = %.16e\n", x, sin(x)); > return 0; > } > > see https://android-review.googlesource.com/112700 for the original > change and related conversation. I couldn't read this with my old browsers. > glibc 2.19: > > sin(-182.212373) = -2.4759225463534308e-18 > sin(-182.212374) = -2.4759225463534309e-18 > sin(1.570797) = 1.0000000000000000e+00 > sin(1.570796) = 1.0000000000000000e+00 sin() is a good-bad example. If you "optimize" it by writing it in asm, then you get enormous errors (multiple tera-ulps) even near small even multiples of Pi/2. The 1-ulp difference made by the rounding mode is in the noise. Actual results for i387 sin() on i386, with a middle result for FE_TONEAREST added: sin(-182.212373) = -2.7105054312137606e-18 sin(-182.212374) = -2.7105054312137611e-18 sin(-182.212374) = -2.7105054312137611e-18 sin(1.570797) = 1.0000000000000000e+00 sin(1.570796) = 1.0000000000000000e+00 sin(1.570796) = 9.9999999999999988e-01 Results should be printed in %a format so that the bits can be seen. i387 values: sin(-182.212373) = -0x1.8ffffffffffffp-59 sin(-182.212374) = -0x1.9p-59 sin(-182.212374) = -0x1.9p-59 sin(1.570797) = 0x1p+0 sin(1.570796) = 0x1p+0 sin(1.570796) = 0x1.fffffffffffffp-1 We see that the rounding mode indeed makes a 1-bit difference. software values on i386 (almost the same as below): sin(-182.212373) = -2.4759225463534304e-18 sin(-182.212374) = -2.4759225463534308e-18 sin(-182.212374) = 2.2575413760606011e-11 sin(1.570797) = 1.0000000000000000e+00 sin(1.570796) = 1.0000000000000000e+00 sin(1.570796) = 1.0000000002522273e+00 sin(-182.212373) = -0x1.6d61b58c99c42p-59 sin(-182.212374) = -0x1.6d61b58c99c43p-59 sin(-182.212374) = 0x1.8d26ap-36 sin(1.570797) = 0x1p+0 sin(1.570796) = 0x1p+0 sin(1.570796) = 0x1.000000011553bp+0 We see much more from this: - somehow TONEAREST gives almost the same results as UPWARDS. It appears to work correctly at -182.212373 in the FreeBSD software case only - enormous errors at -182.212373 from the i387, as predicted (the software result is believed to be correct, while the i387 result shows evidence of canceling all except the top 5 bits). All bits except the top bit are wrong, so the error is about 2**52 = 4 Pulps (1 Pulp = 1 peta-ulp). DOWNWARDS makes a 1 ulp adjustment to this. - glibc apparently uses software, since it is correct in all cases, except it apparently forces a common rounding for UPWARDS and NEAREST and thus is certain to be off by 1 ulp for 1 of them. - the 1.57079x are was obviously chosen to be as close as possible to Pi/2 (it should be printed in %a format too, so that we can see its bits (these must be more accurate than shown by the decimal format) and see that the bits don't depend on the rounding mode). Since sin() has a maximum at this value, and the infinitely-precise result at exactly Pi/2 is exactly 1, the approximate internal result must be slightly less than 1, and UPWARDS and TONEAREST should round it to 1, while DOWNWARDS should round it to 1 ulp less than 1. Hardware sin does this. glibc sin apparently does extra work to round in the wrong direction... - ... but FreeBSD software sin has an enormous error for DOWNWARDS on 1.570796 -- it rounds in the wrong direction by 0x11553 ulps :-(. About 1 Mulp (1 Mulp = 1 mega-ulp). > (glibc's fixed bugs here in the past. see > https://sourceware.org/bugzilla/show_bug.cgi?id=3976 for example.) This is interesting. Vincent L certainly knows what he is doing. As stated in the thread, the the library can do anything it wants in non-default rounding modes provided it documents this. It apparently uses the bad fix of changing the rounding mode to FE_TONEAREST. This is slow and guarantees an error of 1 ulp in most cases, so to avoid much larger errors in hopefully most cases. Long ago, I all of the most common functions for such errors, exhaustively for float functions, and didn't find any of more than an ulp or 2. I might have only done this on i386. i386 protects against such errors in float precision by evaluating intermediate results in higher precision. In non-FreeBSD, the default rounding precision is not double, so i386 should also protect agains such errors in double precision. Special cases like multiples of Pi/2 are too hard to find without special tests except in float precision. > the FreeBSD libm: > > sin(-182.212374) = -2.4759225463534304e-18 > sin(-182.212374) = 2.2575413760606011e-11 > sin(1.570796) = 1.0000000000000000e+00 > sin(1.570796) = 1.0000000002522276e+00 Verified above. > it looks like e_sqrtl.c, s_fmal.c, s_fma.c, and s_fmaf.c save/restore > the rounding mode but nothing else does. This makes these functions incredibly slow. > let me know if you'd like me to send trivial patches for any of the > affected functions. (do all the libm functions need this?) This cannot be fixed using trivial patches. Almost all transcendental functions are probably affected. Actually, it is possible to fix many long double functions without making them more than twice as slow in the usual case where the rounding mode is not the default. The ENTERI() macros does a conditional mode switch to change the rounding precision from 53 bits to 64 bits, so that long doubles work properly. It avoids the switch when it would be null. (This reminds me that modern x86 does the same in hardware for at least switching the rounding mode, so unconditional switching of the rounding mode might not be so slow when the switch is null. Also, fpset*() is pessimized in space and time to avoid traps, while feset*() never bothered with this. To switch the rounding mode, you should use fesetround(), but fesetprec() doesn't exist so ENTERI() has to use fpsetprec().) You could try adding unconditional rounding mode switches to ENTERI() to test the long double case. However, testing long doubles is harder. exp() is mentioned in the glibc thread. It has some details of interest in FreeBSD. On i386 only, it written in "optimized" asm that is broken as follows: X /* X * Extended precision is needed to reduce the maximum error from X * hundreds of ulps to less than 1 ulp. Switch to it if necessary. X * We may as well set the rounding mode to to-nearest and mask traps X * if we switch. X */ I thought that this rounding mode switch was a good idea when I wrote this 20+ years ago. Now I know better, and haven't used this version for 3 years. The software version never did this. I now use the software version on i386 as well as on amd64, after optimizing it a bit so that it is a little more accurate than this asm (hardware i387) version. The software version has been faster for many years on x86. The precision mode switch here is part of what makes it slow. glibc probably assumes that the default rounding precision is 64 bits, so that it doesn't need this mode switch. It turns out that 64 bits is just enough to reduce the error in double precision to below 1 ulp. expl() cannot be written like this in asm, since there is no mode switch to get more than long double precision. Not switching would give the same large errors in long double precision that this mode switch fixes for double precision. The rounding mode switch accidentally "fixes" any algorithm instability in i386 exp() in the same way as in glibc. But I think the extra precision gives enough stablity. Bruce