Date: Tue, 10 Sep 2019 15:19:37 -0000
From: Bruce Evans <brde@optusnet.com.au>
To: Stefan Kanthak <stefan.kanthak@nexgo.de>
Cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Shorter releng/12.0/lib/msun/i387/e_exp.S and releng/12.0/lib/msun/i387/s_finite.S
Message-ID: <20190910230930.Q1373@besplex.bde.org>
In-Reply-To: <769CF9CBA0A34DFA92C739C970FA2AAF@H270>
References: <769CF9CBA0A34DFA92C739C970FA2AAF@H270>
On Sun, 8 Sep 2019, Stefan Kanthak wrote:

I recently got diagnosed as having serious medical problems and am not
sure if I care about this...

> here's a patch to remove a conditional branch (and more) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/e_exp.S
> plus a patch to shave some bytes (immediate operands) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_finite.S

Anyway, don't bother with these functions.  They should never have been
written in asm and should go away.  Improving the mod and remainder
functions is more useful and difficult, since they are in asm on amd64
too and there seems to be no better way to implement them on all x86
than to use the i387, but they are still slow.

> --- -/releng/12.0/lib/msun/i387/e_exp.S
> +++ +/releng/12.0/lib/msun/i387/e_exp.S

This went away in my version in 2012 or 2013 together with implementing
the long double hyperbolic functions.  My version uses the same algorithm
in all precisions for the hyperbolic functions, but only the long double
version was committed (in 2013).  The uncommitted parts are faster and
more accurate.  The same methods work relatively trivially for exp() and
expf(), except they are insignificantly faster than the better C version
after improving the accuracy of that to be slightly worse than the asm
version.  I gave up on plans to use the same algorithm in all precisions
for exp*().  The long double version is too sophisticated to be fast,
after developments in x86 CPUs and compilers made the old Sun C versions
fast.

Summary of implementations of exp*() on x86:

- expf(): use the same C version on amd64 and i386 (Cygnus translation of
  the Sun version with some FreeBSD optimizations).  This is fast and is
  currently a little less accurate than it should be.

- exp(): use the C version on amd64 (Sun version with some FreeBSD
  optimizations).  This is fast and is currently a little less accurate
  than it should be.  Use the asm version on i386.  This is slow since it
  switches the rounding precision.  It needs the 11 extra bits of
  precision to barely deliver a double precision result to within 1 ulp.

> @@ -45,7 +45,25 @@
>      movl    8(%esp),%eax
> -    andl    $0x7fffffff,%eax
> -    cmpl    $0x7ff00000,%eax
> -    jae     x_Inf_or_NaN
> +    leal    (%eax+%eax),%edx
> +    cmpl    $0xffe00000,%edx

This removes 1 instruction and 1 dependency, not a branch.  Seems
reasonable.  I would try to do it all in %eax.  Check what compilers do
for the C version of finite(), where this check is clearer and easier to
optimize (see below).  All of this can be written in C with about 1 line
of inline asm, and then compilers can generate better code.

> +    jb      finite

This seems to pessimize the branch logic in all cases (as would be done
in C by getting __predict_mumble() backwards).  The branches were
carefully optimized (hopefully not backwards) for the i386 and i486, and
this happens to be best for later CPUs too.  Taken branches are slower on
old CPUs, so the code was arranged to not branch in the usual (finite)
case.  Newer CPUs only use static branch prediction for the first branch,
so the branch organization rarely matters except in large code (not like
here) where moving the unusual case far away is good for caching.  The
static prediction is usually that the first forward branch is not taken
while the first backward branch is taken.  So the forward branch to the
non-finite case was accidentally correct.
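To make "write it in C" concrete, an untested sketch of the dispatch is
below.  It uses msun's GET_HIGH_WORD()/GET_LOW_WORD() and FreeBSD's
__predict_false(); the helper name exp_special_cases() and the final
return are placeholders (not the committed code), and the actual
evaluation would still need the i387 (about 1 line of inline asm):

/*
 * Untested sketch: classify the argument in C and keep the finite path
 * fall-through, which is the branch layout the asm tries to get.
 */
#include <sys/cdefs.h>          /* __predict_false() */
#include <sys/types.h>
#include "math_private.h"       /* GET_HIGH_WORD(), GET_LOW_WORD() */

static double
exp_special_cases(double x)
{
        u_int32_t hx, lx;

        GET_HIGH_WORD(hx, x);
        if (__predict_false((hx & 0x7fffffff) >= 0x7ff00000)) {
                /* x is +Inf, -Inf or NaN. */
                GET_LOW_WORD(lx, x);
                if (hx == 0xfff00000 && lx == 0)
                        return (0.0);   /* exp(-Inf) = +0 */
                return (x + x);         /* exp(+Inf) = +Inf; quiet NaNs */
        }
        return (x);     /* placeholder for the real polynomial/i387 code */
}

With __predict_false() the compiler lays out the finite case as the
fall-through path, which is the branch organization described above.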
>
> +    /*
> +     * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
> +     * this gives Inf, and when x is a NaN this gives the same result
> +     * as (x + x) (x quieted).
> +     */
> +    cmpl    4(%esp),$0
> +    sbbl    $0xfff00000,%eax
> +    je      minus_inf
> +
> +nan:
>      fldl    4(%esp)
> +    ret
>
> +minus_inf:
> +    fldz
> +    ret
>
> +finite:
> +    fldl    4(%esp)
> +
> @@ -80,19 +98,3 @@
>      ret
> -
> -x_Inf_or_NaN:
> -    /*
> -     * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
> -     * this gives Inf, and when x is a NaN this gives the same result
> -     * as (x + x) (x quieted).
> -     */
> -    cmpl    $0xfff00000,8(%esp)
> -    jne     x_not_minus_Inf
> -    cmpl    $0,4(%esp)
> -    jne     x_not_minus_Inf
> -    fldz
> -    ret
> -
> -x_not_minus_Inf:
> -    fldl    4(%esp)
> -    ret

Details not checked.  Space/time efficiency doesn't matter in the
non-finite case.  But see s_expl.c, where the magic expression (-1 / x)
is used for the return value to optimize for space (it avoids branches,
but the division is slow).

> END(exp)
>
> --- -/releng/12.0/lib/msun/i387/s_finite.S
> +++ +/releng/12.0/lib/msun/i387/s_finite.S

This function has several layers of reasons to not exist.  It seems to be
only a Sun extension to C90.  It is not declared in <math.h>, but exists
in libm as namespace pollution to support old ABIs.  C99 has the better
API isfinite(), which is type-generic.  I thought that this was usually
inlined.  Actually, it seems to be implemented by calling __isfinite(),
and not this finite().  libm also has finite() in C.  Not inlining this
and/or having no way to know if it is efficiently inlined makes it
unusable in optimized code.

> @@ -39,8 +39,8 @@
> ENTRY(finite)
>      movl    8(%esp),%eax
> -    andl    $0x7ff00000, %eax
> -    cmpl    $0x7ff00000, %eax
> +    addl    %eax, %eax
> +    cmpl    $0xffe00000, %eax

This doesn't reduce the number of instructions or dependencies, so it is
less worth doing than similar changes above.

>      setneb  %al

This is now broken, since setneb is only correct after masking out the
unimportant bits.

> -    andl    $0x000000ff, %eax
> +    movzbl  %al, %eax
>      ret

Old bug: extra instructions to avoid the branch might be a pessimization
on all CPUs:
- perhaps cmov is best on newer CPUs, but it is unportable
- the extra instructions, and possibly movz instead of and, are slower on
  old CPUs, while branch prediction is fast for the usual case on newer
  CPUs.

> END(finite)

Check what compilers generate for the C versions of finite() and
__isfinite() with -fomit-frame-pointer -march=mumble (especially i386)
and __predict_mumble().  The best code (better than the above) is for
finite().  Oops, it is only gcc-4.2.1 that generates very bad code for
__isfinite().  s_finite.c uses masks, and compilers don't reorganize this
much.  s_isfinite.c uses hard-coded bit-fields which some compilers don't
optimize very well.  Neither does the above, nor the standard access
macros using bit-fields -- they tend to produce store-to-load mismatches.

Well, I finally found where this is inlined.  Use __builtin_isfinite()
instead of isfinite().  Then gcc generates a libcall to
__builtin_isfinite(), while clang generates inline code which is much
larger and often slower than any of the above, but it at least avoids
store-to-load mismatches and doesn't misclassify long doubles in
unsupported formats as finite when they are actually NaNs.  It also
generates exceptions for signaling NaNs in some cases, which is arguably
wrong.  The fpclassify and isfinite, etc., macros in <math.h> are already
too complicated, but not nearly complicated enough to decide if the
corresponding builtins should be used.
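For comparison, the mask-based C finite() is roughly the following (a
from-memory sketch of what s_finite.c does, not a verbatim copy).  The
borrow from the subtraction lands in the sign bit, so there is no branch
and no setcc at all, and this is the version whose generated code is the
one worth comparing the asm against:

#include <sys/types.h>
#include "math_private.h"       /* GET_HIGH_WORD() */

int
finite(double x)
{
        int32_t hx;

        GET_HIGH_WORD(hx, x);
        /* The subtraction borrows iff the exponent is below 0x7ff,
         * i.e. iff x is finite; the borrow ends up in bit 31. */
        return ((int)((u_int32_t)((hx & 0x7fffffff) - 0x7ff00000) >> 31));
}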
Bruce