Date: Tue, 15 Jan 2019 01:35:42 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Steve Kargl <sgk@troutmask.apl.washington.edu> Cc: Pedro Giffuni <pfg@freebsd.org>, freebsd-numerics@freebsd.org Subject: Re: New math library from ARM Message-ID: <20190115005528.H1450@besplex.bde.org> In-Reply-To: <20181231152230.GC823@troutmask.apl.washington.edu> References: <797a7755-db93-1b9c-f3b9-8850d948e098@FreeBSD.org> <20181231151904.GB823@troutmask.apl.washington.edu> <20181231152230.GC823@troutmask.apl.washington.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 31 Dec 2018, Steve Kargl wrote: > On Mon, Dec 31, 2018 at 07:19:04AM -0800, Steve Kargl wrote: >> On Mon, Dec 31, 2018 at 10:06:57AM -0500, Pedro Giffuni wrote: >>> Hi; >>> >>> Just noted a recent initiative from musl-libc: >>> >>> https://www.openwall.com/lists/musl/2018/12/08/1 >>> >>> It appears they plan to replace their (FreeBSD) math code with a new ARM >>> implementation: >>> >>> https://github.com/ARM-software/optimized-routines >> >> The Copyright on this code is unclear. For example, in >> single/e_rem_pio2.c lines 1-6: >> >> /* >> * e_rem_pio2.c >> * >> * Copyright (c) 1999-2018, Arm Limited. >> * SPDX-License-Identifier: MIT >> */ >> >> Then lines 16-18: >> >> /* >> * Simple cases: all nicked from the fdlibm version for speed. >> */ >> >> The original fdlibm licenses applies ot the nicked lines. Really? The fdlibm version is quite slow, especially for simple cases. In FreeBSD, most of the optimizations were written by me, especially for "simple" cases: - arrange not to call here for |arg| < Pi/4 - special optiimizations for |arg| < 9*PI/4 (1 case for every pair of quadrants. These optimizations are fairly generic. - special optimizations for "medium" args. These optimizations are mostly to to hide the high latency of float to integer conversions on older i386 x86's. Lots of functions use similar code with many ifdefs. My current version has the details for this in inline functions in math_private.g It is possible to do better for arches with extra precision, but this is not done and is still too hard for i386 on FreeBSD since extra precision is unavailable by default, and gcc and clang have different bugs in their support for extra precision. But all of these optimizations work very well for newer x86's with newer compilers. This is because newer x86's and compilers have lower latency and/or better scheduling. I don't know what is best for arm. > There is also some interesting uses of float literal constants > with 80 digits when at most 9 are relevant. That might be to inspire (negative) confidence :-). Why does the arm e_rem_pio2.c even have any float constants? The fdlibm version has tables of small hex constants representing high-precision FP constants. FreeBSD moved these tables to k_rem_pio2.c. IIRC, these constants are loaded as FP values and the can probably be optimized better using not float constants more directly, or better double constants more directly. These tables are only used for large args, and the fdlibm code is quite slow for large args, but FreeBSD doesn't optimize this case. I used to think that x86 hardware handled this case much better than the slow fdlibm code, but on newer x86 the fdlibm code is sometimes faster than the hardware. On older x86, the hardware was only a few times faster. Versions optimized for extended precision would use some long double constants in e_rem_pio2.c. e_rem_pio2f.c already uses extra precision by using double precision for almost everything. However, on newer x86, this class of optimizations is barely worth doing, since the more generic code in e_rem_pio2.c is fast enough (the throughput is 1 cos() per 20-30 cycles on modern x86). Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190115005528.H1450>