Date:        Wed, 27 Feb 2019 08:19:06 -0800
From:        Steve Kargl <sgk@troutmask.apl.washington.edu>
To:          Bruce Evans <brde@optusnet.com.au>
Cc:          freebsd-numerics@freebsd.org
Subject:     Re: Update ENTERI() macro
Message-ID:  <20190227161906.GA77785@troutmask.apl.washington.edu>
In-Reply-To: <20190227201214.V1823@besplex.bde.org>
References:  <20190226191825.GA68479@troutmask.apl.washington.edu> <20190227145002.P907@besplex.bde.org> <20190227074811.GA75972@troutmask.apl.washington.edu> <20190227201214.V1823@besplex.bde.org>
On Wed, Feb 27, 2019 at 09:15:52PM +1100, Bruce Evans wrote:
> On Tue, 26 Feb 2019, Steve Kargl wrote:
>
> > On Wed, Feb 27, 2019 at 05:05:15PM +1100, Bruce Evans wrote:
> >> On Tue, 26 Feb 2019, Steve Kargl wrote:
> > ...
> >>> Update the ENTERI() macro in math_private.h to take a parameter.
> >> ...
> >> I don't like this.  It churns and complicates all the simple cases
> >> that only need ENTERI().  It bogotifies the existence of ENTERIT(),
> > ...
> > Okay.  The other option is an ENTERC() and RETURNC(), as
> > we need to toggle FP_PE for long double complex functions.
> > I suppose I could follow the one example currently in the
> > tree that uses
> >
> >     ENTERIT(long double complex)
> >
> > I find it somewhat odd that we have
> >
> >     ENTERI()   /* Implicit declaration of __retval as long double. */
> >
> > but must use ENTERIT(long double complex) directly.
>
> ENTERI() hard-codes the long double for simplicity.  Remember, it is only
> needed for long double precision on i386.  But I forgot about long double
> complex types, and didn't dream about indirect long double types in sincosl().

That simplicity does not work for long double complex.  We will need
either ENTERIC as in

#define ENTERIC()	ENTERIT(long double complex)

or a direct use of ENTERIT as you have done in s_clogl.c.

> > ...
> >>> -#define RETURNI(x)	RETURNF(x)
> >>> +#define ENTERI(a)
> >>> +#define RETURNI(a)	RETURNF(a)
> >>>  #define ENTERV()
> >>>  #define RETURNV()	return
> >>>  #endif
> >>
> >> This also changes RETURNI(), by unimproving its parameter name.  'x' for
> >> ENTERI() wasn't a very good name for a type, but is good for a variable.
> >> 'x' for RETURNI() is slightly worse than 'r', but better than 'a'.
> >
> > The renaming is for consistency.  I can use 'r'.
>
> 'r' is not quite right either, since the arg can be and is often an
> expression.  'a' is good for 'arg'.
>
> >> ...
> >> But I now see 3 more problems.  The return in RETURNI() is not direct,
> >> but goes through the macro RETURNF(x).  In the committed version, this
> >> is a default that just returns x, but in my version it returns
> >> hackdouble_t(x) or hackfloat_t(x) in some cases (no cases are needed
> >> for long doubles, so there is no interaction with ENTERI()/LEAVEI(),
> >> and I only do this in a few simple cases not including any with
> >> complex types).
> >
> > I'm fine with making ENTERI() only toggle precision, and adding
> > a LEAVEI() to reset precision.  RETURNI(r) would then be
> >
> > #define RETURNI(r)	\
> > do {			\
> > 	LEAVEI();	\
> > 	return (r);	\
> > } while (0)
>
> No, the arg may be an expression, so it must be evaluated before LEAVEI().
> This is the reason for existence of the variable to hold the result.

So, we'll need one RETURNI for long double and another for long double
complex.  Or, we give RETURNI a second parameter, which is the input
parameter of the function, to hold the result:

#define RETURNI(x, r)	\
do {			\
	x = (r);	\
	LEAVEI();	\
	return (x);	\
} while (0)

This will cause a lot of churn.  So, it seems that ENTERIC is the way
forward.
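For concreteness, a long double complex function built on such an ENTERIC()
would look roughly like the following.  This is only an illustrative sketch,
not a patch; 'cfoo' is a made-up name, and on i386 it relies on
ENTERIT()/RETURNI() behaving as described above (declare __retval, toggle
FP_PE, and restore the old precision before returning):

#include <complex.h>

#include "math_private.h"

#define	ENTERIC()	ENTERIT(long double complex)

long double complex
cfoo(long double complex z)
{
	ENTERIC();		/* declare __retval, switch i386 precision to FP_PE */

	/* ... do the real work in extended precision ... */

	RETURNI(z * z);		/* store the result, restore precision, return it */
}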
> >> [... about complications for the general case]
>
> >> This reminds me of a reason why I don't like sincos*().  Its API
> >> requires destruction of efficiency and accuracy by returning the values
> >> indirectly.  On i386 with not very old CPUs, this costs about 8 cycles per
> >> long double value.  Float and double values cost about half as much.  On
> >> amd64, the long double case is the same and the float and double cases
> >> are faster.
> >
> > Not sure your efficiency claim holds.  I've seen significant improvements
> > in cexp and cexpf where sin[f]() and cos[f]() are replaced by
> > sincos[f].  On my core2 running i386 FreeBSD, I see 0.1779 usecs/call
> > for cexpf with sinf and cosf and 0.12522 usecs/call for sincosf.
> > Yes, that's a 29.6% improvement.  For cexp the numbers are 0.2697
> > usecs/call for sin and cos and 0.20586 for sincos (i.e., a 23.7% improvement).
> > This is for z = x + I*y with x and y in the non-exceptional case.
>
> Combined sin and cos probably does work better outside of benchmarks for
> sin and cos alone, since it does less work so leaves more resources for
> the other, more useful things.

Exactly!  I have a significant amount of Fortran code that does

   z = cmplx(cos(x), sin(x))

In modern C this is 'z = CMPLX(cos(x), sin(x))'.  GCC with optimization
enabled will convert this to z = cexp(CMPLX(0, x)), where it expects cexp
to optimize this to sincos().  GCC on FreeBSD will not do this
optimization because FreeBSD's libm is not C99 compliant.

> >> sinf() and cosf() on small args take only 15-20 cycles (throughput) on
> >> amd64 with not very old CPUs, so 2-8 extra cycles for the 2 indirect
> >> return values is a lot.  sincosf() still ends up being slightly faster
> >> than separate sinf()/cosf().
> >
> > Seems to be much faster when used in other functions.
>
> It's hard to be much faster than 15-20 cycles.  The latency is more like
> 50 cycles, with 3 sinf()'s or cosf()'s running in parallel.
>
> sincos() is far from the best possible optimization for repeated calls on
> the same or nearby args.  If sin() and cos() cached the arg reduction, then
> separate sin() and cos() on the same arg would run about as fast as sincos(),
> and repeated sin()'s on the same arg would run much faster than now.
> Caching the arg reduction may also be good when the arg changes slightly.
> However, caching is slower if the args are not close.  Even a 1-entry cache
> takes a long time to look up relative to the 15-20 cycles taken by sinf()
> and cosf().  Caching is complicated by signal handlers and threads.  Perhaps
> the right API is one where the caller has to ask for caching and provides
> the cache storage.  Then sincos() could be:
>
> 	...
> 	_dh_init(x, &dh);		/* prefill 1-entry cache dh */
> 	s = _sin_cache(x, &dh, 1);	/* cache hit unless x is NaN */
> 					/* cache misses update dh */
> 	c = _cos_cache(x, &dh, 1);	/* cache hit unless x is NaN */
> 	...
>
> and with everything inlined this is little different from the current
> sincos() except for NaNs.  NaNs can be cache hits too if you compare
> them as bits, but the comparison should probably be x == dhp->dh_x
> for a 1-entry cache, so as not to have to extract the bits of x.

When I worked on sincos() I tried a few variations.  This included
the simplest implementation:

void
sincos(double x, double *s, double *c)
{
	*c = cos(x);
	*s = sin(x);
}

I tried argument reduction with kernels:

void
sincos(double x, double *s, double *c)
{
	/* ... inline argument reduction to set a ... */
	*c = k_cos(a);
	*s = k_sin(a);
}

And finally the version that was committed, where k_cos and k_sin were
manually inlined and re-arranged to reduce redundant computations.

Never thought about some caching mechanism.  It seems to be more
complicated than it may be worth.

-- 
Steve
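P.S.  The cexp[f]() speedup discussed above comes from the identity
cexp(x + I*y) = exp(x)*(cos(y) + I*sin(y)): both the cosine and the sine
of y are always needed, so one sincos() call can replace the two separate
calls.  A stripped-down illustration (a toy, not the committed s_cexp.c;
it ignores all of the exceptional cases):

#include <complex.h>
#include <math.h>

double complex
toy_cexp(double complex z)
{
	double c, r, s;

	r = exp(creal(z));
	sincos(cimag(z), &s, &c);	/* one call gives both sin and cos */
	return (CMPLX(r * c, r * s));
}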