Date: Sun, 16 Sep 2012 05:00:41 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Stephen Montgomery-Smith <stephen@missouri.edu> Cc: freebsd-numerics@FreeBSD.org Subject: Re: Complex arg-trig functions Message-ID: <20120916041132.D6344@besplex.bde.org> In-Reply-To: <5054C200.7090307@missouri.edu> References: <5017111E.6060003@missouri.edu> <20120814003614.H3692@besplex.bde.org> <50295F5C.6010800@missouri.edu> <20120814072946.S5260@besplex.bde.org> <50297CA5.5010900@missouri.edu> <50297E43.7090309@missouri.edu> <20120814201105.T934@besplex.bde.org> <502A780B.2010106@missouri.edu> <20120815223631.N1751@besplex.bde.org> <502C0CF8.8040003@missouri.edu> <20120906221028.O1542@besplex.bde.org> <5048D00B.8010401@missouri.edu> <504D3CCD.2050006@missouri.edu> <504FF726.9060001@missouri.edu> <20120912191556.F1078@besplex.bde.org> <20120912225847.J1771@besplex.bde.org> <50511B40.3070009@missouri.edu> <20120913204808.T1964@besplex.bde.org> <5051F59C.6000603@missouri.edu> <20120914014208.I2862@besplex.bde.org> <50526050.2070303@missouri.edu> <20120914212403.H1983@besplex.bde.org> <50538E28.6050400@missouri.edu> <20120915231032.C2669@besplex.bde.org> <50548E15.3010405@missouri.edu> <5054C027.2040008@missouri.edu> <5054C200.7090307@missouri.edu>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 15 Sep 2012, Stephen Montgomery-Smith wrote: > On 09/15/2012 12:51 PM, Stephen Montgomery-Smith wrote: >> On 09/15/2012 09:17 AM, Stephen Montgomery-Smith wrote: >>> On 09/15/2012 08:45 AM, Bruce Evans wrote: >>> >>> I had a hard time understanding what exactly the style fixes here were. >>> Probably the mail clients (either mine or yours) clobber these. They >>> seem like they are to do with the spacing, but I am not sure. >>> >>>> % +B_crossover = 0.6417, /* suggested by Hull et al */ > > I am going to guess that you wanted a tab instead of a space. Right. This one was a tab and another one was a trailing space. indent(1) would have fixed the latter, but it mangles the constant table declarations completely. > So, I think I have implemented all of the new changes. > > One more thing - my intent is that when they are committed, that catrig*.c > should be split into to two files. One contains casin(h) and cacos(h). The > other contains catan(h). And then some of those headings surrounded by boxes > can be removed. > > But that is up to you guys. Maybe even 3 files. fdlibm uses one entry point per file to a fault, but for log* I think I want all 4 standard entry points for all 3 precisions in 1 file so that tables can be shared. I think I figured out why cacosf() is faster on i386(core2) when it just calls cacos(). There are lots of (non-spurious) underflows, for at least atan2(y, x) where x > 0 and y/x underflows. I test with logarithmically uniformly distributed y and x, so the average difference of exponents is large and this underflow happens in a significant fraction of all cases. The result always underflows, but with catrigf.c, especially on i386, there are several more steps with the underflowing y/x: from e_atan2f.c: % ... % else if(k<-26&&hx<0) z=0.0; /* 0 > |y|/x > -2**-26 */ In this case, the result will be near +-pi, and we must avoid underflow, which we do by setting z to 0. % else z=atanf(fabsf(y/x)); /* safe to do y/x */ When k<-26 but x>0, the result will be near 0, and may underflow. On i386, y/x is normally evaluated in extra precision so it doesn't normally underflow. Strictly, it should be converted to float before passing it to fabsf() and underflow then, but fabsf() is normally a builtin and compiler bugfeatures usually result in no conversion. But atanf() is always extern, so the ABI forces a conversion, giving the exception. Then if the underflow is gradual, atanf() sees a denormal value and may give further slowness (but it should just return this value, so there should be no slowness from it for underflow, but there might be on some hardware for denormals. When y/x underflows, atanf() on fabsf() of it is just |y|/x, so the above could be optimized to avoid the extra conversions and operations. % switch (m) { % case 0: return z ; /* atan(+,+) */ % case 1: return -z ; /* atan(-,+) */ These 2 cases are when x > 0. When y < 0, we negate the |y|/x produced above, having done extra work to mess up the sign of y/x to merge the cases. This operation may cause further underflow or denormal exceptions. Even returning z unchanged may cause exceptions, depending on optimizations. z when it is returned by atanf() is normally in an i387 register on i386 and it will not be denormal there; when fully optimized, this will be returned unchanged, but if z is stored to memory then there will be denormal and underflow exceptions for conversions that are part of the store. % case 2: return pi-(z-pi_lo);/* atan(+,-) */ % default: /* case 3 */ % return (z-pi_lo)-pi;/* atan(-,-) */ % } On i386(athlon64), underflows don't cost speed and direct cacosf() is faster, but amd64(athlon64) acts strangely instead: on amd64(core2), direct catrigf.c is significantly faster, but on amd64(athlon64) there is little difference. On amd64, because it doesn't use extended precision, denormals inherently cost less because they stay denormal and don't cause underflow exceptions on conversion back to themselves (or nearby) after load/modify/store, provied the modification doesn't cause an exception. I think fabsf() on a float denormal doesn't cause an exception in SSE or i387, but it does on i386 iff the result is stored to memory. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120916041132.D6344>