Date: Fri, 10 Jun 2011 17:08:50 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Craig Rodrigues <rodrigc@freebsd.org> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r222908 - head/lib/libstand Message-ID: <20110610160628.Y1015@besplex.bde.org> In-Reply-To: <201106100113.p5A1DFUx003201@svn.freebsd.org> References: <201106100113.p5A1DFUx003201@svn.freebsd.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 10 Jun 2011, Craig Rodrigues wrote: > Log: > Obtain bswap64() function implementation from > version 1.3 of src/common/lib/libc/gen/bswap64.c in NetBSD. > > Obtained from: NetBSD Why not use the better (actually, differently bad, but in the correct place to fix) implementation already in FreeBSD (in <machine/endian.h>? libstand isn't so standalone that it can't use basic system headers. > Modified: head/lib/libstand/bswap.c > ============================================================================== > --- head/lib/libstand/bswap.c Thu Jun 9 23:12:23 2011 (r222907) > +++ head/lib/libstand/bswap.c Fri Jun 10 01:13:15 2011 (r222908) > ... > @@ -30,12 +30,28 @@ bswap32(u_int32_t x) > > u_int64_t > bswap64(u_int64_t x) > -{ > - u_int32_t *p = (u_int32_t*)&x; > - u_int32_t t; > - t = bswap32(p[0]); > - p[0] = bswap32(p[1]); > - p[1] = t; > - return x; > -} This was adequate on some arches, but has style bugs and bad type puns which might break it on other arches. > - > +{ > +#ifdef _LP64 > + /* > + * Assume we have wide enough registers to do it without touching > + * memory. > + */ There are negative reasons to optimize this for libstand, but <machine/endian.h> already does similar optimizations, with messy ifdefs for each of about 10 different arches. > + return ( (x << 56) & 0xff00000000000000UL ) | > + ( (x << 40) & 0x00ff000000000000UL ) | > + ( (x << 24) & 0x0000ff0000000000UL ) | > + ( (x << 8) & 0x000000ff00000000UL ) | > + ( (x >> 8) & 0x00000000ff000000UL ) | > + ( (x >> 24) & 0x0000000000ff0000UL ) | > + ( (x >> 40) & 0x000000000000ff00UL ) | > + ( (x >> 56) & 0x00000000000000ffUL ); This has different style bugs (excessive whitespace), but it otherwise formatted better than the similar expression which is spammed into too many <machine/endian.h>'s. This expression is MI except for the constants in it. But it is far from optimal with some compilers, so too many of <machine/endian.h>'s only use it when a compile-time constant expression is needed -- they use asms with bswap instructions in them in other cases. OTOH, this expression is optimal for clang, both for amd64 and i386 IIRC -- clang turns it into 1 bswap for amd64, and (after adjusting the constants) into 2 bswaps for i386. gcc3.x and gcc-4.2.1 generate large slow code for this expression. > +#else > + /* > + * Split the operation in two 32bit steps. > + */ > + u_int32_t tl, th; > + > + th = bswap32((u_int32_t)(x & 0x00000000ffffffffULL)); > + tl = bswap32((u_int32_t)((x >> 32) & 0x00000000ffffffffULL)); > + return ((u_int64_t)th << 32) | tl; This is a good way to avoid the type puns in the original version, except for the style bugs in it (use of the long long abomination, in a way that doesn't even have any effect). On i386, this results in essentially the same good code that clang generates for the big MI expression (one (virtual) swap of a register pair to swap 2 32-bit words, then 2 bswaps to swap the words internally. then more virtual instructions to recombine the words), provided bswap32 is inline and reduces to the bswap instruction, either via an asm or by optimization (but in libstand it is neither). (At least if bswap32() in the above is replaced by _bswap32() from <machine/endian.h>, clang optimizes the above to exactly the same 2 bswaps as it does for the big MI expression; gcc is not so good, and generates about 8 movl instructions instead of the virtual swapping.) The above is not so optimal for LP64 (even clang doesn't manage to convert the 2 32-bit bswaps to 1 64-bit one, at least if bswap32() is in asm), but is adequate in libstand. > +#endif > +} Anyway, the messy ifdefs and micro-optimizations for this should not be duplicated here. When gcc optimization catches up with where clang already is, all the messy ifdefs and MD asms in <machine/endian.h>'s can be replaced by single MI expressions, and everything that doesn't roll its own endian functions will benefit automatically. The different badness now in i386/include/endian.h is now: - the big MI expressions for bswap32 and bswap64 (named __bswapNN_const) are spammed into this and all (?) other MD files - these expressions are missing some casts. For example, if the _x arg is either signed (with the sign bit set), or of a larger type (with higher bits set), then (_x) >> 56 gives a wrong value. - these expressions are poorly formatted - the big MI expression for bswap64 is used for both the "const" and the "var" cases. This is good for the "const" case, but for the "var" case it pessimal for gcc. This badness is sort of the opposite of the above for libstand -- i386/include/endian.h should optimize and use the !_LP64 code above, while libstand doesn't need this optimization but has it. amd64 doesn't have this problem since it uses a MD asm for bswap64_var. There is a PR about bswap64 not being optimized for i386. Apparently, the optimization actually matters for some cases (for something like swab(3) on large data?). See the PR for more details. My followup explains why fancy asms are not needed to fix this -- as above, the big MI expression works perfectly for clang, but for gcc an optimization like the above is good (generate 2 bswap instructions using the asms for bswap32, but don't use asms to combine these. But use a better C expression than the above to combine them, so that gcc doesn't generate so many movls). Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20110610160628.Y1015>