Date: Fri, 16 Nov 2012 18:07:00 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jung-uk Kim <jkim@freebsd.org>
Cc: freebsd-arch@freebsd.org
Subject: Re: [RFC] Generic population count function
Message-ID: <20121116171923.L1135@besplex.bde.org>
In-Reply-To: <50A548C6.5070208@FreeBSD.org>
References: <50A43B52.8030102@FreeBSD.org> <50A477BA.4020700@freebsd.org> <50A548C6.5070208@FreeBSD.org>
On Thu, 15 Nov 2012, Jung-uk Kim wrote:

> On 2012-11-15 00:03:54 -0500, Julian Elischer wrote:
>> there are more efficient algorithms than the one you show in
>> bitcount.h; see the bit-twiddling site:
>>
>> http://graphics.stanford.edu/~seander/bithacks.html
>
> Yes, I know that.  __bitcount32() is a fall-back for *unsupported*
> compilers anyway, so I didn't see much point in improving it further.

Unsupported arches?  That is most of them, since most don't have popcount
in hardware.  I still think the fallback should be a simple lookup table.

The old bitcount32() and bitcount16() do lots of shifts by more than 1,
so they assume that the hardware has a fast barrel shifter.  This
assumption is satisfied on x86 starting with the i486, IIRC.  They also
use lots of instructions, so they assume that instructions are fast.
This assumption is not quite satisfied on modern x86, since there are
many serial dependencies between the instructions.  A simple lookup
method would assume fast memory, which is usually not satisfied, or
cached memory, which often is.

The optimality of alternative implementations is even more MD.  The above
URL gives some 64-bit implementations for 14-, 24- and 32-bit counts (the
magic in __bitcount32() can be optimized by using wider masks), but that
is obviously not going to work well if it is used on 32-bit systems where
64-bit arithmetic is emulated.

> If you are looking for better population count functions, this site is
> better, actually:
>
> http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html

It agrees with me by saying that the conclusion in 2008 was that the
lookup table was competitive :-).  Things are a little different now, but
it is even more complicated to even consider all the possibilities.
The best methods (when there is no hardware popcount) mostly use SSE, and
that should have been better in 2008 too, but SSE is not always available
even on x86, and is unavailable in practice in the kernel (you don't want
to switch the FP state just to do a couple of popcounts).

gcc's builtin popcount is fairly useless since it often reduces to a
libcall.  I just tried __builtin_popcount() and got the following
results:
- gcc-3.3.3: unavailable
- gcc-4.2.1: reduces to a libcall to __popcountdi2, at least with no
  -march and with -march=native on freefall.  __popcountdi2 is used for
  both 32-bit and 64-bit counts.
- clang: does the same as bitcount32() inline, with the final 2 shifts
  optimized to a single multiplication.  That is with no -march.  With
  -march=native, it issues a popcountl instruction.  Similarly in both
  cases for __builtin_popcountl().  Unsimilarly for
  __builtin_popcountimax(): clang is broken and doesn't have it.  It only
  has __builtin_popcount[2-letter-expletive]().

Not too bad, but possibly still worse than a libcall could do:
- distributions can't be compiled with -march=native, so they can't use
  popcount[lq], but perhaps the library can
- bitcount32() is probably suboptimal for a shift-and-mask version on a
  64-bit arch.

Bruce