Date:      Fri, 16 Nov 2012 18:07:00 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Jung-uk Kim <jkim@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: [RFC] Generic population count function
Message-ID:  <20121116171923.L1135@besplex.bde.org>
In-Reply-To: <50A548C6.5070208@FreeBSD.org>
References:  <50A43B52.8030102@FreeBSD.org> <50A477BA.4020700@freebsd.org> <50A548C6.5070208@FreeBSD.org>

On Thu, 15 Nov 2012, Jung-uk Kim wrote:

> On 2012-11-15 00:03:54 -0500, Julian Elischer wrote:
>> there are more efficient algorithms than the one you show in
>> bitcount.h see the bit-twidling site:
>>
>> http://graphics.stanford.edu/~seander/bithacks.html
>
> Yes, I know that.  __bitcount32() is a fall-back for *unsupported*
> compilers anyway, I didn't see much point in improving it further.  If

Unsupported arches?  That is most of them, since most don't have popcount
in hardware.

I still think the fallback should be to a simple lookup table.  The old
bitcount32() and bitcount16() do lots of shifts by more than 1, so they
assume that the hardware has a fast barrel shifter.  This assumption is
satisfied on x86 starting with i486 IIRC.  They have lots of instructions
so they assume that instructions are fast.  This assumption is not quite
satisfied on modern x86, since there are many serial dependencies between
the instructions.  A simple lookup method would assume fast memory, which
is usually not satisfied, or cached memory, which often is.  The optimality
of alternative implementations is even more MD.  The above URL gives some
64-bit implementations for 14, 24 and 32-bit counts (the magic in
__bitcount32() can be optimized by using wider masks), but that is
obviously not going to work well if it is used on 32-bit systems where the
64-bit arithmetic is emulated.
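To make the comparison concrete, here is a minimal sketch of both fallbacks:
an 8-bit lookup table (the classic bithacks table-building macros) and the
shift-and-mask magic widened to 64-bit masks.  The names popcount8_tab,
popcount32_table and popcount64_masks are mine, not proposed FreeBSD
interfaces:

```c
#include <stdint.h>

/* Build a 256-entry byte-popcount table at compile time. */
#define B2(n)	(n), (n) + 1, (n) + 1, (n) + 2
#define B4(n)	B2(n), B2((n) + 1), B2((n) + 1), B2((n) + 2)
#define B6(n)	B4(n), B4((n) + 1), B4((n) + 1), B4((n) + 2)
static const uint8_t popcount8_tab[256] = { B6(0), B6(1), B6(1), B6(2) };

/* Table method: 4 loads + 3 adds; fast iff the table stays cached. */
static inline unsigned
popcount32_table(uint32_t x)
{

	return (popcount8_tab[x & 0xff] + popcount8_tab[(x >> 8) & 0xff] +
	    popcount8_tab[(x >> 16) & 0xff] + popcount8_tab[x >> 24]);
}

/* Shift-and-mask method with masks widened to 64 bits, so one pass
 * counts a full 64-bit word; only attractive where 64-bit arithmetic
 * is native, not emulated. */
static inline unsigned
popcount64_masks(uint64_t x)
{

	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return ((x * 0x0101010101010101ULL) >> 56);
}
```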

> you are looking for better population count functions, this site is
> better, actually:
>
> http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html

It agrees with me by saying that the conclusion in 2008 was that the lookup
table was competitive :-).  Things are a little different now, but
considering all the possibilities is even more complicated.  The best
methods (when there is no hardware popcount) mostly use SSE, and that
should have been better in 2008 too, but SSE is not always available even
on x86, and is unavailable in practice in the kernel (you don't want to
switch the FP state just to do a couple of popcounts).

gcc's builtin popcount is fairly useless since it often reduces to a
libcall.  I just tried __builtin_popcount() and got the following
results:
- gcc-3.3.3: unavailable
- gcc-4.2.1: reduces to the libcall of __popcountdi2, at least with no
   -march and with -march=native on freefall.  __popcountdi2 is used for
   both 32-bit and 64-bit counts.
- clang: does the same bitcount32() inline, with the final 2 shifts
   optimized to a single multiplication.  That is with no -march.  With
   -march=native, it issues a popcountl instruction.  Similarly in both
   cases for __builtin_popcountl().  Unsimilarly for
   __builtin_popcountimax(): clang is broken and doesn't have it.  It
   only has __builtin_popcount[2-letter-expletive]().  Not too bad, but
   possibly still worse than a libcall could do:
   - distributions can't be compiled with -march=native, so they can't
     use popcount[lq], but perhaps the library can
   - bitcount32() is probably suboptimal for a shift-and-mask version on
     a 64-bit arch.
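The compile-time situation above can be probed with the preprocessor; a
hedged sketch (__POPCNT__ is the macro gcc and clang define when -march
enables the popcount instruction; the function name popcount32 is mine):

```c
#include <stdint.h>

static inline unsigned
popcount32(uint32_t x)
{
#if defined(__POPCNT__) || defined(__GNUC__)
	/* With __POPCNT__ this becomes a single popcntl; on plain gcc
	 * it may instead reduce to a __popcountsi2/__popcountdi2
	 * libcall, as observed with gcc-4.2.1 above. */
	return (__builtin_popcount(x));
#else
	/* Portable shift-and-mask fallback as in __bitcount32(). */
	x = x - ((x >> 1) & 0x55555555);
	x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
	x = (x + (x >> 4)) & 0x0f0f0f0f;
	return ((x * 0x01010101) >> 24);
#endif
}
```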

Bruce


