From owner-freebsd-arch@FreeBSD.ORG Fri Nov 16 07:15:03 2012 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id A7979D9B; Fri, 16 Nov 2012 07:15:03 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx07.syd.optusnet.com.au (fallbackmx07.syd.optusnet.com.au [211.29.132.9]) by mx1.freebsd.org (Postfix) with ESMTP id 378F28FC13; Fri, 16 Nov 2012 07:15:02 +0000 (UTC) Received: from mail10.syd.optusnet.com.au (mail10.syd.optusnet.com.au [211.29.132.191]) by fallbackmx07.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qAG77Bo3032135; Fri, 16 Nov 2012 18:07:11 +1100 Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qAG770Ih009910 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Fri, 16 Nov 2012 18:07:02 +1100 Date: Fri, 16 Nov 2012 18:07:00 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jung-uk Kim Subject: Re: [RFC] Generic population count function In-Reply-To: <50A548C6.5070208@FreeBSD.org> Message-ID: <20121116171923.L1135@besplex.bde.org> References: <50A43B52.8030102@FreeBSD.org> <50A477BA.4020700@freebsd.org> <50A548C6.5070208@FreeBSD.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-Cloudmark-Score: 0 X-Optus-Cloudmark-Analysis: v=2.0 cv=F/YP7ddN c=1 sm=1 a=nz2LO5XVLtEA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=hMQG39urcAQA:10 a=cqvmlmxeAAAA:8 a=qDOkOHFhAAAA:8 a=d9KzjkUFrCz0tZpBnxYA:9 a=CjuIK1q_8ugA:10 a=t_ttxzwXqboA:10 a=gosoKpmXwM4A:10 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: freebsd-arch@freebsd.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 16 Nov 2012 07:15:03 -0000 On Thu, 15 Nov 2012, Jung-uk Kim wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 2012-11-15 00:03:54 -0500, Julian Elischer wrote: >> there are more efficient algorithms than the one you show in >> bitcount.h see the bit-twidling site: >> >> http://graphics.stanford.edu/~seander/bithacks.html > > Yes, I know that. __bitcount32() is a fall-back for *unsupported* > compilers anyway, I didn't see much point in improving it further. If Unsupported arches? That is most of them, since most don't have popcount in hardware. I still think the fallback should be to a simple lookup table. The old bitcount32() and bitcount16() do lots of shifts by more than 1, so they assume that the hardware has a fast barrel shifter. This assumption is satisfied on x86 starting with i486 IIRC. They have lots of instructions so they assume that instructions are fast. This assumption is not quite satisfied on modern x86, since there are many serial dependencies between the instructions. A simple lookup method would assume fast memory, which is usually not satisified, or cached memory, which often is. The optimality of alternative implementations is even more MD. The above URL gives some 64-bit implementations for 14, 24 and 32-bit counts (the magic in __bitcount32() can be optimized by using wider masks), but that is obviously not going to work well if it is used on 32-bit systems where the 64-bit arithmetic is emulated. > you are looking for better population count functions, this site is > better, actually: > > http://www.dalkescientific.com/writings/diary/archive/2011/11/02/faster_popcount_update.html It agrees with me by saying that the conclusion in 2008 was that the lookup table was competitive :-). Things are a little different now, but it is even more complicated to even consider all the possibilities. The best methods (when there is no hardware popcount) mostly use SSE, and that should have been better in 2008 too, but SSE is not always available even on x86, and is unavailable in practice in the kernel (you don't want to switch the FP state just to do a couple of popcounts). gcc's builtin popcount is fairly useless since it often reduces to a libcall. I just tried __builtin_popcount() and got the following results: - gcc-3.3.3: unavailable - gcc-4.2.1: reduces to the libcall of __popcountdi2, at least with no -march and with -march=native on freefall. __popcountdi2 is used for both 32-bit and 64-bit counts. - clang: does the same bitcount32() inline, with the final 2 shifts optimized to a single multiplication. That is with no -march. With -march=native, it issues a popcountl instruction. Similarly in both cases for __builtin_popcountl(). Unsimilarly for __builtin_popcountimax(): clang is broken and doesn't have it. It only has __builtin_popcount[2-letter-expletive](). Not too bad, but possibly still worse than a libcall could do: - distributions can't be compiled with -march=native, so they can't use popcount[lq], but perhaps the library can - bitcount32() is probably suboptimal for a shift-and-mask version on a 64-bit arch. Bruce