Date: Wed, 11 Feb 2015 02:37:05 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Jung-uk Kim <jkim@freebsd.org> Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, John Baldwin <jhb@freebsd.org> Subject: Re: svn commit: r278474 - head/sys/sys Message-ID: <20150211014516.N1511@besplex.bde.org> In-Reply-To: <54D92CE8.1030803@FreeBSD.org> References: <201502092103.t19L3OAn013792@svn.freebsd.org> <1698688.fEq0HxqPxg@ralph.baldwin.cx> <54D92CE8.1030803@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Mon, 9 Feb 2015, Jung-uk Kim wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA256 > > On 02/09/2015 16:08, John Baldwin wrote: >> On Monday, February 09, 2015 09:03:24 PM John Baldwin wrote: >>> Author: jhb Date: Mon Feb 9 21:03:23 2015 New Revision: 278474 >>> URL: https://svnweb.freebsd.org/changeset/base/278474 >>> >>> Log: Use __builtin_popcnt() to implement a BIT_COUNT() operation >>> for bitsets and use this to implement CPU_COUNT() to count the >>> number of CPUs in a cpuset. >>> >>> MFC after: 2 weeks >> >> Yes, __builtin_popcnt() works with GCC 4.2. It should also allow >> the compiler to DTRT in userland uses of this if -msse4.2 is >> enabled. > > Back in 2012, when I submitted a similar patch, bde noted > __builtin_popcount*() cannot be used with GCC 4.2 for *kernel* because > it emits a library call. Also, the gcc library uses a fairly slow implementation (a 256-byte lookup table), even in gcc-4.8. On modern CPUs, this is probably slower than the bit-magic algorithm in <sys/systm.h>. In the usual case where the arch doesn't support popcount in hardware, clang generates inline code that uses much the same bit-magic algorithm as in <sys/systm.h>. It also uses this for its libgcc.a. gcc48 only has library support for 32-bit and 64-bit integers on 64-bit arches. clang also supports a "ti2" (128-bit?) integer type in its libgcc.a. I don't know how to test this. (__int128 works for declaring 128-bit variables, but the only popcount builtins according to the available documentation (strings clang) are *popcount(), *popcountl() and *popcountll(), and ll is the same as l on amd64. The implementation in <sys/systm.h> is a silly optimization. It also supports 16-bit integers but not 64-bit ones. If optimizing popcount was important, then it would be done in libc first and would support 64-bit integers. It would use the builtin iff it is available and doesn't reduce to a library call that is slower than the bit-magic algorithm. This is not easy, since compilers don't even tell you if their builtins reduce to library calls, and whether they do depends on the arch (*). Older compilers also don't tell you if their builtins exist. Some builtins have a type-generic API, but unfortunately this not one. This means that the builtin can barely do as well as the bit-magic inline function for 16-bit integers since you cannot even write the name of the builtin for this case (you have to promote to 32 bits; popcount on this is probably faster than bit-magic, but bit-magic has more bits to look at). Similarly for 128-bit intergers. (*) Since generic amd64 and i386 have no popcount instruction in hardware, using builtin popcount rarely uses the hardware instruction (it takes special -march to get it, and the resulting binaries don't run on generic CPUs). Thus using the builtin works worse than using the old inline function in most cases. Except, the old inline function is only implemented in the kernel, and isn't implemented for 64-bit integers. gcc-4.8 generates the hardware popcount if the arch supports it. Only its library popcounts are slower than clang's. gcc-4.2 presumably doesn't generate the hardware popcount, since it doesn't have a -march for newer CPUs that have it. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150211014516.N1511>