From owner-freebsd-arch@FreeBSD.ORG Thu Nov 15 08:33:02 2012 Return-Path: Delivered-To: freebsd-arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 2B084B23; Thu, 15 Nov 2012 08:33:02 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au [211.29.132.190]) by mx1.freebsd.org (Postfix) with ESMTP id 8AD298FC16; Thu, 15 Nov 2012 08:33:01 +0000 (UTC) Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26]) by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qAF8WoBQ025930 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 15 Nov 2012 19:32:51 +1100 Date: Thu, 15 Nov 2012 19:32:50 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Jung-uk Kim Subject: Re: [RFC] Generic population count function In-Reply-To: <50A43B52.8030102@FreeBSD.org> Message-ID: <20121115181009.D42162@besplex.bde.org> References: <50A43B52.8030102@FreeBSD.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-Cloudmark-Score: 0 X-Optus-Cloudmark-Analysis: v=2.0 cv=eqGHVfVX c=1 sm=1 a=nz2LO5XVLtEA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=hMQG39urcAQA:10 a=6I5d2MoRAAAA:8 a=Y1UM5pQ7PpipOtpjvNUA:9 a=CjuIK1q_8ugA:10 a=3M_0ICQ7WrFIE0d4:21 a=Ei4GSbzkcvkxRff1:21 a=bxQHXO5Py4tHmhUgaywp5w==:117 Cc: freebsd-arch@FreeBSD.org X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 15 Nov 2012 08:33:02 -0000 On Wed, 14 Nov 2012, Jung-uk Kim wrote: > I implemented generic population count function. Please see the > attachment. It is also available from here: > > http://people.freebsd.org/~jkim/bitcount.diff > > The idea is to make use of CPU-supported population count instructions > if available and the compiler supports them (i.e., clang), especially > for larger than 32-bit data. The patch also has use cases for the new > function (i.e., counting number of bits in IPv6 address[*]). > > Any objection? This is rather over-engineered for something that is used once or twice, but it is good to move the functions out of sys/systm.h where they weren't even usable outside the kernel. The implementation of the ffs() family has similar non-problems. No one cares because they are rarely used, but... cperciva@ wrote the lookup table part of the following set of implementations of ffs() and fls(). A lookup table is the best general method, though it can be beaten by unportable methodw in some cases. I added more methods and a test framework to this (test framework not always included). The methods are: - gcc builtin - cpufunc inline - ilog inline for fls only (repeat some internals of ilogb(3)) - ilogb library for fls only (just use the standard function ilogb()). This and the last method might not handle all cases correctly (especially when FPU traps are enabled). Otherwise, they are much better than the loops in the portable implementation in libc, up to 53 bits. 64 bits are also easy if the system has real long doubles. - ilogbm0 inline (alternative of previous) - library ffs()/fls() - alternative library fls() - lookup table for ffs() only (orginal cperciva@) method). @ #include @ @ #if defined(CPUFUNC_FFS) || defined(CPUFUNC_FLS) @ #define _KERNEL /* now needed to get non-builtin */ @ #include @ /* MI library versions: */ @ #elif ILOGBM0_FLS @ #include "/usr/src/lib/msun/src/s_ilogb.c" @ #elif LIBMET0_FLS @ #include "/usr/src/lib/libc/string/ffs.c" @ #elif LIBMET0_FFS @ #include "/usr/src/lib/libc/string/fls.c" @ #endif @ @ #include @ #include @ #include @ @ static int ffslut32[32]; @ @ static inline int @ ilog_fls(int x) @ { @ union { @ float fvalue; @ uint32_t uvalue; @ } u; @ @ if (x <= 0) @ return (x == 0 ? 0 : 32); @ u.fvalue = x; @ return ((u.uvalue >> 23) - 127 + 1); This is hacked too much for efficiency. Conversion to float can't overflow for 32-bit ints, but it gives rounding problems. I now know more about what made this slow using doubles and how to avoid the slowness using integer SSE packing instructions. Unfortunately, compilers don't understand this. (amd64 will use FP SSE and not be slow). As mentioned above, this is portable up to 53 bits, so on 32-bit systems it may be extendable to work on 64-bit integers or arrays of uint32_t and be signficantly faster than hardware 32-bit instructions then. But repacking for this would be painful. @ } @ @ static inline int @ ilogb_fls(int x) @ { @ return (x == 0 ? 0 : ilogbf(x) + 1); @ } @ @ static inline int @ lut_ffs(int mask) @ { @ return (mask ? ffslut32[(((mask & (-mask)) * 0x0FB9AC52) >> 27) + 16] : 0); @ } The lookup table method is not the trivial lookup 8 bits at a time followed by large code to select the first nonzero set of 8 bits. It magically reduces from 32 bits to 5 for a single lookup. popcount() is easier with a trivial method (just add up the results). I don't know if it can be reduced to a single 5-bit lookup. @ @ int z[4096]; @ @ main() @ { @ volatile int v; @ int i, j; @ @ for (j = 0; j < 32; j++) { @ i = 1 << j; @ ffslut32[(((i & -i) * 0x0FB9AC52) >> 27) + 16] = j + 1; @ } @ for (i = 0; i < 4096; i++) @ #ifdef UNIFORM @ z[i] = random(); @ #elif RANDBIT @ z[i] = 1 << random(); /* Yes, this is sloppy. */ @ #elif ALLZERO @ continue; @ #elif ALLONE_ @ z[i] = 1; @ #elif ALLLAST @ z[i] = 1 << 31; /* More sloppiness. */ @ #else @ #error "No distribution" @ continue; @ #endif @ for (j = 0; j < 10000; j++) @ for (i = 0; i < 4096; i++) @ #ifdef BUILTIN_FFS @ v = __builtin_ffs(z[i]); @ #elif CPUFUNC_FFS @ v = ffs(z[i]); @ #elif CPUFUNC_FLS @ v = fls(z[i]); @ #elif ILOGMET_FLS @ v = ilog_fls(z[i]); @ #elif ILOGBME_FLS @ v = ilogb_fls(z[i]); @ #elif ILOGBM0_FLS @ v = ilogb_fls(z[i]); @ #elif LIBMET0_FFS @ v = ffs(z[i]); @ #elif LIBMET0_FLS @ v = fls(z[i]); @ #elif LIBMETH_FFS @ v = ffs(z[i]); @ #elif LIBMETH_FLS @ v = fls(z[i]); @ #elif LUTMETH_FFS @ v = lut_ffs(z[i]); @ #else @ #error "No method" @ ; @ #endif @ #ifdef PRINTLUT @ for (j = 0; j < 32; j++) { @ i = 1 << j; @ printf("%5d: %d %d\n", j, lut_ffs(j), @ (((i & -i) * 0x0FB9AC52) >> 27) + 16); @ #endif @ #ifdef TEST @ for (j = 0; ; j++) { @ if (lut_ffs(j) != __builtin_ffs(j)) @ fprintf(stderr, "lut_ffs(%d) failed: %d %d\n", @ j, lut_ffs(j), __builtin_ffs(j)); @ if (ilog_fls(j) != fls(j)) @ fprintf(stderr, "ilog_fls(%d) failed: %d %d\n", @ j, ilogb_fls(j), fls(j)); @ if (j + 1 == 0) @ break; @ } @ #endif @ } In a game program, I use simple lookups of 13-bit ints to count cards and find the highest card in 13-bit sets of suits of cards. This works better than any hardware method (especially when there is none, but I would have used gcc builtin fls() if it were faster, or maybe I didn't care since the program has lots of tables and one more to give a fast portable fls() was simplest). I only noticed the following in your version: - including is a style bug because sys/types.h includes it - you removed the silly special inline for bitcount16() but still implement bitcount16(), presumably for compatibiltiy. But it is incompatible since the old version has the bogus return type uint16_t. It seems to be used just twice: in intel_sdvo.c, it is blindly converted to bool in 1 place and to unsigned int in 1 place. Both of these uses are for device initialization so their efficiency is especially uniportant. - bitcount32() is also incompatible since the old version has the slightly less bogus return type uint32_t. At least the args have correct type, unlike the ffs() family. Bruce