From owner-freebsd-arch@FreeBSD.ORG  Thu Nov 15 08:33:02 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
 by hub.freebsd.org (Postfix) with ESMTP id 2B084B23;
 Thu, 15 Nov 2012 08:33:02 +0000 (UTC)
 (envelope-from brde@optusnet.com.au)
Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au
 [211.29.132.190])
 by mx1.freebsd.org (Postfix) with ESMTP id 8AD298FC16;
 Thu, 15 Nov 2012 08:33:01 +0000 (UTC)
Received: from c122-106-175-26.carlnfd1.nsw.optusnet.com.au
 (c122-106-175-26.carlnfd1.nsw.optusnet.com.au [122.106.175.26])
 by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id qAF8WoBQ025930
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
 Thu, 15 Nov 2012 19:32:51 +1100
Date: Thu, 15 Nov 2012 19:32:50 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Jung-uk Kim <jkim@FreeBSD.org>
Subject: Re: [RFC] Generic population count function
In-Reply-To: <50A43B52.8030102@FreeBSD.org>
Message-ID: <20121115181009.D42162@besplex.bde.org>
References: <50A43B52.8030102@FreeBSD.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
X-Optus-Cloudmark-Score: 0
X-Optus-Cloudmark-Analysis: v=2.0 cv=eqGHVfVX c=1 sm=1 a=nz2LO5XVLtEA:10
 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=hMQG39urcAQA:10
 a=6I5d2MoRAAAA:8 a=Y1UM5pQ7PpipOtpjvNUA:9 a=CjuIK1q_8ugA:10
 a=3M_0ICQ7WrFIE0d4:21 a=Ei4GSbzkcvkxRff1:21 a=bxQHXO5Py4tHmhUgaywp5w==:117
Cc: freebsd-arch@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 15 Nov 2012 08:33:02 -0000

On Wed, 14 Nov 2012, Jung-uk Kim wrote:

> I implemented generic population count function.  Please see the
> attachment.  It is also available from here:
>
> http://people.freebsd.org/~jkim/bitcount.diff
>
> The idea is to make use of CPU-supported population count instructions
> if available and the compiler supports them (i.e., clang), especially
> for larger than 32-bit data.  The patch also has use cases for the new
> function (i.e., counting number of bits in IPv6 address[*]).
>
> Any objection?

This is rather over-engineered for something that is used once or twice,
but it is good to move the functions out of sys/systm.h where they weren't
even usable outside the kernel.

The implementation of the ffs() family has similar non-problems.  No
one cares because they are rarely used, but...  cperciva@ wrote the
lookup table part of the following set of implementations of ffs()
and fls().  A lookup table is the best general method, though it can
be beaten by unportable methodw in some cases.  I added more methods and
a test framework to this (test framework not always included).  The
methods are:
- gcc builtin
- cpufunc inline
- ilog inline for fls only (repeat some internals of ilogb(3))
- ilogb library for fls only (just use the standard function ilogb()).
   This and the last method might not handle all cases correctly
   (especially when FPU traps are enabled).  Otherwise, they are much
   better than the loops in the portable implementation in libc, up to
   53 bits.  64 bits are also easy if the system has real long doubles.
- ilogbm0 inline (alternative of previous)
- library ffs()/fls()
- alternative library fls()
- lookup table for ffs() only (orginal cperciva@) method).

@ #include <sys/types.h>
@ 
@ #if defined(CPUFUNC_FFS) || defined(CPUFUNC_FLS)
@ #define	_KERNEL				/* now needed to get non-builtin */
@ #include <machine/cpufunc.h>
@ /* MI library versions: */
@ #elif ILOGBM0_FLS
@ #include "/usr/src/lib/msun/src/s_ilogb.c"
@ #elif LIBMET0_FLS
@ #include "/usr/src/lib/libc/string/ffs.c"
@ #elif LIBMET0_FFS
@ #include "/usr/src/lib/libc/string/fls.c"
@ #endif
@ 
@ #include <math.h>
@ #include <stdio.h>
@ #include <stdlib.h>
@ 
@ static int ffslut32[32];
@ 
@ static inline int
@ ilog_fls(int x)
@ {
@ 	union {
@ 		float	fvalue;
@ 		uint32_t uvalue;
@ 	} u;
@ 
@ 	if (x <= 0)
@ 		return (x == 0 ? 0 : 32);
@ 	u.fvalue = x;
@ 	return ((u.uvalue >> 23) - 127 + 1);

This is hacked too much for efficiency.  Conversion to float can't overflow
for 32-bit ints, but it gives rounding problems.  I now know more about
what made this slow using doubles and how to avoid the slowness using
integer SSE packing instructions.  Unfortunately, compilers don't understand
this.  (amd64 will use FP SSE and not be slow).  As mentioned above, this
is portable up to 53 bits, so on 32-bit systems it may be extendable to
work on 64-bit integers or arrays of uint32_t and be signficantly faster
than hardware 32-bit instructions then.  But repacking for this would be
painful.

@ }
@ 
@ static inline int
@ ilogb_fls(int x)
@ {
@ 	return (x == 0 ? 0 : ilogbf(x) + 1);
@ }
@ 
@ static inline int
@ lut_ffs(int mask)
@ {
@ 	return (mask ? ffslut32[(((mask & (-mask)) * 0x0FB9AC52) >> 27) + 16] : 0);
@ }

The lookup table method is not the trivial lookup 8 bits at a time followed
by large code to select the first nonzero set of 8 bits.  It magically
reduces from 32 bits to 5 for a single lookup.  popcount() is easier with
a trivial method (just add up the results).  I don't know if it can be
reduced to a single 5-bit lookup.

@ 
@ int z[4096];
@ 
@ main()
@ {
@ 	volatile int v;
@ 	int i, j;
@ 
@ 	for (j = 0; j < 32; j++) {
@ 		i = 1 << j;
@ 		ffslut32[(((i & -i) * 0x0FB9AC52) >> 27) + 16] = j + 1;
@ 	}
@ 	for (i = 0; i < 4096; i++)
@ #ifdef UNIFORM
@ 		z[i] = random();
@ #elif RANDBIT
@ 		z[i] = 1 << random();	/* Yes, this is sloppy. */
@ #elif ALLZERO
@ 		continue;
@ #elif ALLONE_
@ 		z[i] = 1;
@ #elif ALLLAST
@ 		z[i] = 1 << 31;		/* More sloppiness. */
@ #else
@ #error "No distribution"
@ 		continue;
@ #endif
@ 	for (j = 0; j < 10000; j++)
@ 		for (i = 0; i < 4096; i++)
@ #ifdef BUILTIN_FFS
@ 			v = __builtin_ffs(z[i]);
@ #elif CPUFUNC_FFS
@ 			v = ffs(z[i]);
@ #elif CPUFUNC_FLS
@ 			v = fls(z[i]);
@ #elif ILOGMET_FLS
@ 			v = ilog_fls(z[i]);
@ #elif ILOGBME_FLS
@ 			v = ilogb_fls(z[i]);
@ #elif ILOGBM0_FLS
@ 			v = ilogb_fls(z[i]);
@ #elif LIBMET0_FFS
@ 			v = ffs(z[i]);
@ #elif LIBMET0_FLS
@ 			v = fls(z[i]);
@ #elif LIBMETH_FFS
@ 			v = ffs(z[i]);
@ #elif LIBMETH_FLS
@ 			v = fls(z[i]);
@ #elif LUTMETH_FFS
@ 			v = lut_ffs(z[i]);
@ #else
@ #error "No method"
@ 			;
@ #endif
@ #ifdef PRINTLUT
@ 	for (j = 0; j < 32; j++) {
@ 		i = 1 << j;
@ 		printf("%5d: %d %d\n", j, lut_ffs(j),
@ 		    (((i & -i) * 0x0FB9AC52) >> 27) + 16);
@ #endif
@ #ifdef TEST
@ 	for (j = 0; ; j++) {
@ 		if (lut_ffs(j) != __builtin_ffs(j))
@ 			fprintf(stderr, "lut_ffs(%d) failed: %d %d\n",
@ 			    j, lut_ffs(j), __builtin_ffs(j));
@ 		if (ilog_fls(j) != fls(j))
@ 			fprintf(stderr, "ilog_fls(%d) failed: %d %d\n",
@ 			    j, ilogb_fls(j), fls(j));
@ 		if (j + 1 == 0)
@ 			break;
@ 	}
@ #endif
@ }

In a game program, I use simple lookups of 13-bit ints to count cards
and find the highest card in 13-bit sets of suits of cards.  This works
better than any hardware method (especially when there is none, but I
would have used gcc builtin fls() if it were faster, or maybe I didn't
care since the program has lots of tables and one more to give a fast
portable fls() was simplest).

I only noticed the following in your version:
- including <sys/cdefs.h> is a style bug because sys/types.h includes it
- you removed the silly special inline for bitcount16() but still
   implement bitcount16(), presumably for compatibiltiy.  But it is
   incompatible since the old version has the bogus return type uint16_t.
   It seems to be used just twice: in intel_sdvo.c, it is blindly converted
   to bool in 1 place and to unsigned int in 1 place.  Both of these uses
   are for device initialization so their efficiency is especially
   uniportant.
- bitcount32() is also incompatible since the old version has the slightly
   less bogus return type uint32_t.  At least the args have correct type,
   unlike the ffs() family.

Bruce