Date:      Fri, 16 Nov 2012 20:43:41 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Jung-uk Kim <jkim@freebsd.org>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: [RFC] Generic population count function
Message-ID:  <20121116181327.B1135@besplex.bde.org>
In-Reply-To: <50A57279.3040607@FreeBSD.org>
References:  <50A43B52.8030102@FreeBSD.org> <20121115181009.D42162@besplex.bde.org> <50A57279.3040607@FreeBSD.org>

On Thu, 15 Nov 2012, Jung-uk Kim wrote:

> On 2012-11-15 03:32:50 -0500, Bruce Evans wrote:
>> The implementation of the ffs() family has similar non-problems.
>> No one cares because they are rarely used, but...  cperciva@ wrote
>> the lookup table part of the following set of implementations of
>> ffs() and fls().  A lookup table is the best general method, though
> it can be beaten by unportable methods in some cases.  I added more
> methods and a test framework to this (test framework not always
> included).  The methods are:
> - gcc builtin
> - cpufunc inline
>
> Talking about cpufunc ffsl(), I found something strange about clang:

My old optimization is probably wrong on amd64 for this anyway, since
all amd64 CPUs have cmove, so there is no poor branching in the gcc
builtin to avoid.  (I should also check this on i386 -- maybe gcc's
branching is best if the CPU has a branch predictor.  No x86 CPUs
had branch prediction when my old optimization was implemented, except
that the i486 always predicts in a fixed way.)

amd64 cpufunc.h changed ffs() to use __builtin_ffs, but for some reason
it still uses my old optimization for ffsl().
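
(From memory, the ffs() there is now roughly just a wrapper for the
builtin:

@ static __inline int
@ ffs(int mask)
@ {
@
@ 	return (__builtin_ffs(mask));
@ }

while ffsl() is still the bsfq()-based expression quoted below.)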

> % cat ffs.c
> #include <sys/types.h>
> #include <machine/cpufunc.h>
> int ffs_cpufunc(long x) /* copied from cpufunc.h as is */
> { return (x == 0 ? x : (int)bsfq((u_long)x) + 1); }
> int ffs_builtin(long x)
> { return (__builtin_ffsl(x)); }
>
> % clang -O2 -c ffs.c
> % objdump -d ffs.o
>
> ffs.o:     file format elf64-x86-64-freebsd
>
> Disassembly of section .text:
>
> 0000000000000000 <ffs_cpufunc>:
>   0:   48 85 ff                test   %rdi,%rdi
>   3:   74 21                   je     26 <ffs_cpufunc+0x26>
>   5:   48 89 7c 24 f8          mov    %rdi,-0x8(%rsp)
>   a:   48 0f bc 4c 24 f8       bsf    -0x8(%rsp),%rcx

This seems to be the i386 ffs() cpufunc, since it only does a 32-bit bsf.
I get a 64-bit bsf, but similarly strange memory operations and shifting
(see below).

>  10:   48 c1 e1 20             shl    $0x20,%rcx
>  14:   48 b8 00 00 00 00 01    mov    $0x100000000,%rax
>  1b:   00 00 00
>  1e:   48 01 c8                add    %rcx,%rax
>  21:   48 c1 e8 20             shr    $0x20,%rax
>  25:   c3                      retq
>  26:   31 c0                   xor    %eax,%eax
>  28:   c3                      retq
>  29:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
>
> 0000000000000030 <ffs_builtin>:
>  30:   48 0f bc cf             bsf    %rdi,%rcx
>  34:   ff c1                   inc    %ecx
>  36:   31 c0                   xor    %eax,%eax
>  38:   48 85 ff                test   %rdi,%rdi
>  3b:   0f 45 c1                cmovne %ecx,%eax
>  3e:   c3                      retq
>
> Note clang generates very inefficient code for ffsl() from cpufunc.h.
> It is using memory, setting an unused bit, etc. :-(

Even the builtin takes 1 more instruction than the best gcc one.

I get the following (clang assembly output) for the cpufunc version:

@ 	.type	g,@function
@ g:                                      # @g
@ 	.cfi_startproc
@ # BB#0:                                 # %entry
@ 	testq	%rdi, %rdi
@ 	je	.LBB1_1
@ # BB#2:                                 # %cond.false.i
@ 	movq	%rdi, -8(%rsp)
@ 	#APP
@ 	bsfq -8(%rsp),%rcx

Bogus memory operations.

@ 	#NO_APP
@ 	shlq	$32, %rcx
@ 	movabsq	$4294967296, %rax       # imm = 0x100000000

Even stranger.

BTW, this assembler output is low-quality in a different way than the
objdump disassembly above.  The annotation helps.  gcc would only print
the decimal value, which is fairly easy to decrypt to the shifted 1 bit
here, but usually isn't.

@ 	addq	%rcx, %rax
@ 	shrq	$32, %rax
@                                         # kill: EAX<def> EAX<kill> RAX<kill>

The annotation gives a hint about the strange shifting.  It seems to be
something to do with partial register conversions.

@ 	ret
@ .LBB1_1:
@ 	xorl	%eax, %eax
@                                         # kill: EAX<def> EAX<kill> RAX<kill>
@ 	ret
@ .Ltmp1:
@ 	.size	g, .Ltmp1-g
@ 	.cfi_endproc

This was easy to fix, though I don't understand the fix :-).  First,
bsfq() bogusly returns u_long instead of int (this may be my fault for
returning u_int in i386 bsf().  The hardware generates an unsigned
value, but we don't want that in the API).  bsfq() returns a result
in rax instead of in eax, and for some reason this confuses clang into
generating the strange shifts.  Apparently it is trying to handle
partial register or ABI problems.  When we return an int in eax, are
we required to do anything with the top bits in rax?  I know that
some operations clear them automagically, but not what happens for
incl.  Maybe incl does the right thing but does it slowly if the
partial register is not handled properly, and the shifting is there to
handle it properly.  Anyway, after changing bsfq() to return int
(using a bad type/hardware pun to pretend that bsfq only sets the
low bits), the generated code becomes:

@ 	.type	g,@function
@ g:                                      # @g
@ 	.cfi_startproc
@ # BB#0:                                 # %entry
@ 	testq	%rdi, %rdi
@ 	je	.LBB1_1
@ # BB#2:                                 # %cond.false.i
@ 	movq	%rdi, -8(%rsp)
@ 	#APP
@ 	bsfq -8(%rsp),%eax
@ 	#NO_APP
@ 	incl	%eax
@ 	ret
@ .LBB1_1:
@ 	xorl	%eax, %eax
@ 	ret
@ .Ltmp1:
@ 	.size	g, .Ltmp1-g
@ 	.cfi_endproc
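
For reference, the unchanged bsfq() is roughly this (from memory; the
point is that both the return type and `result' are u_long):

@ static __inline u_long
@ bsfq(u_long mask)
@ {
@ 	u_long	result;
@
@ 	__asm __volatile("bsfq %1,%0" : "=r" (result) : "rm" (mask));
@ 	return (result);
@ }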

Here is the bad type/hardware pun:

@ static __inline int
@ bsfq(u_long mask)
@ {
@ 	int	result;
@ 
@ 	__asm __volatile("bsfq %1,%0" : "=r" (result) : "rm" (mask));
@ 	return (result);
@ }

I just changed the return type to int, and the type of `result' to int.
But the latter is not really right, since bsfq really produces a 64-bit
value.  If there are any partial register problems, then I would have
expected clang to do better knowing that the value is 64 bits.  ffsl()
casts the value returned by bsfq() to int, so that the addition is
only a 32-bit one.  clang cannot know that the upper 33 bits are always
0 so that the cast cannot overflow and the behaviour is defined.  It
seems to be handling the undefined case in an unusual way.  Suppose that
bsfq() returns (1 << 31) so that the behaviour is in fact undefined.
Then the usual undefined behaviour is for casting it to int to produce
the 32-bit INT_MIN, and then adding 1 to produce (INT_MIN + 1).  clang's
fancy shifting code instead first shifts to (1 << 63), then adds (1 << 32),
then shifts back to give the same result except now with the upper bits
all 0, while (INT_MIN + 1) would have the upper bits all 1 if it were
promoted to 64 bits or if the upper bits were automatically set to 1
to sign-extend the result of incl (does this happen?).
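
Here is a quick standalone demo of that arithmetic (just illustrative;
it pretends bsfq() returned (1 << 31), which it never does):

@ #include <stdio.h>
@
@ int
@ main(void)
@ {
@ 	unsigned long x, shifted;
@ 	long usual;
@
@ 	x = 1UL << 31;			/* pretend this came from bsfq() */
@ 	usual = (long)((int)x + 1);	/* cast to int (INT_MIN here), add 1 */
@ 	shifted = ((x << 32) + (1UL << 32)) >> 32;  /* clang's shl/add/shr */
@ 	printf("%lx %lx\n", (unsigned long)usual, shifted);
@ 	/* prints "ffffffff80000001 80000001": same low 32 bits, */
@ 	/* different upper bits */
@ 	return (0);
@ }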

If the type of `result' is left as u_long so that there is no type pun
in the asm, but the function type is changed to int, then the strange
shifts are still produced.  This is unsurprising, since this just
doubles the conversion to int.  It is more surprising that if we do
the addition of 1 on u_longs, then clang still produces the strange
shifts -- it is careful not to use incl to add 1 even in this case,
but shifts so that it can add (1 << 32).
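
The u_long-addition variant is roughly this (my paraphrase; the exact
spelling may have differed slightly):

@ static __inline int
@ ffsl(long mask)
@ {
@ 	/* the addition of 1 is now done on u_longs; cast to int last */
@ 	return (mask == 0 ? mask : (int)(bsfq((u_long)mask) + 1));
@ }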

clang doesn't produce the strange shifts if bsfq() is replaced by a
non-inline function that returns u_long.  It still shows the "kill EAX"
annotation then.

Another thing that works is changing the return expression from
(mask == 0 ? mask : ...) to (mask == 0 ? 0 : ...).  Apparently clang
doesn't properly optimize away the promotion of '(int)bsfq(...) + 1'
to long to match the type of `mask'.  Replacing the 0 by 0L confuses
it again.  Here is a cut-down example:

@ unsigned long foo(void);
@ 
@ int
@ bar(unsigned long mask)
@ {
@ 	return (mask == 0 ? 0L : (int)foo() + 1);
@ }

Now the further cutting down of removing the (int) cast makes the problem
go away.  So fairly ordinary code is affected by this.  Maybe I was wrong
about it having no effect in the otherwise-not-cut-down example.
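
That is, dropping just the cast gives something like this, which doesn't
get the strange shifts:

@ unsigned long foo(void);
@
@ int
@ bar(unsigned long mask)
@ {
@ 	return (mask == 0 ? 0L : foo() + 1);
@ }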

The memory operations are obviously possible because the constraint in
the asm allows them, but why does clang prefer them when the arg starts
in a register?  Removing the allowance would pessimize cases where the
arg is not in a register.
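
Removing the allowance would look something like this (untested sketch):

@ static __inline int
@ bsfq(u_long mask)
@ {
@ 	int	result;
@
@ 	/* "r" instead of "rm": the operand must be in a register */
@ 	__asm __volatile("bsfq %1,%0" : "=r" (result) : "r" (mask));
@ 	return (result);
@ }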

> Both GCC 4.2 and 4.8 generate "good code", of course:
>
> % gcc -O2 -c ffs.c
> % objdump -d ffs.o
>
> ffs.o:     file format elf64-x86-64-freebsd
>
> Disassembly of section .text:
>
> 0000000000000000 <ffs_cpufunc>:
>   0:   31 c0                   xor    %eax,%eax
>   2:   48 85 ff                test   %rdi,%rdi
>   5:   74 07                   je     e <ffs_cpufunc+0xe>
>   7:   48 0f bc c7             bsf    %rdi,%rax
>   b:   83 c0 01                add    $0x1,%eax
>   e:   f3 c3                   repz retq

je takes 1 cycle for the usual case of branch not taken on i486, so this
is good for i486.

With branch prediction, IIRC the initial prediction for cold branches
is "taken" for forward branches (so that if (foo) branches over the
presumed-special case for foo).  The above is backwards for that.

I thought that gcc generated a branch with order something like the
reverse of the above, but now I can't find any version of gcc that
generates a branch for __builtin_ffs().  gcc-3.3.3 -march=i386 generates
a setcc and 2 more instructions than the add after the bsf.  setcc is
not as good as cmove, and on i386 it takes 4/5 cycles, and on i486 it
takes 4/3 cycles.  The branch must be much better on i486, though I
don't really believe its official timing of 1/3 for not-taken/taken.

> 0000000000000010 <ffs_builtin>:
>  10:   48 0f bc ff             bsf    %rdi,%rdi
>  14:   48 c7 c0 ff ff ff ff    mov    $0xffffffffffffffff,%rax
>  1b:   48 0f 44 f8             cmove  %rax,%rdi
>  1f:   48 83 c7 01             add    $0x1,%rdi
>  23:   89 f8                   mov    %edi,%eax
>  25:   c3                      retq

This is not as good as the gcc48 version: it has 1 more instruction,
to compensate for building the result in the wrong place (%edi instead
of %eax).  But I get the gcc48 code from gcc-4.2.1 too.

> % gcc48 -O2 -c ffs.c
> % objdump -d ffs.o
>
> ffs.o:     file format elf64-x86-64-freebsd
>
> Disassembly of section .text:
>
> 0000000000000000 <ffs_cpufunc>:
>   0:   31 c0                   xor    %eax,%eax
>   2:   48 85 ff                test   %rdi,%rdi
>   5:   75 09                   jne    10 <ffs_cpufunc+0x10>
>   7:   f3 c3                   repz retq
>   9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
>  10:   48 0f bc c7             bsf    %rdi,%rax
>  14:   83 c0 01                add    $0x1,%eax
>  17:   c3                      retq
>  18:   0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>  1f:   00

This does a lot of padding for the branch target.  Another reason to
avoid branches.  Here the branching and the padding for it are just
bad, since we expect the branch to not be taken.  gcc48 is guessing
the branch probabilities backwards (see below).

Ignore some of what I said about organizing the branch for best
prediction.  I was thinking that the branch was in the asm so that gcc
couldn't organize it.  But it is in the C code, and gcc is organizing
it strangely.  I tried adding __predict_true() and __predict_false().
It is __predict_false() that gives the current organization, for both
gcc-4.2.1 and clang.  This shows that what I said above about the
default for cold branches is backwards, and that the organization
without any __predict_foo() is correct for gcc-4.2.1 and clang but
backwards for gcc48.
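
The hints were spelled something like this (paraphrasing; __predict_foo()
comes from <sys/cdefs.h>, and __predict_true(mask != 0) is the other
variant):

@ static __inline int
@ ffsl(long mask)
@ {
@ 	/* hint that the zero case is the unlikely one */
@ 	return (__predict_false(mask == 0) ? 0 : (int)bsfq((u_long)mask) + 1);
@ }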

> 0000000000000020 <ffs_builtin>:
>  20:   48 0f bc c7             bsf    %rdi,%rax
>  24:   48 c7 c2 ff ff ff ff    mov    $0xffffffffffffffff,%rdx
>  2b:   48 0f 44 c2             cmove  %rdx,%rax
>  2f:   48 83 c0 01             add    $0x1,%rax
>  33:   c3                      retq

Bruce


