Date:      Tue, 21 Oct 2014 23:05:58 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        David Chisnall <theraven@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, "Alexander V. Chernikov" <melifaro@freebsd.org>, Andriy Gapon <avg@freebsd.org>
Subject:   Re: svn commit: r273274 - head/sys/netpfil/ipfw
Message-ID:  <20141021203956.C1228@besplex.bde.org>
In-Reply-To: <6FEB1269-2A8D-48A7-A18E-2EAB9961EDA6@FreeBSD.org>
References:  <201410191115.s9JBFJxA058370@svn.freebsd.org> <5443A83F.5090807@FreeBSD.org> <6FEB1269-2A8D-48A7-A18E-2EAB9961EDA6@FreeBSD.org>

On Tue, 21 Oct 2014, David Chisnall wrote:

> On 19 Oct 2014, at 13:02, Andriy Gapon <avg@FreeBSD.org> wrote:
>
>> I think that on platforms where an optimized version of fls() is available that
>> would work faster than this cool piece of bit magic.

Even a lightly optimized naive linear search might work faster.  The
fancy method has about 7 dependent operations, while the naive method
has about 32 independent operations.
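
For concreteness, one shape such a naive search could take (a sketch,
not the libc code; each iteration tests a different bit, so the tests
do not depend on each other):

        static int
        naive_fls(unsigned int mask)
        {
                int bit;

                /* Scan from the top; the first set bit is the answer. */
                for (bit = 32; bit > 0; bit--)
                        if (mask & (1U << (bit - 1)))
                                return (bit);
                return (0);
        }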

> If you're lucky, the compiler's idiom recogniser will spot this.  You're generally better off using the builtins though, because then the compiler will expand them to something sensible (hopefully - old versions of gcc did horribly inefficient things for bswap and clz on platforms without native support).

No one cared when cperciva optimized fls() using a lookup table and I
investigated 11 different methods to show that neither the builtins nor
the lookup table were especially good (or bad).  On i386, the best
portable method is to use ilogb() (floating point).  The next best
is to use floating point directly.  These methods are not the fastest,
except probably for 64 bits on i386, but they are quite fast and fairly
portable.  Not all systems have fls() in libc, and FreeBSD has only
pessimized implementations there.

I just ran my benchmark on ref11-i386 and got the following:

% UNIFORM/BUILTIN_FFS:         0.08 real         0.08 user         0.00 sys
% UNIFORM/CPUFUNC_FFS:         0.12 real         0.11 user         0.00 sys
% UNIFORM/LIBMET0_FFS:         0.11 real         0.09 user         0.00 sys
% UNIFORM/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys
% UNIFORM/LUTMETH_FFS:         0.10 real         0.09 user         0.00 sys
% UNIFORM/CPUFUNC_FLS:         0.12 real         0.10 user         0.01 sys
% UNIFORM/ILOGMET_FLS:         0.14 real         0.13 user         0.00 sys
% UNIFORM/ILOGBME_FLS:         0.23 real         0.23 user         0.00 sys
% UNIFORM/ILOGBM0_FLS:         0.23 real         0.23 user         0.00 sys
% UNIFORM/LIBMET0_FLS:         1.63 real         1.61 user         0.00 sys
% UNIFORM/LIBMETH_FLS:         1.61 real         1.60 user         0.00 sys

Several bit patterns are tested, to try to see if there are cache effects.

The builtin is now clearly fastest.  (On old Athlon64 in 32-bit mode with
gcc, it loses to floating point:

@ UNIFORM/BUILTIN_FFS:         0.22 real         0.22 user         0.00 sys
@ UNIFORM/ILOGMET_FLS:         0.17 real         0.17 user         0.00 sys),

but even in the benchmark that spends 100% of its time doing ffs(),
the speedup from using the builtin vs the next best is only 20%.  Only
the pessimized libc methods are very slow.

Programs like ministat are intentionally not used.  It takes long enough
to read the results of 55 simple tests.  44 more follow:

% RANDBIT/BUILTIN_FFS:         0.08 real         0.08 user         0.00 sys
% RANDBIT/CPUFUNC_FFS:         0.12 real         0.12 user         0.00 sys
% RANDBIT/LIBMET0_FFS:         0.11 real         0.10 user         0.00 sys
% RANDBIT/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys
% RANDBIT/LUTMETH_FFS:         0.10 real         0.09 user         0.00 sys
% RANDBIT/CPUFUNC_FLS:         0.12 real         0.11 user         0.00 sys
% RANDBIT/ILOGMET_FLS:         0.14 real         0.13 user         0.00 sys
% RANDBIT/ILOGBME_FLS:         0.23 real         0.23 user         0.00 sys
% RANDBIT/ILOGBM0_FLS:         0.23 real         0.23 user         0.00 sys
% RANDBIT/LIBMET0_FLS:         1.25 real         1.24 user         0.00 sys
% RANDBIT/LIBMETH_FLS:         1.24 real         1.24 user         0.00 sys
% ALLZERO/BUILTIN_FFS:         0.08 real         0.07 user         0.00 sys
% ALLZERO/CPUFUNC_FFS:         0.05 real         0.04 user         0.00 sys
% ALLZERO/LIBMET0_FFS:         0.11 real         0.10 user         0.00 sys
% ALLZERO/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys
% ALLZERO/LUTMETH_FFS:         0.05 real         0.04 user         0.00 sys
% ALLZERO/CPUFUNC_FLS:         0.05 real         0.04 user         0.00 sys
% ALLZERO/ILOGMET_FLS:         0.07 real         0.06 user         0.00 sys
% ALLZERO/ILOGBME_FLS:         0.05 real         0.03 user         0.00 sys
% ALLZERO/ILOGBM0_FLS:         0.05 real         0.04 user         0.00 sys
% ALLZERO/LIBMET0_FLS:         0.18 real         0.17 user         0.00 sys
% ALLZERO/LIBMETH_FLS:         0.20 real         0.20 user         0.00 sys
% ALLONE_/BUILTIN_FFS:         0.08 real         0.08 user         0.00 sys
% ALLONE_/CPUFUNC_FFS:         0.12 real         0.12 user         0.00 sys
% ALLONE_/LIBMET0_FFS:         0.11 real         0.10 user         0.00 sys
% ALLONE_/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys
% ALLONE_/LUTMETH_FFS:         0.10 real         0.09 user         0.00 sys
% ALLONE_/CPUFUNC_FLS:         0.12 real         0.12 user         0.00 sys
% ALLONE_/ILOGMET_FLS:         0.10 real         0.09 user         0.00 sys
% ALLONE_/ILOGBME_FLS:         0.23 real         0.23 user         0.00 sys
% ALLONE_/ILOGBM0_FLS:         0.23 real         0.23 user         0.00 sys
% ALLONE_/LIBMET0_FLS:         0.20 real         0.19 user         0.00 sys
% ALLONE_/LIBMETH_FLS:         0.20 real         0.19 user         0.00 sys
% ALLLAST/BUILTIN_FFS:         0.08 real         0.08 user         0.00 sys
% ALLLAST/CPUFUNC_FFS:         0.13 real         0.11 user         0.00 sys
% ALLLAST/LIBMET0_FFS:         0.11 real         0.10 user         0.00 sys
% ALLLAST/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys
% ALLLAST/LUTMETH_FFS:         0.10 real         0.09 user         0.00 sys
% ALLLAST/CPUFUNC_FLS:         0.13 real         0.12 user         0.00 sys
% ALLLAST/ILOGMET_FLS:         0.07 real         0.06 user         0.00 sys
% ALLLAST/ILOGBME_FLS:         0.23 real         0.23 user         0.00 sys
% ALLLAST/ILOGBM0_FLS:         0.23 real         0.23 user         0.00 sys
% ALLLAST/LIBMET0_FLS:         1.57 real         1.55 user         0.00 sys
% ALLLAST/LIBMETH_FLS:         1.57 real         1.56 user         0.00 sys

Part of the program with the methods:

% #if defined(CPUFUNC_FFS) || defined(CPUFUNC_FLS)
% #define	_KERNEL				/* now needed to get non-builtin */
% #include <machine/cpufunc.h>
% /* MI library versions: */
% #elif ILOGBM0_FLS
% #include "/usr/src/lib/msun/src/s_ilogb.c"
% #elif LIBMET0_FFS
% #include "/usr/src/lib/libc/string/ffs.c"
% #elif LIBMET0_FLS
% #include "/usr/src/lib/libc/string/fls.c"
% #endif
% 
% static int ffslut32[32];
% 
% static inline int
% ilog_fls(int x)
% {
% 	union {
% 		float	fvalue;
% 		uint32_t uvalue;
% 	} u;
% 
% 	if (x <= 0)
% 		return (x == 0 ? 0 : 32);
% 	u.fvalue = x;
% 	return ((u.uvalue >> 23) - 127 + 1);
% }

This is like the libm version (fdlibm/FreeBSD ilogb()) with minor
optimizations and de-obfuscations.  It might be too sloppy to be
correct: the conversion to float rounds, so for example 0x01ffffff
(2^25 - 1) rounds up to 2^25 and the computed fls is 26 instead of 25.

% static inline int
% ilogb_fls(int x)
% {
% 	return (x == 0 ? 0 : ilogbf(x) + 1);
% }

The portable FP method.
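
A quick sanity check of the two FP methods (hypothetical harness,
assuming the two functions above are in scope): 8.0f has biased
exponent 130, so ilog_fls(8) gives 130 - 127 + 1 = 4 = fls(8), and
ilogbf(8) is 3, so ilogb_fls(8) gives 4 as well.

        #include <assert.h>
        #include <math.h>

        int
        main(void)
        {
                assert(ilog_fls(8) == 4);       /* 0x41000000 >> 23 == 130 */
                assert(ilogb_fls(8) == 4);      /* ilogbf(8) == 3 */
                assert(ilog_fls(0) == 0 && ilogb_fls(0) == 0);
                return (0);
        }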

% 
% static inline int
% lut_ffs(int mask)
% {
% 	return (mask ? ffslut32[(((mask & (-mask)) * 0x0FB9AC52) >> 27) + 16] : 0);
% }

cperciva's lookup table method.  It is not simply examining 8 nybbles
using a lookup table of size 16.
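
The excerpt doesn't show how ffslut32[] is filled.  Presumably the
initialization inverts the same index expression, along these lines (a
hypothetical initializer, relying on the same implementation-defined
signed shift as lut_ffs()):

        static void
        init_ffslut32(void)
        {
                int i, m;

                /* Map each isolated bit 1 << i back to ffs() == i + 1. */
                for (i = 0; i < 32; i++) {
                        m = 1 << i;
                        ffslut32[(((m & (-m)) * 0x0FB9AC52) >> 27) + 16] =
                            i + 1;
                }
        }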

% ...
% #ifdef BUILTIN_FFS
% 			v = __builtin_ffs(z[i]);
% #elif CPUFUNC_FFS
% 			v = ffs(z[i]);
% #elif CPUFUNC_FLS
% 			v = fls(z[i]);
% #elif ILOGMET_FLS
% 			v = ilog_fls(z[i]);
% #elif ILOGBME_FLS
% 			v = ilogb_fls(z[i]);
% #elif ILOGBM0_FLS
% 			v = ilogb_fls(z[i]);
% #elif LIBMET0_FFS
% 			v = ffs(z[i]);
% #elif LIBMET0_FLS
% 			v = fls(z[i]);
% #elif LIBMETH_FFS
% 			v = ffs(z[i]);
% #elif LIBMETH_FLS
% 			v = fls(z[i]);
% #elif LUTMETH_FFS
% 			v = lut_ffs(z[i]);
% #else
% #error "No method"
% 			;
% #endif

Further comments on the methods:

% RANDBIT/BUILTIN_FFS:         0.08 real         0.08 user         0.00 sys
% RANDBIT/CPUFUNC_FFS:         0.12 real         0.12 user         0.00 sys

CPUFUNC_* is the kernel version.  On amd64, it is #defined as the
builtin, but ffsll() (bletch) still uses old methods.  (Apart from
using the long long abomination, ffsll() and flsll() shouldn't exist
in the kernel.  They are used approximately once, in code that can be
written better without them.)

This is on i386.  i386 kernel ffs() still uses my 1994 de-pessimization
for gcc-2.  This is now slower than the builtin.  Even the current amd64
methods are similar except for ffs().  There is a problem converting the
result of bsf*/bsr* to ffs*()/fls*() efficiently.  gcc-2 used a method
with a badly-placed branch, and I changed FreeBSD to use a less badly-placed
branch.  Branches were much more important with no branch prediction or
caches in 1994.  Now they don't matter if they are correctly predicted.
The tests try to get them mispredicted but are probably defeated by the
prediction being too smart.

% RANDBIT/LIBMET0_FFS:         0.11 real         0.10 user         0.00 sys

LIBMET0 is #including the libc source.  It is a naive linear search,
but it is optimized well, though apparently not to the same code as the
builtin.

% RANDBIT/LIBMETH_FFS:         0.11 real         0.10 user         0.00 sys

LIBMETH is a function call to libc (statically linked, of course).
Other CFLAGS are -O -march=athlon-xp, which is not the right arch for
this machine.  clang already optimizes aggressively at -O, so -O2 would
make little difference.

% RANDBIT/LUTMETH_FFS:         0.10 real         0.09 user         0.00 sys

cperciva's lookup table method.  Only implemented for ffs().

% RANDBIT/CPUFUNC_FLS:         0.12 real         0.11 user         0.00 sys

Same time as CPUFUNC_FFS.  No test for BUILTIN_FLS since the kernel didn't
have it when this was written.
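
If a BUILTIN_FLS test were added, it would presumably be the usual
clz-based expression (a sketch; gcc and clang both provide the
builtin, but __builtin_clz(0) is undefined so 0 needs its own case):

        static inline int
        builtin_fls(int mask)
        {
                return (mask == 0 ?
                    0 : 32 - __builtin_clz((unsigned int)mask));
        }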

% RANDBIT/ILOGMET_FLS:         0.14 real         0.13 user         0.00 sys

The direct FP method is now not the fastest (it was on the old Athlon64).

% RANDBIT/ILOGBME_FLS:         0.23 real         0.23 user         0.00 sys

The function-call FP method never was the fastest.

% RANDBIT/ILOGBM0_FLS:         0.23 real         0.23 user         0.00 sys

The function-call FP method with the function code exposed in the same
compilation unit -- the exposure makes no difference.

% RANDBIT/LIBMET0_FLS:         1.25 real         1.24 user         0.00 sys
% RANDBIT/LIBMETH_FLS:         1.24 real         1.24 user         0.00 sys

As for the LIBMET*_FFS cases, except that now the naive linear search
in the libc source is not optimized well.

Summary: use the simple naive libc source and wait for the compiler to
catch up for fls() and other functions.

The slowness of LIBMET0 relative to the builtin is just due to it making
function calls.  Here is a disassembly of ffs.o:

% 00000000 <ffs>:
%    0:   0f bc 44 24 04          bsf    0x4(%esp),%eax
%    5:   74 03                   je     a <L1>
%    7:   40                      inc    %eax
%    8:   c3                      ret
%    9:   90                      nop
% 
% 0000000a <L1>:
%    a:   31 c0                   xor    %eax,%eax
%    c:   c3                      ret

(Why does it align to merely an even boundary?)

Disassembly of fls.o shows a naive loop.  Quite good code for the inner
loop.  Now there is alignment to a 16-byte boundary and almost half of
the function's size is for padding.

Disassembly of ffsl.o shows a naive loop.  ffsl() is not optimized although
it is essentially the same as ffs().  ffsll() is of course fully pessimized.
You could write a much faster one using 2 ffs()'s and that would be good
enough for a portable library.  Splitting up can handle any number of bits.
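
A sketch of that split (hypothetical name; assumes 32-bit ints):

        #include <strings.h>            /* for ffs() */

        int
        ffsll_split(long long x)
        {
                unsigned long long u = (unsigned long long)x;

                /* The low word wins if it has any bit set. */
                if ((unsigned int)u != 0)
                        return (ffs((int)(unsigned int)u));
                if (u >> 32 != 0)
                        return (ffs((int)(u >> 32)) + 32);
                return (0);
        }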

On amd64, even ffs.o uses the naive loop.  The optimization is not in the
compiler at all.  Someone just wrote ffs.S for i386 only.  This and many
other machine-dependent optimizations shouldn't exist.  You can write them
better using builtins.

The weird even alignment is now easy to explain.  The alignment
statement was written in 1993 when .align 2 meant 4-byte alignment.
(I forget when this changed, but checked the FreeBSD-1 sources where
I documented what .align 2 meant and changed all the hard-coded .align
2's to use either the ALIGN macro or the ALIGN_TEXT macro.  ALIGN_TEXT
hard-codes a fill byte.  FreeBSD still uses these macros but spells
.align as .p2align of course.  Plain .align is too unportable to use
for anything.  The fill byte shouldn't be hard-coded but still is.)

Optimizing string functions is supposedly so important that the i386
optimizations are still the not-so-good ones tuned for original i386's,
with minor unimprovements from the drift in .align semantics.  Most
still use .align 2.  The exceptions are
a couple of wide char functions which use .p2align 2 and also .p2align 4.
Even in the kernel where the alignment statements are macro-ized, it is
practically impossible to do alignment as well as compilers.  There are
too many special cases (depending on the arch and code context).

Here is (most of) the code produced for ffs(x) { return __builtin_ffs(x); },
with default CFLAGS except for -O -fomit-frame-pointer:

%         .align  16, 0x90
% ffs:                                    # @ffs
%         movl    4(%esp), %eax
%         testl   %eax, %eax
%         je      .LBB0_2
%         bsfl    %eax, %eax
%         incl    %eax
% .LBB0_2:                                # %entry
%         ret

It does nothing special, and produces the same code that I wrote in the
inline asm in 1994.

Here is the libc code again:

% 00000000 <ffs>:
%    0:   0f bc 44 24 04          bsf    0x4(%esp),%eax
%    5:   74 03                   je     a <L1>
%    7:   40                      inc    %eax
%    8:   c3                      ret
%    9:   90                      nop
% 
% 0000000a <L1>:
%    a:   31 c0                   xor    %eax,%eax
%    c:   c3                      ret

This is pessimized by doing a less-predictable branch, and doing it
later.  This was
not so bad in 1994.  Modern CPUs can start working on the bsfl in parallel
to testing the value.  They can probably also predict the branch and thus
not really care about the order.

The branch is because I forgot to optimize for an arch that supports
cmove.  Compiling with -march=core2 gives:

        movl    4(%esp), %ecx
        bsfl    %ecx, %eax
        incl    %eax
        testl   %ecx, %ecx
        cmovel  %ecx, %eax
        ret

This explains another 10-20% of the speed of the builtin vs libc asm.

Bruce


