Date: Tue, 21 Oct 2014 23:05:58 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: David Chisnall <theraven@freebsd.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org,
    "Alexander V. Chernikov" <melifaro@freebsd.org>, Andriy Gapon <avg@freebsd.org>
Subject: Re: svn commit: r273274 - head/sys/netpfil/ipfw
Message-ID: <20141021203956.C1228@besplex.bde.org>
In-Reply-To: <6FEB1269-2A8D-48A7-A18E-2EAB9961EDA6@FreeBSD.org>
References: <201410191115.s9JBFJxA058370@svn.freebsd.org> <5443A83F.5090807@FreeBSD.org> <6FEB1269-2A8D-48A7-A18E-2EAB9961EDA6@FreeBSD.org>
On Tue, 21 Oct 2014, David Chisnall wrote:

> On 19 Oct 2014, at 13:02, Andriy Gapon <avg@FreeBSD.org> wrote:
>
>> I think that on platforms where an optimized version of fls() is available
>> that would work faster than this cool piece of bit magic.

Even a lightly optimized naive linear search might work faster.  The fancy
method has about 7 dependent operations, while the naive method has about
32 independent operations.

> If you're lucky, the compiler's idiom recogniser will spot this.  You're
> generally better off using the builtins though, because then the compiler
> will expand them to something sensible (hopefully - old versions of gcc
> did horribly inefficient things for bswap and clz on platforms without
> native support).

No one cared when cperciva optimized fls() using a lookup table and I
investigated 11 different methods to show that neither the builtins nor
the lookup table were especially good (or bad).

On i386, the best portable method is to use ilogb() (floating point).  The
next best is to use floating point directly.  These methods are not the
fastest, except probably for 64 bits on i386, but they are quite fast and
fairly portable.  Not all systems have fls() in libc, and FreeBSD has only
pessimized implementations there.

I just ran my benchmark on ref11-i386 and got the following:

% UNIFORM/BUILTIN_FFS:        0.08 real         0.08 user         0.00 sys
% UNIFORM/CPUFUNC_FFS:        0.12 real         0.11 user         0.00 sys
% UNIFORM/LIBMET0_FFS:        0.11 real         0.09 user         0.00 sys
% UNIFORM/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys
% UNIFORM/LUTMETH_FFS:        0.10 real         0.09 user         0.00 sys
% UNIFORM/CPUFUNC_FLS:        0.12 real         0.10 user         0.01 sys
% UNIFORM/ILOGMET_FLS:        0.14 real         0.13 user         0.00 sys
% UNIFORM/ILOGBME_FLS:        0.23 real         0.23 user         0.00 sys
% UNIFORM/ILOGBM0_FLS:        0.23 real         0.23 user         0.00 sys
% UNIFORM/LIBMET0_FLS:        1.63 real         1.61 user         0.00 sys
% UNIFORM/LIBMETH_FLS:        1.61 real         1.60 user         0.00 sys

Several bit patterns are tested, to try to see if there are cache effects.

The builtin is now clearly fastest (on an old Athlon64 in 32-bit mode with
gcc it loses to floating point:

@ UNIFORM/BUILTIN_FFS:        0.22 real         0.22 user         0.00 sys
@ UNIFORM/ILOGMET_FLS:        0.17 real         0.17 user         0.00 sys

), but even in a benchmark that spends 100% of its time doing ffs(), the
speedup from using the builtin over the next best method is only 20%.
Only the pessimized libc methods are very slow.

Programs like ministat are intentionally not used.  It takes long enough
to read the results of 55 simple tests.
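For reference, the "bit magic" style of fls() that the quoted commit
refers to is usually some variant of the following branch-free
construction.  This is a generic sketch, not necessarily the exact ipfw
code, and magic_fls() is just a name I made up; it shows where the chain
of dependent operations mentioned above comes from:

	static inline int
	magic_fls(int mask)
	{
		unsigned int x = (unsigned int)mask;

		/* Smear the highest set bit into every lower position. */
		x |= x >> 1;
		x |= x >> 2;
		x |= x >> 4;
		x |= x >> 8;
		x |= x >> 16;
		/* x now has fls(mask) low bits set; count them. */
		return (__builtin_popcount(x));
	}

Each smear step depends on the previous one, so the latency is a serial
chain, unlike the independently schedulable tests of the naive method.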
44 more follow:

% RANDBIT/BUILTIN_FFS:        0.08 real         0.08 user         0.00 sys
% RANDBIT/CPUFUNC_FFS:        0.12 real         0.12 user         0.00 sys
% RANDBIT/LIBMET0_FFS:        0.11 real         0.10 user         0.00 sys
% RANDBIT/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys
% RANDBIT/LUTMETH_FFS:        0.10 real         0.09 user         0.00 sys
% RANDBIT/CPUFUNC_FLS:        0.12 real         0.11 user         0.00 sys
% RANDBIT/ILOGMET_FLS:        0.14 real         0.13 user         0.00 sys
% RANDBIT/ILOGBME_FLS:        0.23 real         0.23 user         0.00 sys
% RANDBIT/ILOGBM0_FLS:        0.23 real         0.23 user         0.00 sys
% RANDBIT/LIBMET0_FLS:        1.25 real         1.24 user         0.00 sys
% RANDBIT/LIBMETH_FLS:        1.24 real         1.24 user         0.00 sys
% ALLZERO/BUILTIN_FFS:        0.08 real         0.07 user         0.00 sys
% ALLZERO/CPUFUNC_FFS:        0.05 real         0.04 user         0.00 sys
% ALLZERO/LIBMET0_FFS:        0.11 real         0.10 user         0.00 sys
% ALLZERO/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys
% ALLZERO/LUTMETH_FFS:        0.05 real         0.04 user         0.00 sys
% ALLZERO/CPUFUNC_FLS:        0.05 real         0.04 user         0.00 sys
% ALLZERO/ILOGMET_FLS:        0.07 real         0.06 user         0.00 sys
% ALLZERO/ILOGBME_FLS:        0.05 real         0.03 user         0.00 sys
% ALLZERO/ILOGBM0_FLS:        0.05 real         0.04 user         0.00 sys
% ALLZERO/LIBMET0_FLS:        0.18 real         0.17 user         0.00 sys
% ALLZERO/LIBMETH_FLS:        0.20 real         0.20 user         0.00 sys
% ALLONE_/BUILTIN_FFS:        0.08 real         0.08 user         0.00 sys
% ALLONE_/CPUFUNC_FFS:        0.12 real         0.12 user         0.00 sys
% ALLONE_/LIBMET0_FFS:        0.11 real         0.10 user         0.00 sys
% ALLONE_/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys
% ALLONE_/LUTMETH_FFS:        0.10 real         0.09 user         0.00 sys
% ALLONE_/CPUFUNC_FLS:        0.12 real         0.12 user         0.00 sys
% ALLONE_/ILOGMET_FLS:        0.10 real         0.09 user         0.00 sys
% ALLONE_/ILOGBME_FLS:        0.23 real         0.23 user         0.00 sys
% ALLONE_/ILOGBM0_FLS:        0.23 real         0.23 user         0.00 sys
% ALLONE_/LIBMET0_FLS:        0.20 real         0.19 user         0.00 sys
% ALLONE_/LIBMETH_FLS:        0.20 real         0.19 user         0.00 sys
% ALLLAST/BUILTIN_FFS:        0.08 real         0.08 user         0.00 sys
% ALLLAST/CPUFUNC_FFS:        0.13 real         0.11 user         0.00 sys
% ALLLAST/LIBMET0_FFS:        0.11 real         0.10 user         0.00 sys
% ALLLAST/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys
% ALLLAST/LUTMETH_FFS:        0.10 real         0.09 user         0.00 sys
% ALLLAST/CPUFUNC_FLS:        0.13 real         0.12 user         0.00 sys
% ALLLAST/ILOGMET_FLS:        0.07 real         0.06 user         0.00 sys
% ALLLAST/ILOGBME_FLS:        0.23 real         0.23 user         0.00 sys
% ALLLAST/ILOGBM0_FLS:        0.23 real         0.23 user         0.00 sys
% ALLLAST/LIBMET0_FLS:        1.57 real         1.55 user         0.00 sys
% ALLLAST/LIBMETH_FLS:        1.57 real         1.56 user         0.00 sys

Part of the program with the methods:

% #if defined(CPUFUNC_FFS) || defined(CPUFUNC_FLS)
% #define _KERNEL			/* now needed to get non-builtin */
% #include <machine/cpufunc.h>
% /* MI library versions: */
% #elif ILOGBM0_FLS
% #include "/usr/src/lib/msun/src/s_ilogb.c"
% #elif LIBMET0_FLS
% #include "/usr/src/lib/libc/string/fls.c"
% #elif LIBMET0_FFS
% #include "/usr/src/lib/libc/string/ffs.c"
% #endif
%
% static int ffslut32[32];
%
% static inline int
% ilog_fls(int x)
% {
% 	union {
% 		float fvalue;
% 		uint32_t uvalue;
% 	} u;
%
% 	if (x <= 0)
% 		return (x == 0 ? 0 : 32);
% 	u.fvalue = x;
% 	return ((u.uvalue >> 23) - 127 + 1);
% }

This is like the libm (fdlibm/FreeBSD) ilogb(), with minor optimizations
and de-obfuscations.  It might be too sloppy to be correct.

% static inline int
% ilogb_fls(int x)
% {
% 	return (x == 0 ? 0 : ilogbf(x) + 1);
% }

The portable FP method.

% static inline int
% lut_ffs(int mask)
% {
% 	return (mask ? ffslut32[(((mask & (-mask)) * 0x0FB9AC52) >> 27) + 16] : 0);
% }

cperciva's lookup table method.  It is not simply examining 8 nybbles
using a lookup table of size 16.

% ...
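The initialization of ffslut32[] is in the elided part of the program.
Assuming the 0x0FB9AC52 multiplier is a perfect hash over the 32 possible
isolated bits (which the table size suggests), it could be built along the
following lines.  This is my sketch with a hypothetical init_ffslut32()
helper, not cperciva's actual code:

	#include <stdint.h>

	/* Fill in ffslut32[] (declared in the program above). */
	static void
	init_ffslut32(void)
	{
		int k;

		for (k = 0; k < 32; k++) {
			/*
			 * The same index expression that lut_ffs() computes
			 * for mask = 1 << k; on the usual compilers the
			 * shift is an arithmetic right shift of a signed
			 * 32-bit value.
			 */
			ffslut32[((int32_t)((1U << k) * 0x0FB9AC52U) >> 27) + 16] = k + 1;
		}
	}

Calling this once before the timing loops would populate the table that
lut_ffs() indexes.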
% #ifdef BUILTIN_FFS
% 		v = __builtin_ffs(z[i]);
% #elif CPUFUNC_FFS
% 		v = ffs(z[i]);
% #elif CPUFUNC_FLS
% 		v = fls(z[i]);
% #elif ILOGMET_FLS
% 		v = ilog_fls(z[i]);
% #elif ILOGBME_FLS
% 		v = ilogb_fls(z[i]);
% #elif ILOGBM0_FLS
% 		v = ilogb_fls(z[i]);
% #elif LIBMET0_FFS
% 		v = ffs(z[i]);
% #elif LIBMET0_FLS
% 		v = fls(z[i]);
% #elif LIBMETH_FFS
% 		v = ffs(z[i]);
% #elif LIBMETH_FLS
% 		v = fls(z[i]);
% #elif LUTMETH_FFS
% 		v = lut_ffs(z[i]);
% #else
% #error "No method"
% 		;
% #endif

Further comments on the methods:

% RANDBIT/BUILTIN_FFS:        0.08 real         0.08 user         0.00 sys
% RANDBIT/CPUFUNC_FFS:        0.12 real         0.12 user         0.00 sys

CPUFUNC_* is the kernel version.  On amd64, it is #defined to the builtin,
but ffsll() (bletch) still uses old methods.  (Apart from using the long
long abomination, ffsll() and flsll() shouldn't exist in the kernel.  They
are used approximately once, in code that can be written better without
them.)

This is on i386.  The i386 kernel ffs() still uses my 1994 de-pessimization
for gcc-2.  This is now slower than the builtin.  Even the current amd64
methods are similar, except for ffs().

There is a problem converting the result of bsf*/bsr* to ffs*()/fls*()
efficiently.  gcc-2 used a method with a badly-placed branch, and I changed
FreeBSD to use a less badly-placed branch.  Branches were much more
important in 1994, with no branch prediction or caches.  Now they don't
matter if they are correctly predicted.  The tests try to get them
mispredicted, but are probably defeated by the prediction being too smart.

% RANDBIT/LIBMET0_FFS:        0.11 real         0.10 user         0.00 sys

LIBMET0 #includes the libc source.  It is a naive linear search, but it is
optimized well -- though apparently not to the same code as the builtin.

% RANDBIT/LIBMETH_FFS:        0.11 real         0.10 user         0.00 sys

LIBMETH is a function call to libc.  Statically linked, of course.  Other
CFLAGS are -O -march=athlon-xp.  Not the right arch.  clang gives excessive
optimizations with -O, so -O2 would make little difference.

% RANDBIT/LUTMETH_FFS:        0.10 real         0.09 user         0.00 sys

cperciva's lookup table method.  Only implemented for ffs().

% RANDBIT/CPUFUNC_FLS:        0.12 real         0.11 user         0.00 sys

Same time as CPUFUNC_FFS.  There is no test for BUILTIN_FLS since the
kernel didn't have it when this was written.

% RANDBIT/ILOGMET_FLS:        0.14 real         0.13 user         0.00 sys

The direct FP method is now not the fastest.

% RANDBIT/ILOGBME_FLS:        0.23 real         0.23 user         0.00 sys

The function-call FP method never was the fastest.

% RANDBIT/ILOGBM0_FLS:        0.23 real         0.23 user         0.00 sys

The function-call FP method with the function code exposed in the same
compilation unit -- this makes no difference.

% RANDBIT/LIBMET0_FLS:        1.25 real         1.24 user         0.00 sys
% RANDBIT/LIBMETH_FLS:        1.24 real         1.24 user         0.00 sys

As for the LIBMET*_FFS cases, except that now the naive linear search in
the libc source is not optimized well.  Summary: use the simple naive libc
source and wait for the compiler to catch up for fls() and other functions.
The slowness of LIBMET0 relative to the builtin is just due to it making
function calls.

Here is a disassembly of ffs.o:

% 00000000 <ffs>:
%    0:	0f bc 44 24 04       	bsf    0x4(%esp),%eax
%    5:	74 03                	je     a <L1>
%    7:	40                   	inc    %eax
%    8:	c3                   	ret
%    9:	90                   	nop
%
% 0000000a <L1>:
%    a:	31 c0                	xor    %eax,%eax
%    c:	c3                   	ret

(Why does it align to merely an even boundary?)

Disassembly of fls.o shows a naive loop.  Quite good code for the inner
loop.  Now there is alignment to a 16-byte boundary, and almost half of
the function's size is for padding.

Disassembly of ffsl.o shows a naive loop.  ffsl() is not optimized,
although it is essentially the same as ffs().
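For concreteness, the naive loop in question looks roughly like this (a
from-memory sketch of the shape of the libc code, not the verbatim source):

	/* Find the most significant set bit by naive linear search. */
	int
	fls(int mask)
	{
		int bit;

		if (mask == 0)
			return (0);
		for (bit = 1; mask != 1; bit++)
			mask = (unsigned int)mask >> 1;
		return (bit);
	}

The worst case scans all 32 bits.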
ffsll() is of course fully pessimized.  You could write a much faster one
using 2 ffs()'s, and that would be good enough for a portable library.
Splitting up like that can handle any number of bits.

On amd64, even ffs.o uses the naive loop.  The optimization is not in the
compiler at all.  Someone just wrote ffs.S for i386 only.  This and many
other machine-dependent optimizations shouldn't exist.  You can write them
better using builtins.

The weird even alignment is now easy to explain.  The alignment statement
was written in 1993, when .align 2 meant 4-byte alignment.  (I forget when
this changed, but I checked the FreeBSD-1 sources, where I documented what
.align 2 meant and changed all the hard-coded .align 2's to use either the
ALIGN macro or the ALIGN_TEXT macro.  ALIGN_TEXT hard-codes a fill byte.
FreeBSD still uses these macros, but spells .align as .p2align of course.
Plain .align is too unportable to use for anything.  The fill byte
shouldn't be hard-coded, but still is.)

Optimizing string functions is so important that the i386 versions are
still the not-so-good ones written for the original i386, with minor
unimprovements from the drift in .align.  Most still use .align 2.  The
exceptions are a couple of wide char functions which use .p2align 2 and
also .p2align 4.  Even in the kernel, where the alignment statements are
macro-ized, it is practically impossible to do alignment as well as
compilers do.  There are too many special cases (depending on the arch and
the code context).

Here is (most of) the code produced for ffs(x) { return __builtin_ffs(x); }
with default CFLAGS except for -O -fomit-frame-pointer:

% 	.align	16, 0x90
% ffs:                                    # @ffs
% 	movl	4(%esp), %eax
% 	testl	%eax, %eax
% 	je	.LBB0_2
% 	bsfl	%eax, %eax
% 	incl	%eax
% .LBB0_2:                                # %entry
% 	ret

It does nothing special, and produces the same code that I wrote in the
inline asm in 1994.  Here is the libc code again:

% 00000000 <ffs>:
%    0:	0f bc 44 24 04       	bsf    0x4(%esp),%eax
%    5:	74 03                	je     a <L1>
%    7:	40                   	inc    %eax
%    8:	c3                   	ret
%    9:	90                   	nop
%
% 0000000a <L1>:
%    a:	31 c0                	xor    %eax,%eax
%    c:	c3                   	ret

This is pessimized by doing a less predictable branch, later.  This was not
so bad in 1994.  Modern CPUs can start working on the bsfl in parallel with
testing the value.  They can probably also predict the branch and thus not
really care about the order.  The branch is there because I forgot to
optimize for an arch that supports cmove.  Compiling with -march=core2
gives:

	movl	4(%esp), %ecx
	bsfl	%ecx, %eax
	incl	%eax
	testl	%ecx, %ecx
	cmovel	%ecx, %eax
	ret

This explains another 10-20% of the speed of the builtin vs the libc asm.

Bruce