Date: Tue, 26 Aug 2003 20:00:20 +1000 (EST) From: Bruce Evans <bde@zeta.org.au> To: Marcel Moolenaar <marcel@xcllnt.net> Cc: cvs-all@FreeBSD.org Subject: Re: cvs commit: src/sys/gnu/ext2fs ext2_bitops.hext2_linux_balloc.c ext2_linux_ialloc.c ia64-bitops.h Message-ID: <20030826192612.Q9540@gamplex.bde.org> In-Reply-To: <20030826082035.GA728@dhcp43.pn.xcllnt.net> References: <200308250139.h7P1dlwg002331@repoman.freebsd.org> <20030826082035.GA728@dhcp43.pn.xcllnt.net>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, 26 Aug 2003, Marcel Moolenaar wrote: > On Tue, Aug 26, 2003 at 05:44:40PM +1000, Bruce Evans wrote: > > On Sun, 24 Aug 2003, Marcel Moolenaar wrote: > > > ... > > > Log: > > > Change of plans: Add ext2_bitops.h with generic and portable > > > implementations. Use those on platforms that don't have MD > > > ... > > > > Thnaks. I'll deventually do some benchmarks to see if we want the MD > > (inline assembler) versions at all. > > Ok. I didn't write the functions for speed so there may be some > opportunities after analyzing what the compiler made of it. > > BTW: I noticed that GCC has a builtin implementation for ffs(). > It may be beneficial. Is this more than ffs(3)? (It could be for bit strings generally.). I recently benchmarked ffs(3). Too many builtins are turned off by -fno-builtin for the kernel, but on i386's we have an inline asm version of ffs() which is still better than the gcc builtin according to my benchmarks. The main difference is that inline asm version uses a conditional branch while the builtin uses extra instructions to avoid the branch. Somehow the branch was always faster on my test systems (old Celeron, AthlonXP and freefall(?)). I tested with a moving bit to try to defeat branch target caches, but forgot to test with alternating zero and nonzero data which would probably cause more branch mispredictions. Data which is zero much more often than random data (probability 1/2^32) may be unusual anyway. IIRC, the amount by which the extra instructions were slower was quite machine-dependent, with the AthlonXP handling them almost as fast as a branch. The extra instructions look like this: movl $0, %edx bsfl x, %eax # result less 1 if x was nonzero; else garbage sete %dl # begin fixup for garbage case negl %edx orl %edx, %eax # end fixup for garbage case incl %eax # fix up semantic mismatch between bsf and ffs This is highly non-parallelizable, so it is probably as inefficient as it looks to a naive instruction counter. Of course, any optimizations and pessimizations here are only a few cycles, so they don't show up except in micro-benchmarks. Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20030826192612.Q9540>