Date: Sun, 7 Aug 2005 08:13:27 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Xin LI <delphij@frontfree.net>
Cc: freebsd-amd64@freebsd.org, freebsd-arch@freebsd.org
Subject: Re: [RFC] Port of NetBSD's optimized amd64 string code
Message-ID: <20050807050237.H3151@epsplex.bde.org>
In-Reply-To: <20050802164120.GA16775@frontfree.net>
References: <20050801182518.GA85423@frontfree.net> <20050802013916.GA37135@dragon.NUXI.org> <20050802164120.GA16775@frontfree.net>
On Wed, 3 Aug 2005, Xin LI wrote:

> On Mon, Aug 01, 2005 at 06:39:16PM -0700, David O'Brien wrote:
>> On Tue, Aug 02, 2005 at 02:25:18AM +0800, Xin LI wrote:
>>> Here is a patchset that I have produced to make our libc aware of the
>>> NetBSD assembly implementation of the string related operations.
>>
>> What performance benchmarks have these been thru?

I would expect insignificant ones.  alc already optimized most of the
amd64 string functions that are worth optimizing.

> In summary: index, rindex, memchr, strchr, strlen, strrchr are
> faster than their C counterparts; ffs, strncmp are about the same, and
> swab is worse.

This is not surprising, since the faster ones mostly use better
algorithms, so their speed has very little to do with their being
written in asm; ffs.S has slightly worse code than the gcc builtin,
strncmp.S doesn't do anything special, and swab.S does almost everything
pessimally.

> Note that these tests are done within loops that first copy a string to
> a new place, then perform the operation, then free the string, to defeat
> cache effects.

I usually test like this since it is hard to do better, but this tests
precisely the things that shouldn't happen in real applications.

> BETTER CASES
>
> [delphij@warrior7] ~/test/index> /usr/bin/time ./index
>         5.81 real         5.79 user         0.00 sys
> ASSEMBLY
>         1.84 real         1.82 user         0.00 sys
>
> [delphij@warrior7] ~/test/rindex> /usr/bin/time ./rindex
>         6.25 real         6.24 user         0.00 sys
> ASSEMBLY
>         2.17 real         1.84 user         0.01 sys
>
> [delphij@warrior7] ~/test/memchr> /usr/bin/time ./memchr
>         5.93 real         5.91 user         0.00 sys
> ASSEMBLY
>         1.91 real         1.84 user         0.00 sys
>
> [delphij@warrior7] ~/test/strchr> /usr/bin/time ./strchr
>         5.93 real         5.91 user         0.00 sys
> ASSEMBLY
>         1.84 real         1.83 user         0.00 sys

memchr.S, strchr.S (and thus index.S), and strrchr.S (strrchr timings
below) (and thus rindex.S) all use the well known method of comparing
with the magic byte sequences \x80\x80... and \x01\x01...  It's not
surprising that this is several times faster, since it accesses the data
4-8 bytes at a time (8 bytes for amd64), but IMO this optimization
belongs only in C code (if anywhere) so that it optimizes all arches (if
it is an optimization).  glibc has it in C code.  NetBSD also has it in
the i386 string library but not in the generic string library.

> [delphij@warrior7] ~/test/strlen> /usr/bin/time ./strlen
>         7.13 real         7.12 user         0.00 sys
> ASSEMBLY
>         1.44 real         1.43 user         0.00 sys

strlen.S uses the magic byte sequence method in the same way (NetBSD
also does this in the i386 strlen.S...).  The method works even better
for strlen(), giving closer to an 8-fold speedup, since there is less
overhead.

Other benchmarks for strlen() show strange behaviour.  (Note that
neither the generic library version nor the MD i386/amd64 version of it
is normally used, since the gcc builtin is normally used.)
- the builtin is slightly faster than the MD i386 version, since they
  use the same algorithm (a scasb loop) and the builtin doesn't have
  function call overhead.
- the simple generic C version is often faster than the "optimized"
  builtin!  IIRC, alc noticed this for amd64 and didn't implement an
  amd64 strlen.S using the same simple algorithm as the i386 one because
  it would be a pessimization even if it were actually used.  It seems
  to be pessimal on Athlon XP's too.
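For reference, the zero-byte test that these routines rely on is easy to
show in portable C.  The sketch below is only an illustration (the
function name sketch_strlen and its structure are invented here; it is
neither the NetBSD asm nor glibc's code), but it is the same
\x01.../\x80... idea applied to strlen():

    #include <stdint.h>
    #include <string.h>

    static size_t
    sketch_strlen(const char *s)
    {
    	const char *p = s;
    	uint64_t w;

    	/* Scan byte by byte until p is 8-byte aligned. */
    	while ((uintptr_t)p % sizeof(uint64_t) != 0) {
    		if (*p == '\0')
    			return (p - s);
    		p++;
    	}

    	/* Scan a word at a time until some byte in the word is zero. */
    	for (;;) {
    		/* p is aligned, so this compiles to one 8-byte load. */
    		memcpy(&w, p, sizeof(w));
    		if ((w - 0x0101010101010101ULL) & ~w &
    		    0x8080808080808080ULL)
    			break;
    		p += sizeof(w);
    	}

    	/* The final word contains the terminator; find it byte by byte. */
    	while (*p != '\0')
    		p++;
    	return (p - s);
    }

The word loop can read up to 7 bytes past the terminator; an aligned
8-byte load cannot cross a page boundary, which is what makes the trick
safe in practice, but it is also one reason the C variants need some
care.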
For 10^8 strlen()s, with other CFLAGS -O2 -static:

On an AXP2600 overclocked:
%%%
string length 0:
algorithm -fbuiltin:        0.30 real  0.29 user  0.00 sys
algorithm -fno-builtin:     0.33 real  0.32 user  0.00 sys
algorithm -DMISTRLEN:       0.25 real  0.24 user  0.00 sys
string length 1:
algorithm -fbuiltin:        0.31 real  0.30 user  0.00 sys
algorithm -fno-builtin:     0.34 real  0.33 user  0.00 sys
algorithm -DMISTRLEN:       0.26 real  0.25 user  0.00 sys
string length 10:
algorithm -fbuiltin:        0.40 real  0.39 user  0.00 sys
algorithm -fno-builtin:     0.43 real  0.42 user  0.00 sys
algorithm -DMISTRLEN:       0.40 real  0.39 user  0.00 sys
string length 100:
algorithm -fbuiltin:        1.32 real  1.30 user  0.00 sys
algorithm -fno-builtin:     1.35 real  1.33 user  0.00 sys
algorithm -DMISTRLEN:       1.21 real  1.20 user  0.00 sys
string length 1000:
algorithm -fbuiltin:       10.46 real 10.42 user  0.00 sys
algorithm -fno-builtin:    10.50 real 10.46 user  0.00 sys
algorithm -DMISTRLEN:       9.35 real  9.31 user  0.00 sys
%%%

Note that the MI version is fastest for all string lengths.  The C
version was sometimes (in results with -O1 and/or in results not shown
above) 30% slower than it usually is; this turned out to be because the
start of the loop accidentally began near a cache line boundary, so the
loop in strlen() did not fit in a cache line.

On an A64 3400:
%%%
string length 0:
algorithm -fbuiltin:             0.34 real  0.32 user  0.00 sys
algorithm -fno-builtin:          0.24 real  0.23 user  0.00 sys
algorithm -fno-builtin zq.S:     0.27 real  0.26 user  0.00 sys
string length 1:
algorithm -fbuiltin:             0.35 real  0.33 user  0.00 sys
algorithm -fno-builtin:          0.25 real  0.24 user  0.00 sys
algorithm -fno-builtin zq.S:     0.29 real  0.28 user  0.00 sys
string length 10:
algorithm -fbuiltin:             0.44 real  0.42 user  0.00 sys
algorithm -fno-builtin:          0.42 real  0.40 user  0.00 sys
algorithm -fno-builtin zq.S:     0.30 real  0.29 user  0.00 sys
string length 100:
algorithm -fbuiltin:             1.34 real  1.32 user  0.00 sys
algorithm -fno-builtin:          1.32 real  1.30 user  0.00 sys
algorithm -fno-builtin zq.S:     0.55 real  0.54 user  0.00 sys
string length 1000:
algorithm -fbuiltin:            10.39 real 10.37 user  0.00 sys
algorithm -fno-builtin:         10.37 real 10.34 user  0.00 sys
algorithm -fno-builtin zq.S:     2.25 real  2.23 user  0.00 sys
%%%

zq.S is your proposed strlen.S.

>
> [delphij@warrior7] ~/test/strrchr> /usr/bin/time ./strrchr
>         9.12 real         9.08 user         0.00 sys
> ASSEMBLY
>         4.71 real         4.69 user         0.00 sys
>
> WORSE/NO EFFECTS
>
> [delphij@warrior7] ~/test/ffs> /usr/bin/time ./ffs
>         0.33 real         0.33 user         0.00 sys
> ASSEMBLY
>         0.33 real         0.33 user         0.00 sys

The MI library version is poor for ffs, but the builtin is good.  It is
easy to improve the MI version, at least in benchmarks, using a lookup
table, but efficiency of ffs() is unimportant in most applications so no
one has done it.
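To make the lookup-table idea concrete, here is a sketch of what such an
ffs() might look like.  This is my reconstruction of the idea only (the
names ffs_tab, ffs_tab_init and lut_ffs are invented here, and this is
not cperciva's actual code); it corresponds roughly to the LUTMETH_FFS
entries in the benchmark that follows:

    /*
     * ffs_tab[b] holds the 1-based index of the lowest set bit in the
     * byte b, with 0 for b == 0.  Call ffs_tab_init() once before use;
     * a real version would use a precomputed const table instead.
     */
    static unsigned char ffs_tab[256];

    static void
    ffs_tab_init(void)
    {
    	int b, i;

    	for (b = 1; b < 256; b++)
    		for (i = 0; i < 8; i++)
    			if (b & (1 << i)) {
    				ffs_tab[b] = i + 1;
    				break;
    			}
    }

    static int
    lut_ffs(int mask)
    {
    	unsigned int v = (unsigned int)mask;
    	int shift;

    	if (v == 0)
    		return (0);
    	/* Skip zero bytes, then one table lookup finishes the job. */
    	for (shift = 0; (v & 0xff) == 0; shift += 8)
    		v >>= 8;
    	return (shift + ffs_tab[v & 0xff]);
    }

The point is only that a short byte-skipping loop plus one table lookup
replaces most of the per-bit work of the MI C version.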
From an old benchmark of various methods on an A64:

%%%
UNIFORM/BUILTIN_FFS:  0.27 real  0.26 user  0.00 sys
UNIFORM/CPUFUNC_FFS:  0.27 real  0.26 user  0.00 sys
UNIFORM/LIBMET0_FFS:  0.70 real  0.70 user  0.00 sys
UNIFORM/LIBMETH_FFS:  0.72 real  0.72 user  0.00 sys
UNIFORM/LUTMETH_FFS:  0.25 real  0.24 user  0.00 sys
UNIFORM/CPUFUNC_FLS:  0.33 real  0.33 user  0.00 sys
UNIFORM/ILOGMET_FLS:  0.18 real  0.18 user  0.00 sys
UNIFORM/ILOGBME_FLS:  0.37 real  0.36 user  0.00 sys
UNIFORM/ILOGBM0_FLS:  0.39 real  0.39 user  0.00 sys
UNIFORM/LIBMET0_FLS:  1.67 real  1.67 user  0.00 sys
UNIFORM/LIBMETH_FLS:  1.71 real  1.71 user  0.00 sys
RANDBIT/BUILTIN_FFS:  0.27 real  0.26 user  0.00 sys
RANDBIT/CPUFUNC_FFS:  0.27 real  0.26 user  0.00 sys
RANDBIT/LIBMET0_FFS:  1.43 real  1.43 user  0.00 sys
RANDBIT/LIBMETH_FFS:  1.44 real  1.43 user  0.00 sys
RANDBIT/LUTMETH_FFS:  0.25 real  0.24 user  0.00 sys
RANDBIT/CPUFUNC_FLS:  0.33 real  0.33 user  0.00 sys
RANDBIT/ILOGMET_FLS:  0.17 real  0.17 user  0.00 sys
RANDBIT/ILOGBME_FLS:  0.41 real  0.41 user  0.00 sys
RANDBIT/ILOGBM0_FLS:  0.39 real  0.39 user  0.00 sys
RANDBIT/LIBMET0_FLS:  1.16 real  1.16 user  0.00 sys
RANDBIT/LIBMETH_FLS:  1.16 real  1.16 user  0.00 sys
ALLZERO/BUILTIN_FFS:  0.27 real  0.26 user  0.00 sys
ALLZERO/CPUFUNC_FFS:  0.27 real  0.26 user  0.00 sys
ALLZERO/LIBMET0_FFS:  0.53 real  0.53 user  0.00 sys
ALLZERO/LIBMETH_FFS:  0.55 real  0.55 user  0.00 sys
ALLZERO/LUTMETH_FFS:  0.12 real  0.12 user  0.00 sys
ALLZERO/CPUFUNC_FLS:  0.10 real  0.10 user  0.00 sys
ALLZERO/ILOGMET_FLS:  0.12 real  0.12 user  0.00 sys
ALLZERO/ILOGBME_FLS:  0.12 real  0.12 user  0.00 sys
ALLZERO/ILOGBM0_FLS:  0.10 real  0.10 user  0.00 sys
ALLZERO/LIBMET0_FLS:  0.24 real  0.24 user  0.00 sys
ALLZERO/LIBMETH_FLS:  0.29 real  0.28 user  0.00 sys
ALLONE_/BUILTIN_FFS:  0.27 real  0.26 user  0.00 sys
ALLONE_/CPUFUNC_FFS:  0.27 real  0.26 user  0.00 sys
ALLONE_/LIBMET0_FFS:  0.58 real  0.57 user  0.00 sys
ALLONE_/LIBMETH_FFS:  0.60 real  0.59 user  0.00 sys
ALLONE_/LUTMETH_FFS:  0.25 real  0.24 user  0.00 sys
ALLONE_/CPUFUNC_FLS:  0.37 real  0.37 user  0.00 sys
ALLONE_/ILOGMET_FLS:  0.16 real  0.16 user  0.00 sys
ALLONE_/ILOGBME_FLS:  0.37 real  0.37 user  0.00 sys
ALLONE_/ILOGBM0_FLS:  0.39 real  0.39 user  0.00 sys
ALLONE_/LIBMET0_FLS:  0.25 real  0.24 user  0.00 sys
ALLONE_/LIBMETH_FLS:  0.29 real  0.28 user  0.00 sys
ALLLAST/BUILTIN_FFS:  0.27 real  0.26 user  0.00 sys
ALLLAST/CPUFUNC_FFS:  0.27 real  0.26 user  0.00 sys
ALLLAST/LIBMET0_FFS:  2.37 real  2.36 user  0.00 sys
ALLLAST/LIBMETH_FFS:  2.39 real  2.38 user  0.00 sys
ALLLAST/LUTMETH_FFS:  0.25 real  0.24 user  0.00 sys
ALLLAST/CPUFUNC_FLS:  0.37 real  0.37 user  0.00 sys
ALLLAST/ILOGMET_FLS:  0.12 real  0.12 user  0.00 sys
ALLLAST/ILOGBME_FLS:  0.37 real  0.37 user  0.00 sys
ALLLAST/ILOGBM0_FLS:  0.39 real  0.39 user  0.00 sys
ALLLAST/LIBMET0_FLS:  1.77 real  1.76 user  0.00 sys
ALLLAST/LIBMETH_FLS:  1.81 real  1.81 user  0.00 sys
%%%

Here various distributions of the bits are tested, with various
algorithms:

BUILTIN_FFS: whatever gcc gives
CPUFUNC_FFS: as in <machine/cpufunc.h> (this uses the builtin for amd64)
LIBMET0_FFS: LIBMETH_FFS compiled with benchmark's CFLAGS
LIBMETH_FFS: MI library method
LUTMETH_FFS: a lookup table method, originally by cperciva
CPUFUNC_FLS: whatever gcc gives
ILOGMET_FLS: convert to floating point and extract the exponent directly
ILOGBME_FLS: convert to floating point and use ilogb(3) to extract...
ILOGBM0_FLS: ILOGBME_FLS compiled with benchmark's CFLAGS
LIBMET0_FLS: MI library method compiled with benchmark's CFLAGS
LIBMETH_FLS: MI library method

> [delphij@warrior7] ~/test/strncmp> /usr/bin/time ./strncmp
>         3.90 real         3.88 user         0.00 sys
>         3.88 real         3.87 user         0.00 sys
> ASSEMBLY
>         3.88 real         3.87 user         0.00 sys
>
> [delphij@warrior7] ~/test/swab> /usr/bin/time ./swab
>         6.59 real         6.53 user         0.01 sys
> ASSEMBLY
>         8.06 real         8.05 user         0.00 sys
>

Some comments on the patch.

% Index: ffs.S
% ===================================================================
% RCS file: ffs.S
% diff -N ffs.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ ffs.S	1 Aug 2005 17:54:04 -0000
% @@ -0,0 +1,22 @@
% +/*
% + * Written by J.T. Conklin <jtc@NetBSD.org>.
% + * Public domain.
% + * Adapted for NetBSD/x86_64 by Frank van der Linden <fvdl@wasabisystems.com>
% + */
% +
% +#include <machine/asm.h>
% +__FBSDID("$FreeBSD$");
% +
% +#if 0
% +	RCSID("$NetBSD: ffs.S,v 1.2 2003/07/26 19:24:38 salo Exp $")
% +#endif
% +
% +ENTRY(ffs)
% +	bsfl	%edi,%eax
% +	jz	L1			/* ZF is set if all bits are 0 */
% +	incl	%eax			/* bits numbered from 1, not 0 */
% +	ret
% +
% +	.align	4
% +L1:	xorl	%eax,%eax		/* clear result */
% +	ret

The amd64 builtin uses cmove to avoid the branch.  This is always
better, but is only a small optimization (see above benchmark).

The i386 ffs() does the test and branch before the bsfl so as to avoid
the bsfl if the arg is 0.  This significantly optimizes the case of a
zero arg at no cost for nonzero args.

The gcc builtin ffs() has evolved in many stages:
- very old versions depended on undefined behaviour.  The i386 kernel
  ffs() was written to avoid this bug and it was easy to optimize it
  while doing this.
- not old versions generated essentially the above code.
- not so old versions (3.3.3) generate "sete; negl; orl" to avoid the
  branch.  This is a less efficient variant of the cmove.

% Index: memchr.S
% ===================================================================
% RCS file: memchr.S
% diff -N memchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ memchr.S	1 Aug 2005 18:09:44 -0000
% @@ -0,0 +1,112 @@

This is good code (algorithm and style).  I don't mind using this
micro-optimization if someone else writes and maintains it :-).

% ...
% +.Lzero:
% +	xorq	%rax,%rax

Some of the other functions only return %eax, which is sufficient for an
int but is a different style.

% Index: strchr.S
% ===================================================================
% RCS file: strchr.S
% diff -N strchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strchr.S	1 Aug 2005 18:11:51 -0000
% @@ -0,0 +1,137 @@

Similarly to memchr.S....

% ...
% +	subq	$7,%rdi
% +	jmp	.Ldone
% +1:	testb	%dl,%dl		/* 2nd byte == 0? */

... but not so good style near the end of it:
- nameless labels are fairly easy to write but hard to read
- no blank line before separate blocks of code
- labels on the same line as statements

% Index: strlen.S
% ===================================================================
% RCS file: strlen.S
% diff -N strlen.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strlen.S	1 Aug 2005 18:12:48 -0000
% @@ -0,0 +1,157 @@

Not as good style as memchr.S.

% +	/*
% +	 * There are many well known branch-free sequences which are used
% +	 * for determining whether a zero-byte is contained within a word.
% +	 * These sequences are generally much more efficent than loading
% +	 * and comparing each byte individually.
% +	 *
% +	 * The expression [1,2]:
% +	 *
% +	 *	(1) ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f))
% +	 *
% +	 * evaluates to a non-zero value if any of the bytes in the
% +	 * original word is zero.
% ...

This comment has some spelling errors and doesn't belong here.  The
algorithm is mostly general and should be described centrally.  A more
complicated variant of it is used without comment in memchr.S and
strchr.S.  Describing things in terms of PowerPC (?) clz instructions
belongs even less in the i386 asm version than it does in a general
comment.

% Index: strncmp.S
% ===================================================================
% RCS file: strncmp.S
% diff -N strncmp.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strncmp.S	1 Aug 2005 18:13:51 -0000
% @@ -0,0 +1,108 @@

Not as good as memchr.S.  It uses an algorithm that apparently gives no
better results than the generic C one, and uses meaningless though named
labels throughout.

% Index: strrchr.S
% ===================================================================
% RCS file: strrchr.S
% diff -N strrchr.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ strrchr.S	1 Aug 2005 18:15:07 -0000
% @@ -0,0 +1,127 @@

Similarly to strchr.S, but less worth optimizing.

% Index: swab.S
% ===================================================================
% RCS file: swab.S
% diff -N swab.S
% --- /dev/null	1 Jan 1970 00:00:00 -0000
% +++ swab.S	1 Aug 2005 18:18:17 -0000
% @@ -0,0 +1,47 @@

Shows how not to optimize things.

% +#define LOAD_SWAP_STORE_WORD \
% +	lodsw			; \

The lods instruction should never be used, since it is very slow.

% +	xchgb	%al,%ah		; \

Partial register operations should be avoided.  I don't know if this is
important on amd64's.  On Athlons, xchgb of 2 registers is a VectorPath
instruction and takes 2 cycles, so it is probably possible to do 2 byte
loads to the right places (%al and %ah) in the same time that it takes
to fix up the bytes after slowly loading them as a word, provided there
is no penalty for the partial register operations.  I think the C code
wins by doing something similar.

% +	stosw

stosw is OK.

% +L2:	shrq	$3,%rdx		# copy remainder 8 words at a time
% +	jz	L4		# while swapping alternate bytes.
% +L3:
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +	LOAD_SWAP_STORE_WORD
% +
% +	decq	%rdx
% +	jnz	L3

To optimize the amd64 version, I would first try loading 8 bytes at a
time and swapping them using a minimal combination of bswaps and
rotates.

Bruce