From owner-freebsd-arch@FreeBSD.ORG Tue Jan 13 16:18:49 2009 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 6C3FE106564A for ; Tue, 13 Jan 2009 16:18:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from fallbackmx10.syd.optusnet.com.au (fallbackmx10.syd.optusnet.com.au [211.29.132.251]) by mx1.freebsd.org (Postfix) with ESMTP id 6B6F48FC1F for ; Tue, 13 Jan 2009 16:18:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail05.syd.optusnet.com.au (mail05.syd.optusnet.com.au [211.29.132.186]) by fallbackmx10.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id n0DDYGMc029620 for ; Wed, 14 Jan 2009 00:34:16 +1100 Received: from c122-107-120-227.carlnfd1.nsw.optusnet.com.au (c122-107-120-227.carlnfd1.nsw.optusnet.com.au [122.107.120.227]) by mail05.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id n0DDXVMu027751 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 14 Jan 2009 00:33:33 +1100 Date: Wed, 14 Jan 2009 00:33:31 +1100 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: d@delphij.net In-Reply-To: <496C3A80.5040007@delphij.net> Message-ID: <20090113232658.E31712@delplex.bde.org> References: <4966B5D4.7040709@delphij.net> <86zlhw5zsr.fsf@ds4.des.no> <496C3A80.5040007@delphij.net> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-471949903-1231853611=:31712" Cc: =?UTF-8?B?RGFnLUVybGluZyBTbcO4cmdyYXY=?= , freebsd-arch@freebsd.org Subject: Re: RFC: MI strlen() X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 13 Jan 2009 16:18:49 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --0-471949903-1231853611=:31712 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Mon, 12 Jan 2009, Xin LI wrote: > Dag-Erling Sm=C3=B8rgrav wrote: >> Xin LI writes: >>> Here is a new implementation of strlen() which employed the bitmask >>> skill in order to achieve better performance on modern hardware. >> >> Why bother? Do we have code that uses strlen() heavily in performance- >> critical regions? Does anybody? Of course not. Also, it takes something like -fno-builtin-strlen in CFLAGS for the libc strlen() to ever be used. On both amd64 and i386, gcc's builtin strlen() uses an inline "rep scasb", which can be inferior to even the simple C loop in the simple libc version since i386ish string instructions tend to be inferior (e.g., on AthlonXP, "rep scas*" takes 2.5 cycles per iteration, plus 15 cycles to start, possibly plus extra startup/finishu= p costs to use specific registers for it, while the simple C loop takes 1 cycle per iteration and can have startup costs much smaller than 15+ cycles (but the function call for the libc version costs about 15 cycles). I think most strings are short, so sophisticated methods will usually lose since they take longer to start up and have more branches. However, the loss is usually unnoticable since it doesn't take long to process a non-huge number of short strings. The "optimized" asm i386 strlen() also uses "rep scasb", so it tends to be inferior to the simple C version. The "optimized" asm amd64 strlen() intentionally doesn't exist, since the "rep scasb" version of it is believed to be slower than the simple C version on all amd64 CPUs, and more sophisticated versions would take more work to implement and might end up being no better, and anyway the gcc builtin normally provides the "rep scasb" "optimization, so even more work would be required to actually usually use the sophisticated versions. amd64 still has "optimized" asm versions of strcat(), strcmp() and strcpy(), and actually-optimized asm version of the memcpy() family. For memcpy(), the string instruction actually tends to be faster (1.33 cycles/iteration on AthlonXP, same on A64 IIRC, and faster on Phenom IIRC), though it still wins mainly because it does 64 bits at a time while the simple libc version only does 32 bits at a time (due to its rotted code: typedef int word;=09=09/* "word" used for optimal copy speed */ -- 32-bit words are of course far from optimal on 64-bit machines). Another issue for the memcpy() family is gcc doesn't have a builtin for everything. IIRC, it doesn't have one for memcpy() since handling overlapping copies makes the inline version too large. > I agree that strlen() should never be used in performance-sensitive > regions, but given we do not have assembly optimized versions of these > functions for amd64 and almost everybody else has it, having a C version > that gives similar performance is valuable. Also, worldstone with -j9 > on 2x4core machine with both src/ and obj/ in tmpfs, seems to have > small, but positive effect: > > Unpatched libc: > 1400.97 real 4159.34 user 2901.08 sys > 1396.73 real 4159.06 user 2906.16 sys > 1380.27 real 4158.20 user 2803.22 sys > > Patched libc: > 1363.29 real 4154.89 user 2749.94 sys > 1373.96 real 4150.45 user 2830.46 sys > 1368.62 real 4152.48 user 2838.52 sys > > (with 'make cleanworld' between builds) I don't believe this improvement. First, it is far too large to have been caused by changing libc strlen(). Second, libc strlen() is not normally used. My buildworld times on a 1x2core machine with nfs-mounted src/ and ffs obj/ src are much smaller: 824.96 real 1294.35 user 187.09 sys but this is for FreeBSD-~5.2. FreeBSD and gcc have a combined bloat factor of almost 4 since then :-(, and more cores don't help much, apparently due to them always running into locks (the bloat factor for system time is almost 15, and that is with nfs bloating my sys time by about a factor of 2). Here are some old files for too-simple strlen() benchmarks: %%% #! /bin/sh # This is a shell archive. Remove anything before this line, then unpack # it by saving it into a file and typing "sh file". To overwrite existing # files, type "sh file -c". You can also feed this as standard input via # unshar, or by typing "sh 'z' <<'END_OF_FILE' END_OF_FILE if test 0 -ne `wc -c <'z'`; then echo shar: \"'z'\" unpacked with wrong size! fi # end of 'z' fi if test -f 'z.c' -a "${1}" !=3D "-c" ; then echo shar: Will not clobber existing file \"'z.c'\" else echo shar: Extracting \"'z.c'\" \(418 characters\) sed "s/^X//" >'z.c' <<'END_OF_FILE' X#include X#include X#include X X#ifndef LEN X#define=09LEN=095 X#endif X Xstatic char *foo[1000]; Xstatic volatile size_t s; X X#ifdef MISTRLEN X#include "/usr/src/lib/libc/string/strlen.c" X#endif X Xint Xmain(void) X{ X=09int i; X X=09for (i =3D 0; i < 10; i++) { X=09=09foo[i] =3D malloc(LEN + 1); X=09=09sprintf(foo[i], "%*.*s", LEN, LEN, "foo"); X=09} X=09for (i =3D 0; i < 10000000; i++) X=09=09s =3D strlen(foo[i % 10]); X=09return (0); X} END_OF_FILE if test 418 -ne `wc -c <'z.c'`; then echo shar: \"'z.c'\" unpacked with wrong size! fi # end of 'z.c' fi if test -f 'zi' -a "${1}" !=3D "-c" ; then echo shar: Will not clobber existing file \"'zi'\" else echo shar: Extracting \"'zi'\" \(209 characters\) sed "s/^X//" >'zi' <<'END_OF_FILE' Xfor len in 0 1 10 100 1000 Xdo X=09echo "string length $len:" X=09for alg in -fbuiltin -fno-builtin "-fno-builtin zq.S" X=09do X=09=09echo "algorithm $alg:" X=09=09cc -o z z.c -O2 $alg -g -static -DLEN=3D$len X=09=09time ./z X=09done Xdone END_OF_FILE if test 209 -ne `wc -c <'zi'`; then echo shar: \"'zi'\" unpacked with wrong size! fi # end of 'zi' fi if test -f 'zq.S' -a "${1}" !=3D "-c" ; then echo shar: Will not clobber existing file \"'zq.S'\" else echo shar: Extracting \"'zq.S'\" \(4304 characters\) sed "s/^X//" >'zq.S' <<'END_OF_FILE' X/* X * Written by J.T. Conklin X * Public domain. X */ X X#include X__FBSDID("$FreeBSD$"); X X#if 0 X=09RCSID("$NetBSD: strlen.S,v 1.3 2004/07/19 20:04:41 drochner Exp $") X#endif X XENTRY(strlen) X=09movq=09%rdi,%rax X=09negq=09%rdi X X.Lalign: X=09/* Consider unrolling loop? */ X=09testb=09$7,%al X=09je=09.Lword_aligned X=09cmpb=09$0,(%rax) X=09jne=091f X=09leaq=09(%rdi,%rax),%rax X=09ret X1:=09incq=09%rax X=09jmp=09.Lalign X X=09/* X=09 * There are many well known branch-free sequences which are used X=09 * for determining whether a zero-byte is contained within a word. X=09 * These sequences are generally much more efficent than loading X=09 * and comparing each byte individually. X=09 * X=09 * The expression [1,2]: X=09 * X=09 * (1) ~(((x & 0x7f....7f) + 0x7f....7f) | (x | 0x7f....7f)) X=09 * X=09 * evaluates to a non-zero value if any of the bytes in the X=09 * original word is zero. X=09 * X=09 * It also has the useful property that bytes in the result word X=09 * that coorespond to non-zero bytes in the original word have X=09 * the value 0x00, while bytes cooresponding to zero bytes have X=09 * the value 0x80. This allows calculation of the first (and X=09 * last) occurance of a zero byte within the word (useful for C's X=09 * str* primitives) by counting the number of leading (or X=09 * trailing) zeros and dividing the result by 8. On machines X=09 * without (or with slow) clz() / ctz() instructions, testing X=09 * each byte in the result word for zero is necessary. X=09 * X=09 * This typically takes 4 instructions (5 on machines without X=09 * "not-or") not including those needed to load the constant. X=09 * X=09 * X=09 * The expression: X=09 * X=09 * (2) ((x - 0x01....01) & ~x & 0x80....80) X=09 * X=09 * evaluates to a non-zero value if any of the bytes in the X=09 * original word is zero. X=09 * X=09 * On little endian machines, the first byte in the result word X=09 * that cooresponds to a zero byte in the original byte is 0x80, X=09 * so clz() can be used as above. On big endian machines, and X=09 * little endian machines without (or with a slow) clz() insn, X=09 * testing each byte in the original for zero is necessary X=09 * X=09 * This typically takes 3 instructions (4 on machines without X=09 * "and with complement") not including those needed to load X=09 * constants. X=09 * X=09 * X=09 * The expression: X=09 * X=09 * (3) ((x - 0x01....01) & 0x80....80) X=09 * X=09 * always evaluates to a non-zero value if any of the bytes in X=09 * the original word is zero. However, in rare cases, it also X=09 * evaluates to a non-zero value when none of the bytes in the X=09 * original word is zero. X=09 * X=09 * To account for possible false positives, each byte of the X=09 * original word must be checked when the expression evaluates to X=09 * a non-zero value. However, because it is simpler than those X=09 * presented above, code that uses it will be faster as long as X=09 * the rate of false positives is low. X=09 * X=09 * This is likely, because the the false positive can only occur X=09 * if the most siginificant bit of a byte within the word is set. X=09 * The expression will never fail for typical 7-bit ASCII strings. X=09 * X=09 * This typically takes 2 instructions not including those needed X=09 * to load constants. X=09 * X=09 * X=09 * [1] Henry S. Warren Jr., "Hacker's Delight", Addison-Westley 2003 X=09 * X=09 * [2] International Business Machines, "The PowerPC Compiler Writer's X=09 * Guide", Warthman Associates, 1996 X=09 */ X X=09.align 4 X.Lword_aligned: X=09movabsq=09$0x0101010101010101,%r8 X=09movabsq=09$0x8080808080808080,%r9 X.Lloop: X=09movq=09(%rax),%rdx X=09addq=09$8,%rax X=09subq=09%r8,%rdx X=09testq=09%r9,%rdx X=09je=09.Lloop X X=09/* X=09 * In rare cases, the above loop may exit prematurely. We must X=09 * return to the loop if none of the bytes in the word equal 0. X=09 */ X=09cmpb=09$0,-8(%rax)=09=09/* 1st byte =3D=3D 0? */ X=09je=09.Lsub8 X=09cmpb=09$0,-7(%rax)=09=09/* 2nd byte =3D=3D 0? */ X=09je=09.Lsub7 X=09cmpb=09$0,-6(%rax)=09=09/* 3rd byte =3D=3D 0? */ X=09je=09.Lsub6 X=09cmpb=09$0,-5(%rax)=09=09/* 4th byte =3D=3D 0? */ X=09je=09.Lsub5 X=09cmpb=09$0,-4(%rax)=09=09/* 5th byte =3D=3D 0? */ X=09je=09.Lsub4 X=09cmpb=09$0,-3(%rax)=09=09/* 6th byte =3D=3D 0? */ X=09je=09.Lsub3 X=09cmpb=09$0,-2(%rax)=09=09/* 7th byte =3D=3D 0? */ X=09je=09.Lsub2 X=09cmpb=09$0,-1(%rax)=09=09/* 8th byte =3D=3D 0? */ X=09jne=09.Lloop X X.Lsub1: X=09leaq=09-1(%rdi,%rax),%rax X=09ret X.Lsub2: X=09leaq=09-2(%rdi,%rax),%rax X=09ret X.Lsub3: X=09leaq=09-3(%rdi,%rax),%rax X=09ret X.Lsub4: X=09leaq=09-4(%rdi,%rax),%rax X=09ret X.Lsub5: X=09leaq=09-5(%rdi,%rax),%rax X=09ret X.Lsub6: X=09leaq=09-6(%rdi,%rax),%rax X=09ret X.Lsub7: X=09leaq=09-7(%rdi,%rax),%rax X=09ret X.Lsub8: X=09leaq=09-8(%rdi,%rax),%rax X=09ret END_OF_FILE if test 4304 -ne `wc -c <'zq.S'`; then echo shar: \"'zq.S'\" unpacked with wrong size! fi # end of 'zq.S' fi echo shar: End of shell archive. exit 0 %%% Here - z.c is a simple C program that calls strlen() - zi is a shell script that tries various algrithms (sorry, no makefile) - zq.S is the NetBSD amd64 asm version which uses the 0x8080 trick and not "rep scasb" The algorithms are: - -fbuiltin: try to use builtin strlen() - -fno-builtin: use default libc strlen(), except if MISTRLEN is defined, use MI libc strlen(). Variation to use MISTRLEN is not supported by zi. - -fno-builtin zq.S: use NetBSD amd64 strlen() I forget the results (don't have an amd64 machine handy for re-testing). Bruce --0-471949903-1231853611=:31712--