Date: Sat, 2 Oct 2010 07:01:58 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Dimitry Andric <dim@freebsd.org> Cc: Jilles Tjoelker <jilles@freebsd.org>, svn-src-head@freebsd.org, Roman Divacky <rdivacky@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org Subject: Re: svn commit: r213326 - head/lib/libc/i386/string Message-ID: <20101002053412.C873@delplex.bde.org> In-Reply-To: <4CA5FE07.5010104@FreeBSD.org> References: <201010011310.o91DABUU007534@svn.freebsd.org> <20101001132233.GA83116@freebsd.org> <4CA5FE07.5010104@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Fri, 1 Oct 2010, Dimitry Andric wrote: > On 2010-10-01 15:22, Roman Divacky wrote: >> there's "rep cmps" in bcmp.S and memcmp.S in both amd64/i386 >> >> they both have C counterparts, no idea how fast those are (they >> are going char by char). >> >> does this wisdom apply to those too? > > Well, these versions compare by words first, using "repe cmpsl" (for > i386) and "rep cmpsq" (for amd64). Only if there any bytes remaining > after that, they are checked with "repe cmpsb". I think the speedup is not really because string instructions are very slow, but because the C strlen() is sophisticated. The asm versions compare a byte at a time, while the C version mostly compares sizeof(long) words at a time. Only the "rep lodsb" is very slow, and "rep movs*" is quite fast, at least if it asks to move 8 bytes at a time or the implementation converts to that. However, I double that the sophistication of the C version is an optimization. It has more overhead for short strings, and most strings are short. Many of the asm functions already don't use string instructions, but use ordinary byte access instructions that would be generated for a C program written using the same algorithm, except the compiler would add overhead instructions. > It is probably possible to optimize these versions for the case the > source pointers are not aligned. That will not occur too much, > probably, and complicate the implementation. And would be a negative optimization, or irrelevant, since the source pointers must already be aligned if efficiency is important. > On the other hand, the C implementations of bcmp and memcmp are > basically simple "while (*src++ == *dst++);" implementations, so it is > entirely up to the compiler to do any optimization magic there. It is only when the asm versions are more sophisticated than that when the asm versions are worth having. The ones worth doing in asm for x86 are probably shown by ls -C of src/libc/amd64/string: CVS/ bcopy.S memcpy.S strcat.S Makefile.inc bzero.S memmove.S strcmp.S bcmp.S memcmp.S memset.S strcpy.S Originally, amd64 didn't have any MD implementations of string functions in libc, and no one really noticed since even bcopy is only about twice as slow in C using a naive algorithm and a naive compiler, but eventually the above were added. Analysis: bcmp.S: uses cmpsq. Probably not called enough to matter. bcopy.S: uses movsq. Adequate, though it can be beaten by up to a factor of 2 using SSE2 for large copies on some CPUs bzero.S: like bcopy.S memcmp.S: like bcmp.S. Should be called even less than bcmp(), or at least halve the number of uses of each, sinc bcmp() is the BSD spelling memcpy.S: just wraps bcopy.S memmove.S: just wraps bcopy.S memset.S: uses stosq, and has too much alignment overhead strcat.S: sophisticated, probably to a fault. I think it uses the same algorithm as the C version to find the end of the source and target strings, with large code to manually inline this for each. Probably not called enough to matter strcmp.S: sophisticated again. Uses the same method as the C strlen (0x8080 trick) to find the end(s). The C version doesn't bother with this. strcpy.S: like strcmp.S. Again, the C version is unsophisticated. Note that strlen.S is not present here. This is because the compiler normally inlines strlen() using "repnz scasb", so the sophisticated C version and the i386 asm version are not normally used and the missing amd64 asm version is not normally missed. "Normally" involves compiling with -O. Howver, with -O2 gcc knows that it doesn't understand string instructions, and it generates a call to strlen() instead of "optimizing" it to "repnz scasb"! The "optimization" seems to be independent of the target -- it happens even on amd64 without -O2. gcc understands the string instruction memcpy() better than this -- it used generate "rep movs*", but now either calls memcpy() or does load-store with non-string instructions. ls -C of src/libc/amd64/string shows too many string functions: Makefile.inc index.S memset.S strcpy.S wcschr.S bcmp.S memchr.S rindex.S strlen.S wcscmp.S bcopy.S memcmp.S strcat.S strncmp.S wcslen.S bzero.S memcpy.S strchr.S strrchr.S wmemchr.S ffs.S memmove.S strcmp.S swab.S bcmp.S: like amd64 bcopy.S: like amd64 bzero.S like amd64 ffs.S: another mistake. gcc normally inlines this, and cperciva has a faster version using a lookup table, and I have a faster version using floating point. The last 2 methods may only be faster for 32-bit ffs (they don't scale so well to 64 bits), but only the 32 bit one has an "optimized" asm version index.S: unsophisticated, but avoids string instructions memchr.S: like index.S memcmp.S: like bcmp.S memcpy.S: just wraps bcopy.S memmove.S: just wraps bcopy.S memset.S: link rindex.S: like index.S strcat.S: unsophisticated, but avoids string instructions strchr.S: even more like index.s strcmp.S: unsophisticated, but avoids string instructions strcpy.S: unsophisticated, but avoids string instructions strlen.S: when it existed: unsophisticated, but avoids string instructions strncmp.S: like strcmp.S strrchr.S: like index.S swab.S: by far the best example of how to optimize using string instructions, but only for an 8088. It uses "lodsw; rorw; stosw;". wcschr.S: like strchr.S (?) wcscmp.S: like strcmp.S (?) wcslen.S: uses a simple loop with 8-fold unrolling and no strong instructions. The C version uses a simple loop with no unrolling. Probably a silly optimization -- if the unrolling is good, then put it in the C version and let the compiler generate almost the same code as here, or better. The best unrolling factor may be MD, but I doubt 8 is best for all x86's. The 0x8080 trick doesn't really apply to wide characters, so this function (and wider anything) are inherently slower. wmemchr.S: like memchr.S (?) Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20101002053412.C873>