From owner-freebsd-arch@FreeBSD.ORG Thu Jul 12 10:37:37 2007 Return-Path: X-Original-To: freebsd-arch@freebsd.org Delivered-To: freebsd-arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52]) by hub.freebsd.org (Postfix) with ESMTP id 4230016A46D for ; Thu, 12 Jul 2007 10:37:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au [211.29.132.190]) by mx1.freebsd.org (Postfix) with ESMTP id D55C213C4C6 for ; Thu, 12 Jul 2007 10:37:36 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c220-239-235-248.carlnfd3.nsw.optusnet.com.au (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l6CAbXVQ003905 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 12 Jul 2007 20:37:34 +1000 Date: Thu, 12 Jul 2007 20:37:33 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: "Sean C. Farley" In-Reply-To: <20070711134721.D2385@thor.farley.org> Message-ID: <20070712191616.A4682@delplex.bde.org> References: <20070711134721.D2385@thor.farley.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-arch@freebsd.org Subject: Re: Assembly string functions in i386 libc X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 12 Jul 2007 10:37:37 -0000 On Wed, 11 Jul 2007, Sean C. Farley wrote: > While looking at increasing the speed of strlen(), I noticed that on > i386 platforms (PIII, P4 and Athlon XP) the performance is abysmal in > libc compared to the version I was writing. After more testing, I found > it was only the assembly version that is really slow. The C version is > fairly quick. Is there a need to continue to use the assembly versions > of string functions on i386? Does it mainly help slower systems such as > those with i386 or i486 CPU's? I think you are mistaken about the asm version being slow. In my tests (mainly on P4 and AXP), all versions are just mediocre, with the asm version usually slightly faster. It is hard to handle all cases efficiently, so generic string routines tend to be mediocre (even if they are bloated with special handling for all cases, the bloat tends to make them mediocre, especially for the usual (?) case of short strings). Anyway, the asm version is almost never used on amd64 or i386. gcc's builtin strlen is used at optimization levels >= 1 unless you do something to turn it off. gcc generates essentially the same mediocre code as the i386 asm version (repnz scasb, with not much other overhead except for the usual overhead for string instructions). Perhaps I misremember the asm version being faster. All string instructions except movs and stos are quite slow. On AXP they take: rep movs 24 + 4/3*count rep stos 24 + 1*count rep scas 15 + 5/2*count others are worse I think the C loop for strlen() takes N + 2*count on AXP. On A64 in i386 mode it does take N+2.0*count, while the builtin and the library (asm rep scasb) both take N+2.1*count. The speed of the C version depends on lots of pipelines which AXP and A64 have (not sure about P4). The critical loop is: 1: incl %eax cmpb $0,(%eax) jne 1b The cmpb in this takes 2 cycles, but the loop overhead takes no more due to pipelining. (On AXP and A64, all small loops take 2 cycles. You can add a one or two independent integer instructions in the loop without slowing it down, since up to 6 integer micro-ops can be executed in 2 cycles (2 cycles times 3 integer pipelines), and the above takes 4 or 5. memcmp() is more interesting than strlen() since it takes 5, 6 or 7.) The above loop is far from optimal. It could be unrolled, but this would be just a pessimization on AXP and A64, since the loop overhead is already null. It could use a better algorithm involving comparing 4 bytes at a time on i386 and 8 bytes on amd64. glibc (tege) implemented such algorithms almost 20 years ago. It is unclear if they are actually better, since they have more overhead outside of the loop. I suspect that the C version seems to be much better in your benchmarh because you only tested short strings. For short strings, the setup and finalization overhead (or more perhaps main memory overhead) dominates and both i386 string instructions and sophisticated algorithms lose because theor overheads are higher than for the simple C version. As to slower systems: versions using the string instructions are much better on older systems, because the old systems only have 1 slow pipeline instead of 2 or 3 fast pipelines, and they have slow branches and no branch cache; however they had string instructions only slightly larger (or even smaller) parameters than the above. libc/i386/string also has micro-optimizations for original i386's. libc/amd64/string is better, mainly by being almost empty, but it is still too muuch like libc/i386/string, having micro-optimizations for original i386's. E.g., it does things like using rep movsb in bcopy.S to copy the last 0-7 bytes without checking if the count is nonzero. This may have been a good optimization for original i386's, but it is one that the axp and amd64 optimization manuals say not to do. No one notices because it is only a minor pessimization. The kernel i386 bcopy has somehow always been optimized for AXP's and not for original i386's. > I have the results from my P4 (Id = 0xf24 Stepping = 4) system and the > test program here[1]. strlen.tar.bz2 is the archive of it for anyone's > testing. In the strlen/results subdirectory, there are the results for > strings of increasing lengths. Sorry, I didn't look at this. I just wrote a quick re-test and ran it on an A64 in i386 mode. In a previous test on P4, I tried to benchmark mainly branch overhead because the loop overhead is uninteresting ("all loops take 2 cycles"). I tried to randomize string lengths in a way the would defeat branch caches in the same way as average usage. The results were inconclusive. I don't know what average usage is, but I suspect average usage is to rarely usage strlen() so its efficiency doesn't matter %-). Bruce