From owner-freebsd-hackers  Tue May 28 01:44:12 1996
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.5/8.7.3) id BAA11330
          for hackers-outgoing; Tue, 28 May 1996 01:44:12 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
          by freefall.freebsd.org (8.7.5/8.7.3) with SMTP id BAA11309
          for <hackers@FreeBSD.ORG>; Tue, 28 May 1996 01:44:02 -0700 (PDT)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id SAA08598; Tue, 28 May 1996 18:36:08 +1000
Date: Tue, 28 May 1996 18:36:08 +1000
From: Bruce Evans <bde@zeta.org.au>
Message-Id: <199605280836.SAA08598@godzilla.zeta.org.au>
To: charnier@lirmm.fr, hackers@FreeBSD.ORG
Subject: Re: strcpy, strcat: not the same look & feel.
Sender: owner-hackers@FreeBSD.ORG
X-Loop: FreeBSD.org
Precedence: bulk

>Which one is faster, the old version or the one with this patch applied?
>Libc uses another one (assembler) but this could at least make libkern
>faster. Or is it even better to use the libc's version? I'm not really sure
>about my results but it seems that the following patch make strcpy 8% faster
>(-O0) 6% faster (-O) and 0% faster (-O2) on my i486 according to gprof.

>...
>-	for (; *to = *from; ++from, ++to);
>+	while (*to++ = *from++);

They are essentially the same, but gcc doesn't recognise this at any
optimization level, and generates slightly different code that happens
to be faster or slower depending on the cpu.  I get quite different
results for one test with a short string (of length 5) on a Pentium:

-O0: 29% faster (16.79s reduced   to 11.96s)
-O1:  5% slower (12.23s increased to 12.85s)
-O2:  9% slower (11.34s increased to 12.40s)
-O3: 13% faster  (2.57s reduced   to  2.27s)

The speed actually depends more on the surrounding code than on the loop.
Essentially the same code is generated for the loop in all cases except
-O0.  -O3 is much faster because the copy function got inlined.  Slightly
different setup code for the other tests gives significantly different
results.  Only the results for the -O0 case are easy to understand.  The
unoptimized code for the while loop happens to be less pessimal on the
i386.
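
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch of the two loop forms with a small timing helper.  It is not the
test program used for the numbers above.  The names strcpy_for,
strcpy_while and time_copy are hypothetical, and the 64-byte buffer is an
assumption that suits short strings only.  The copy is called through a
function pointer, which usually keeps the compiler from inlining it, so
this measures the non-inlined case.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Old form: copy a byte, test it, then advance both pointers. */
char *strcpy_for(char *to, const char *from)
{
    char *save = to;

    for (; (*to = *from) != '\0'; ++from, ++to)
        ;
    return save;
}

/* Patched form: copy and advance in a single expression. */
char *strcpy_while(char *to, const char *from)
{
    char *save = to;

    while ((*to++ = *from++) != '\0')
        ;
    return save;
}

/*
 * Hypothetical helper: CPU seconds used by n calls of copy() on src.
 * Assumes src fits in the local 64-byte buffer.
 */
double time_copy(char *(*copy)(char *, const char *), const char *src,
                 unsigned long n)
{
    char buf[64];
    volatile char sink = 0;   /* keeps the copies from being optimized away */
    clock_t start, end;
    unsigned long i;

    assert(strlen(src) < sizeof buf);
    start = clock();
    for (i = 0; i < n; i++) {
        copy(buf, src);
        sink ^= buf[0];
    }
    end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(strcpy_for, "hello", n) and time_copy(strcpy_while,
"hello", n) with a large n, at each optimization level, gives a pair of
times to compare.  Expect the results to vary with the cpu, the compiler
version and the surrounding code.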

Bruce