Date:      Mon, 27 Apr 2015 22:30:24 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-bugs@freebsd.org
Subject:   Re: [Bug 199587] libc strncmp() performance
Message-ID:  <20150427213036.Y1916@besplex.bde.org>
In-Reply-To: <bug-199587-8@https.bugs.freebsd.org/bugzilla/>
References:  <bug-199587-8@https.bugs.freebsd.org/bugzilla/>

On Tue, 21 Apr 2015 bugzilla-noreply@freebsd.org wrote:

> I've been tinkering with the code, out of curiosity, and I've reimplemented
> strncmp() to check the performance. Here are my results and the benchmark code:
>
>         42359800221 cycles -- FreeBSD strncmp()
>         42113090043 cycles -- FreeBSD strncmp()
>         42145558182 cycles -- FreeBSD strncmp()
>         39479522958 cycles -- My Implementation
>         39447595998 cycles -- My Implementation
>         39445708599 cycles -- My Implementation
>
> My implementation always runs a bit faster and seems more clean.

This basically confuses the compiler into producing not-so-good code
in a different way.

Your implementation is a bit cleaner since it doesn't arrange the source
code in a way that it thinks will be good for the object code.  This
results in it being slower for old compilers, faster for some in-between
compilers, and no different for new compilers.  However, all the C versions
are now faster than the asm versions on amd64 and i386 on 2 i7 CPUs.  I
added tests for the latter, and sprinkled some volatiles to stop the
compiler optimizing away the whole loop for the asm (libc) versions.
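
A minimal sketch of the kind of volatile use described above, not the
actual benchmark code from this thread.  The names bench_strncmp and sink
are hypothetical.  Storing each result in a volatile object forces the
compiler to keep one store per iteration, so the loop cannot be deleted
as dead code:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sink: a volatile store is an observable side effect,
 * so the compiler must perform it on every iteration. */
static volatile int sink;

/* Call strncmp() iters times and return the number of calls made.
 * Timing code is omitted; wrap the call to this function with
 * whatever cycle counter or clock the benchmark uses. */
long bench_strncmp(const char *a, const char *b, size_t n, long iters)
{
    long calls = 0;

    for (long i = 0; i < iters; i++) {
        sink = strncmp(a, b, n);    /* result escapes via volatile store */
        calls++;
    }
    return calls;
}
```

This only keeps the stores.  Since the arguments do not change between
iterations, a compiler that knows strncmp() has no side effects may still
compute the result once and reuse it.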

i386, 4790K @ 4.28GHz:
     gcc-3.3.3 -O (but no -march etc. complications):
         10.0 Gcycles  --  libc strncmp() (asm source, external linkage)
         10.1 Gcycles  --  libc strncmp() (copy of the C version)
         11.3 Gcycles  --  My Implementation

     gcc-3.3.3 -O2:
         12.0 Gcycles  --  libc strncmp() (asm source, external linkage)
          9.4 Gcycles  --  libc strncmp() (copy of the C version)
         10.2 Gcycles  --  My Implementation
     libc asm strncmp() really was made 20% slower by increasing the
     optimization level from -O to -O2, although strncmp() itself didn't
     change.  This might be due to the loop being poorly aligned.
     Tuning with -march might be needed to avoid 20% differences, so the
     mere 10% differences in these tests might be noise.  (I didn't bother
     giving many data points, since noise from rerunning the tests is
     much smaller than the 10-20% differences from tuning.)

     gcc-4.2.1 -O:
         11.4 Gcycles  --  libc strncmp() (asm source, external linkage)
         13.1 Gcycles  --  libc strncmp() (copy of the C version)
         12.1 Gcycles  --  My Implementation
     gcc-4.2.1 -O is much slower than gcc-3.3.3, but not so bad for your
     implementation.

     gcc-4.2.1 -O2:
         10.1 Gcycles  --  libc strncmp() (asm source, external linkage)
          9.5 Gcycles  --  libc strncmp() (copy of the C version)
          9.3 Gcycles  --  My Implementation
     gcc-4.2.1 is OK.

amd64, Xeon 5650 @ 2.67GHz:
     clang -O:
     The calls to *strcmp() were almost all optimized away.  I fixed
     this by replacing str1 in the call by str1 + v, where v is a
     volatile int with value 0.
         13.8 Gcycles  --  libc strncmp() (C source, external linkage)
         13.8 Gcycles  --  libc strncmp() (copy of the C version)
         13.8 Gcycles  --  My Implementation
     libc asm strncmp() is of interest here although it doesn't exist --
     if it existed, then it would be more bogus than on i386, since amd64
     doesn't run on the 1990-era CPUs where the asm version was probably
     faster.  The asm i386 version was tuned for original i386's and has
     barely changed since then.  Just as well, since it would be very messy
     with tuning for 10-20 generations of CPUs with several classes of CPU
     per generation.  amd64 libc string functions used to be missing all
     silly optimizations like this, but amd64 libc now optimizes the
     almost-never-used function stpcpy(), and its asm versions of strcat()
     and strcmp() are probably mistakes too.
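
The str1 + v workaround mentioned for the clang -O run on amd64 can be
sketched as follows.  This is an illustration under assumed names
(bench_with_offset, result), not the code that produced the numbers above.
Because v is volatile, it has to be re-read on every iteration, so the
compiler cannot treat str1 + v as loop-invariant and hoist or drop the call:

```c
#include <stddef.h>
#include <string.h>

/* Always 0 at run time, but volatile, so the compiler may not assume it. */
static volatile int v = 0;

/* Hypothetical sink so the comparison result is not discarded. */
static volatile int result;

/* Call strncmp() iters times with an argument the compiler cannot
 * prove constant; return the number of calls made. */
long bench_with_offset(const char *str1, const char *str2, size_t n,
    long iters)
{
    long calls = 0;

    for (long i = 0; i < iters; i++) {
        /* v is loaded afresh here, so the call stays inside the loop */
        result = strncmp(str1 + v, str2, n);
        calls++;
    }
    return calls;
}
```

The extra load and add per iteration are part of the measured time, which
is worth keeping in mind when differences of a few percent are being compared.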

i386, Xeon 5650 @ 2.67GHz:
     clang -O [-march=native makes no difference]
         12.0 Gcycles  --  libc strncmp() (asm source, external linkage)
         15.1 Gcycles  --  libc strncmp() (copy of the C version)
         11.5 Gcycles  --  My Implementation
     clang is even more confused by the copy of libc C strncmp() than
     gcc-4.2.1.

Bruce


