Date:      Fri, 31 Aug 2012 15:02:39 GMT
From:      Bruce Evans <bde@FreeBSD.org>
To:        glebius@freebsd.org, scottl@freebsd.org
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r239940 - head/sys/dev/ixgbe
Message-ID:  <201208311502.q7VF2dEv098318@ref10-i386.freebsd.org>
In-Reply-To: <20120831101100.GL90597@FreeBSD.org>

glebius wrote:

> On Fri, Aug 31, 2012 at 10:07:38AM +0000, Scott Long wrote:
> S> +/*
> S> + * Optimized bcopy thanks to Luigi Rizzo's investigative work.  Assumes
> S> + * non-overlapping regions and 32-byte padding on both src and dst.
> S> + */
> S> +static __inline int
> S> +ixgbe_bcopy(void *_src, void *_dst, int l)
> S> +{
> S> +	uint64_t *src = _src;
> S> +	uint64_t *dst = _dst;
> S> +
> S> +	for (; l > 0; l -= 32) {
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +	}
> S> +	return (0);
> S> +}
> S> +
> 
>   Shouldn't this go to libkern?

It's bogus, so it belongs in /dev/null.

Bogusness starts with its name and its parameter names (its semantics
are closer to memcpy() than bcopy(), but its arg order is that of
bcopy()...).
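
For comparison, the standard prototypes (a reference sketch, not taken
from the commit):

% /* Standard interfaces, for comparison: */
% void	*memcpy(void *dst, const void *src, size_t len);	/* dst first */
% void	 bcopy(const void *src, void *dst, size_t len);		/* src first */

ixgbe_bcopy() takes src first like bcopy(), assumes non-overlapping
regions like memcpy(), and takes an int length like neither.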

In a quick test, it is slightly slower than builtin memcpy when l = 64
on i386, and when l = 128 on amd64:

% #include <stdint.h>
% 
% static __inline int
% ixgbe_bcopy(void *_src, void *_dst, int l)
% {
% 	uint64_t *src = _src;
% 	uint64_t *dst = _dst;
% 
% 	for (; l > 0; l -= 32) {
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 	}
% 	return (0);
% }
% 
% int dst[512];
% int src[512];
% 
% int
% main(void)
% {
% 	int i;
% 
% 	for (i = 0; i < 100000000; i++)
% #if 0
% 		__builtin_memcpy(dst, src, 64);
% #else
% 		ixgbe_bcopy(src, dst, 64);
% #endif
% }

Builtin memcpy generates lots of unrolling, with up to 64 bytes copied
in the builtin's inner loop on i386, and up to 128 on amd64.  This is
excessive, but is what makes the builtin slightly faster than the
hand-rolled version in bogus micro-benchmarks like this.
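
Roughly, for constant l = 64 the builtin expands to straight-line
8-byte moves, something like this hand-written equivalent (a sketch;
the actual generated code depends on -march):

% #include <stdint.h>
% 
% /* What __builtin_memcpy(dst, src, 64) amounts to on amd64:
%  * straight-line 8-byte loads and stores, no loop at all. */
% static void
% copy64(uint64_t *dst, const uint64_t *src)
% {
% 	dst[0] = src[0];
% 	dst[1] = src[1];
% 	dst[2] = src[2];
% 	dst[3] = src[3];
% 	dst[4] = src[4];
% 	dst[5] = src[5];
% 	dst[6] = src[6];
% 	dst[7] = src[7];
% }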

Times on FreeBSD cluster machines (core2):
       amd64 l = 128: builtin 1.00 seconds, hand-rolled 1.16 seconds
       i386  l = 64:  builtin 0.99 seconds, hand-rolled 1.43 seconds

Above 64 or 128 bytes, the builtin switches to calling memcpy().  Now
the hand-rolled version is faster.  This is essentially accidental.
gcc knows that it doesn't understand copying memory and switches to
the extern memcpy() above a certain threshold for the size, in the
hope that the extern memcpy understands.  The threshold is hard-coded
but depends on -march.  The hand-rolled version doesn't know that it 
doesn't understand copying memory, and uses a hard-coded magic number
of 32 related to the threshold.  The thresholds of 64/128 are possibly
a little too small on the test hardware given the unsmartness of the
extern memcpy(), but it is hard to do better without tuning for the
CPU, its memory system, and other things.  Perhaps ixgbe could know
something about the context, but libkern can't.
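
The policy amounts to roughly this (a sketch; INLINE_COPY_MAX and
copy_dispatch() are made-up names standing in for gcc's internal,
-march-dependent machinery):

% #include <stddef.h>
% #include <string.h>
% 
% #define	INLINE_COPY_MAX	128	/* stand-in for gcc's threshold */
% 
% static inline void *
% copy_dispatch(void *dst, const void *src, size_t n)
% {
% 	unsigned char *d = dst;
% 	const unsigned char *s = src;
% 
% 	if (n <= INLINE_COPY_MAX) {
% 		/* Small: expand inline (gcc emits unrolled moves). */
% 		while (n-- > 0)
% 			*d++ = *s++;
% 		return (dst);
% 	}
% 	/* Large: punt to the extern memcpy() and hope it is smart. */
% 	return (memcpy(dst, src, n));
% }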

Times:
       amd64 l = 256: builtin 4.09 seconds, hand-rolled 2.28 seconds
       i386  l = 128: builtin 4.75 seconds, hand-rolled 2.67 seconds

These times show some interesting unrelated pessimizations:
- i386 is about twice as slow as amd64, on the same hardware.  It
  is handicapped by only having 32-bit integer registers.  With SSE,
  there would be little difference, but the setup overhead is probably
  too large to use SSE for such small copies, even in userland.
- amd64 goes from being about twice as fast as i386 with builtin
  memcpy below the threshold to not much faster above the threshold.
  This is surprising, since the extern memcpy uses "rep movsq" on
  amd64, and in other benchmarks on other (amd64) machines, using
  "rep movsq" instead of "rep movsl"" retains the 2-flow speed
  advantage (because "rep movsq" goes at cache speed, but "rep
  movsl" can't keep up.  String instructions tend to be slower` but
  are often fast enough.  I recently read a claim that movs* is
  the fastest method on SandyBridge but not fastest on any older
  CPU.
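
For concreteness, a "rep movsq" copy in gcc inline asm looks about
like this (a sketch assuming amd64, a length that is a multiple of 8,
and non-overlapping regions; not the kernel's actual assembler):

% #include <stddef.h>
% 
% static inline void
% copy_movsq(void *dst, const void *src, size_t len)
% {
% 	long d0, d1, d2;
% 
% 	/* rcx = count of 8-byte words, rdi = dst, rsi = src. */
% 	__asm __volatile("rep movsq"
% 	    : "=&c" (d0), "=&D" (d1), "=&S" (d2)
% 	    : "0" (len / 8), "1" (dst), "2" (src)
% 	    : "memory");
% }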

Anyway, the speed of copying from the L1 cache is unimportant.  If
anyone cared about it, then they would have noticed when automatic
use of builtin memcpy was broken for kernels ~20 years ago by
turning off all builtins.  BTW, NOTES still hasn't caught up with
this change.  It still has -fno-builtin in makeoptions.  I put this
there to test the inversion of the default -fbuiltin.  But the
default was reversed without changing this (-fno-builtin is not
explicit, but is implied by -ffreestanding).

Bruce


