From owner-freebsd-hackers  Sun Dec 24 02:06:21 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id CAA17431
          for hackers-outgoing; Sun, 24 Dec 1995 02:06:21 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id CAA17413
          for <freebsd-hackers@freebsd.org>; Sun, 24 Dec 1995 02:06:09 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.9/8.6.9) id VAA25049; Sun, 24 Dec 1995 21:06:13 +1100
Date: Sun, 24 Dec 1995 21:06:13 +1100
From: Bruce Evans <bde@zeta.org.au>
Message-Id: <199512241006.VAA25049@godzilla.zeta.org.au>
To: imb@scgt.oz.au, tege@matematik.su.se
Subject: Re: Pentium bcopy
Cc: freebsd-hackers@freebsd.org
Sender: owner-hackers@freebsd.org
Precedence: bulk

>  > The reason that this is so much faster is that it uses the dual-ported
>  > cache in a near-optimal way.

>  Does this approach demonstrate any significant penalties with less
>  sophisticated cache architectures, for example 386DX or non-pipelined CPUs?

>The approach has a significant penalty on a 386 (3x slower).

>I suspect it might be a tad bit slower on a 486 with a write-through L1
>cache.  But the approach should help on 486 systems with write-back cache.

>I don't have any 486 systems, so I cannot tell for sure.  Here is a simple
>test program that you can use for timing tests:

On my 486DX2/66 with an unknown cache write policy, copy() is about 20%
faster than memcpy() (*), but it can be made another 20% faster by changing
the cache line allocation strategy slightly: replace the load of 28(%edi)
by a load of 12(%edi) and add a load of 28(%edi) in the middle of the loop.
The instruction pairing and the nops make little difference.  Cache-line
alignment of the source and target also made little difference.

(*) When memcpy() is run a second time, it is as fast as the fastest
version of copy()!
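
The copy() routine under discussion is i386 assembly and is not quoted in
this message, so the following C sketch only illustrates the access pattern
of the modified loop.  The name copy1_sketch, the 8-word (32-byte)
unrolling, and the volatile reads are assumptions made for illustration.
The idea is that, on a cache that does not allocate a line on a write miss,
reading a destination word first brings its line into the cache before the
stores.  A 486 has 16-byte cache lines, so loads at offsets 12 and 28 touch
both lines of a 32-byte block, while a single load at offset 28 touches
only the second one.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one destination word and discard it; volatile forces a real load. */
#define TOUCH(p) ((void)*(volatile const uint32_t *)(p))

/*
 * Hypothetical C rendering of the modified loop, not the original code.
 * nwords should be a multiple of 8; any remainder is not copied.
 */
void
copy1_sketch(uint32_t *dst, const uint32_t *src, size_t nwords)
{
  size_t i;

  for (i = 0; i + 8 <= nwords; i += 8) {
    TOUCH(&dst[i + 3]);         /* stands in for a load of 12(%edi) */
    dst[i + 0] = src[i + 0];
    dst[i + 1] = src[i + 1];
    dst[i + 2] = src[i + 2];
    dst[i + 3] = src[i + 3];
    TOUCH(&dst[i + 7]);         /* stands in for a load of 28(%edi) */
    dst[i + 4] = src[i + 4];
    dst[i + 5] = src[i + 5];
    dst[i + 6] = src[i + 6];
    dst[i + 7] = src[i + 7];
  }
}
```

A compiler is free to reorder the plain stores, so this shows the intended
order of accesses rather than guaranteeing it the way hand-written assembly
would.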

On my 486DX/33 with a "write buffer" (which is faster than "write back"
on the same machine), the fancy copies all run at much the same speed.
The speed of memcpy() is independent of the cache state, and memcpy() is
30% faster than the fancy copies.

>unsigned long
>cputime ()
>{
>  struct rusage rus;

>  getrusage (0, &rus);
>  return rus.ru_utime.tv_sec * 1000 + rus.ru_utime.tv_usec / 1000;
                                                            ^^^^^^
>}

Not accurate enough: the integer division discards everything below a
millisecond.  Use weights of 1000000 and 1 (microseconds) instead of 1000
and 1/1000, or use double precision.
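
A minimal sketch of the first suggestion, assuming a POSIX getrusage().
The name cputime_us and the 64-bit return type are additions for
illustration and are not part of the quoted test program.  A 64-bit type
is used because a 32-bit unsigned long counting microseconds would wrap
after roughly 71 minutes of CPU time.

```c
#include <stdint.h>
#include <sys/time.h>
#include <sys/resource.h>

/*
 * User CPU time of the calling process, in microseconds.
 * cputime_us is a hypothetical name; the quoted function is cputime().
 */
uint64_t
cputime_us(void)
{
  struct rusage rus;

  getrusage(RUSAGE_SELF, &rus);
  /* Weights of 1000000 and 1: nothing below a second is thrown away. */
  return (uint64_t)rus.ru_utime.tv_sec * 1000000
       + (uint64_t)rus.ru_utime.tv_usec;
}
```

Calling it before and after the code under test and subtracting gives the
elapsed user time in microseconds.  The resolution actually achieved still
depends on how the kernel accounts CPU time.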

Actual results (smaller is faster):

function            486DX2/66    486DX/33
----------------    ---------    --------
memcpy               11353454     9242061
copy                  9389321    12595028
copy1                 6841713    12888324
copy2                 7055773    12823391
memcpy (2nd run)      6952372     9219855

copy1() is copy() with the above changes.  copy2() is copy1() with
half as much unrolling and only one word copied at a time.

Bruce