From owner-freebsd-hackers Tue Jan 23 06:03:58 1996
Return-Path: owner-hackers
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3)
	id GAA19841 for hackers-outgoing; Tue, 23 Jan 1996 06:03:58 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
	by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id GAA19783
	for ; Tue, 23 Jan 1996 06:03:33 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.9/8.6.9)
	id AAA29712; Wed, 24 Jan 1996 00:43:55 +1100
Date: Wed, 24 Jan 1996 00:43:55 +1100
From: Bruce Evans
Message-Id: <199601231343.AAA29712@godzilla.zeta.org.au>
To: davidg@Root.COM, terry@lambert.org
Subject: Re: stanford benchmark/usenix
Cc: freebsd-hackers@freefall.freebsd.org, hasty@rah.star-gate.com,
	rmallory@wiley.csusb.edu
Sender: owner-hackers@FreeBSD.ORG
Precedence: bulk

>Do you remember Bruce's message regarding reordering the cache line
>loads in the P5 optimized bcopy?  He said:
>
>| On my 486DX2/66 with an unknown writing strategy, copy() is about 20%
>| faster than memcpy() (*) but can be improved another 20% by changing the
>| cache line allocation strategy slightly: replace the load of 28(%edi) by
>| a load of 12(%edi) and add a load of 28(%edi) in the middle of the loop.
>| The pairing stuff and the nops make little difference.  Cache-line
>| alignment of the source and target made little difference.
>|
>| (*) When memcpy() is run a second time, it is as fast as the fastest
>| version of copy()!
>
>I didn't quite follow the reasoning, since it would write the contents
>of 12(%edi) into 28(%edi)?!?
>
>I mailed Bruce about this directly, but haven't seen a response yet...

I didn't see the mail.  The contents of 12(%edi) should be loaded into a
free register and not stored anywhere (cmpl $0,12(%edi) can be used if it
is too inconvenient to have a free register, but it takes longer).  All of
this is very machine-dependent (not just CPU-dependent).
For my 486DX2/66 with an unknown writing strategy, the best way to write
is to write a (16 byte) cache line at a time and to prefetch each cache
line by reading an aligned 32 bit word from it.  For my 486DX/33 with a
write buffer, the prefetch just wastes time.  The (*) has something to do
with data being left in a cache by a previous benchmark.  Prefetching
apparently helps by causing a cache hit for the writes; already-cached
data works in the same way.  Memory benchmarks should do something to set
all caches to a nearly known state before starting.

The best way to read depends on whether the data is in the L1 cache, the
L2 cache, or in main memory.  For 486's:

(1) If it's in the L1 cache, just reading aligned 32 bit words works as
    fast as possible (1 cycle/word).

(2) If it's in the L2 cache but not in the L1 cache, then prefetching
    (16 byte) cache lines (the next one you need, not the current one as
    for writes) works best (stalls for prefetching can be overlapped
    with stalls for reading).

(3) If it's only in main memory, then prefetching makes little difference
    on my systems.  There might be a difference on systems with faster
    main memory or different L2 caches...

Copying requires both reading and writing.  I don't know what the best
combined strategy is.  There are cases where you have a good idea which
cache the data is in.  E.g., after busting the L1 cache by reading >= 8K
of data from an IDE drive, the data is probably in the L2 cache and may
be in the L1 cache, so it would be good to copy it immediately to user
space to satisfy any user reads; bcopy could take lots of flags telling
it where you think the data is or which strategy you think is best.  If
you don't plan to copy the data immediately, then you should have
disabled caching before reading it.

Bruce