From owner-freebsd-current  Sat Apr  6 12:06:49 1996
Date: Sun, 7 Apr 1996 06:01:26 +1000
From: Bruce Evans <bde@zeta.org.au>
Message-Id: <199604062001.GAA08258@godzilla.zeta.org.au>
To: bde@zeta.org.au, tege@matematik.su.se
Subject: Re: optimized bzeros found harmful (was: fast memory copy ...)
Cc: asami@cs.berkeley.edu, current@FreeBSD.org, hasty@rah.star-gate.com,
        mrami@minerva.cis.yale.edu, nisha@cs.berkeley.edu
Sender: owner-current@FreeBSD.org

>  This behaviour is consistent with the data being zeroed usually not being
>  in the L2 cache.  RBW is 33% slower in that case on my system.  Other
>  cases: if the data is in the L2 cache but not in the L1 cache, then RBW
>  is between 0% and 33% faster; if the data is in the L1 cache, then
>  RBW is 8.5 times faster (740MB/s!).

>This must be a misunderstanding!

>If the data is really in the L1 cache, the read-before-write is wasted and
>just contributes to the overhead.

It must not be in the L1 cache.  (Why not?)  `perfmon' in -current shows
much more bus activity for write test 3 than for write test 4.  E.g.,
counter 25 (PMC5_WRITE_BACKUP_STALL) is about 117e6 events for test 3
and only 5e6 for test 4.  This is for copying a total of 100e6 bytes,
i.e., about 1.17 events per byte for test 3 versus 0.05 for test 4.

Let's see your output for `./w -5' and your explanation of it.

>The read-before-write is effective if and only if the data is not in the L1
>cache.  In that case, it forces allocation of the cache line in the L1
>cache, and thereby allows a 14x peak speedup.
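
For readers following along, here is a minimal sketch of the
read-before-write idea described in the quoted paragraph.  It is not the
code from either poster's test program.  The name rbw_bzero and the
LINE_SIZE constant are made up for illustration, and 32 bytes is an
assumed L1 line size.  The byte-at-a-time store loop is kept for clarity
only; a tuned version would use wider stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 32    /* assumed L1 cache line size; adjust per CPU */

/*
 * Zero len bytes at buf, reading one byte of each cache line before
 * writing to it.  On a cache that does not allocate on write misses,
 * the read brings the line into L1 so the following stores hit it.
 */
void rbw_bzero(void *buf, size_t len)
{
    volatile unsigned char *p = buf;  /* volatile: keep the dummy read */
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        /* Index just past the end of the cache line containing p[i]. */
        uintptr_t addr = (uintptr_t)(const volatile void *)(p + i);
        size_t line_end = i + LINE_SIZE - (size_t)(addr % LINE_SIZE);

        if (line_end > len)
            line_end = len;
        (void)p[i];                   /* read-before-write: touch line */
        for (; i < line_end; i++)
            p[i] = 0;
    }
}
```

If the line is already in the L1 cache, the extra read in this sketch
buys nothing and is pure overhead, which is the point under dispute.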

>If other behaviours are observed, the timing framework confuses you.

Let's see your output for `./w -l 65536 -5'.  64K should fit in the L2
cache (512K).  Why does read-before-write give only a 25% speedup?
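
A rough sketch of the kind of measurement being asked for, assuming
nothing about how `w' itself works: zero a buffer of a chosen
working-set size repeatedly and report MB/s, with and without the dummy
read.  The function names, the 32-byte line size and the use of clock()
are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* One zeroing pass; when rbw is nonzero, read each 32-byte line first. */
static void zero_pass(volatile unsigned char *p, size_t len, int rbw)
{
    for (size_t i = 0; i < len; i++) {
        if (rbw && i % 32 == 0)     /* 32 = assumed cache line size */
            (void)p[i];
        p[i] = 0;
    }
}

/*
 * Zero a len-byte buffer until about total bytes have been written.
 * Returns MB/s (1e6 bytes), 0.0 if the run was too short for clock()
 * to resolve, or -1.0 if the buffer could not be allocated.
 */
double zero_bandwidth(size_t len, size_t total, int rbw)
{
    size_t passes;
    unsigned char *buf;
    clock_t start, end;
    double secs;

    assert(len > 0 && total >= len);
    passes = total / len;
    buf = malloc(len);
    if (buf == NULL)
        return -1.0;
    memset(buf, 1, len);            /* fault the pages in before timing */

    start = clock();
    for (size_t n = 0; n < passes; n++)
        zero_pass(buf, len, rbw);
    end = clock();

    free(buf);
    secs = (double)(end - start) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (double)passes * (double)len / secs / 1e6 : 0.0;
}
```

Comparing zero_bandwidth(65536, 100000000, 0) with
zero_bandwidth(65536, 100000000, 1) would give the L2-resident case;
a len much larger than the L2 size would give the uncached case.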

>All other CPUs I know of have caches that do allocate-on-write.

Perhaps the Pentium behaviour is best.  It seems to penalize writing to
the same location without reading it, but that access pattern is abnormal.

Bruce