From owner-freebsd-current Sat Apr 6 12:06:49 1996
Return-Path: owner-current
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3) id MAA12806 for current-outgoing; Sat, 6 Apr 1996 12:06:49 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19]) by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id MAA12801 for ; Sat, 6 Apr 1996 12:06:46 -0800 (PST)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id GAA08258; Sun, 7 Apr 1996 06:01:26 +1000
Date: Sun, 7 Apr 1996 06:01:26 +1000
From: Bruce Evans
Message-Id: <199604062001.GAA08258@godzilla.zeta.org.au>
To: bde@zeta.org.au, tege@matematik.su.se
Subject: Re: optimized bzeros found harmful (was: fast memory copy ...)
Cc: asami@cs.berkeley.edu, current@FreeBSD.org, hasty@rah.star-gate.com, mrami@minerva.cis.yale.edu, nisha@cs.berkeley.edu
Sender: owner-current@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> This behaviour is consistent with the data being zeroed usually not being
> in the L2 cache. RBW is 33% slower in that case on my system. Other
> cases: if the data is in the L2 cache but not in the L1 cache, then RBW
> is between 0% and 33% faster; if the data is in the L1 cache, then
> RBW is 8.5 times faster (740MB/s!).

>This must be a misunderstanding!
>If the data is really in the L1 cache, the read-before-write is wasted and
>just contributes to the overhead.

It must not be in the L1 cache. (Why not?) `perfmon' in -current shows much
more bus activity for write test 3 than for write test 4. E.g., counter
25 (PMC5_WRITE_BACKUP_STALL) is about 117e6 events for test 3 and only 5e6
for test 4. This is for copying a total amount of 100e6 bytes. Let's see
your output for `./w -5' and your explanation of it.

>The read-before-write is effective if and only if the data is not in the L1
>cache. In that case, it forces allocation of the cache line in the L1
>cache, and thereby allows a 14x peak speedup.
>If other behaviours are observed, the timing framework confuses you.

Let's see your output for `./w -l 65536 -5'. 64K should fit in the L2 cache
(512K). Why does read-before-write give only a 25% speedup?

>All other CPUs I know of have caches that do allocate-on-write.

Perhaps the Pentium behaviour is best. It seems to penalize writing to the
same location without reading it, but this is abnormal behaviour.

Bruce
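
For readers following the thread, the read-before-write (RBW) zeroing under discussion can be sketched in C roughly as below. This is only an illustration and not the code of the `w' test program, which is not shown in this message. The name rbw_bzero and the 32-byte LINE_SIZE are assumptions made for the sketch. The idea is to read one byte of each cache line before storing zeros into it, so that a cache which does not allocate lines on a write miss still brings the line in first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed cache line size in bytes; adjust for the CPU being tested. */
#define LINE_SIZE 32

/*
 * Hypothetical helper (not from the thread): zero `len` bytes at `buf`,
 * reading one byte of every cache line before writing to that line.
 */
static void rbw_bzero(void *buf, size_t len)
{
    unsigned char *p = buf;

    assert(buf != NULL || len == 0);

    while (len > 0) {
        /* Bytes remaining in the cache line that p currently points into. */
        size_t n = LINE_SIZE - ((uintptr_t)p % LINE_SIZE);
        if (n > len)
            n = len;

        /* Volatile read so the compiler must emit the load; value is unused. */
        (void)*(volatile unsigned char *)p;

        /* Now zero the part of this line that lies inside the buffer. */
        memset(p, 0, n);

        p += n;
        len -= n;
    }
}
```

Calling rbw_bzero(buf, len) leaves the same bytes in memory as bzero(buf, len) would. Whether it is faster or slower depends on where the data already sits in the cache hierarchy, which is the point being argued above.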