Date:      Wed, 06 Aug 1997 04:32:49 -0500
From:      Tony Overfield <tony@dell.com>
To:        Curt Sampson <cjs@portal.ca>
Cc:        hackers@FreeBSD.ORG
Subject:   Re: Pentium II?
Message-ID:  <3.0.2.32.19970806043249.006df3e4@bugs.us.dell.com>
In-Reply-To: <Pine.NEB.3.93.970803031523.7035A-100000@gnostic.cynic.net>
References:  <3.0.2.32.19970803041915.006a69e4@bugs.us.dell.com>

At 03:17 AM 8/3/97 -0700, Curt Sampson wrote:
>I wasn't interested in what you think as much as which particular
>benchmarks indicate this. Feel free to provide references.

I claimed that a larger L1 cache makes the processor faster, which 
at least partially offsets the effect of the slower L2 cache.  
This is ordinarily a self-evident truth which needs no references.  
I have no desire to search for references at the behest of skeptics.

>> It should be easy to agree that larger L1 caches have higher hit rates.
>
>Sure. But the L2 cache in the PPro is running at the same speed as
>the L1 cache in the PPro and the PII. Thus, I don't think that
>having twice the L1 cache is going to make a lot of difference.
>Feel free to show me the actual benchmarks that prove me wrong.
>
>cjs
>
>Curt Sampson    cjs@portal.ca		Info at http://www.portal.ca/
>Internet Portal Services, Inc.		`And malt does more than Milton can
>Vancouver, BC   (604) 257-9400		 To justify God's ways to man.' 

You're wrong.  The L1 cache in the PPro is faster than its L2 cache.

Since the size of the L1 cache can't be adjusted on PPro processors, 
it's not easy to find a ready-made benchmark that proves that a larger 
L1 cache is beneficial.  One way that this can be shown is to compare 
the Pentium processors to the Pentium with MMX processors.  In 
comparisons between these, the MMX part is invariably faster, due (for 
non-MMX benchmarks) mainly to its larger L1 cache.  However, as 
you said, this only helps if the L2 cache is slower than the L1 
cache.  But *that* can be easily proven.

The performance of a cache depends on more than the clock speed at 
which it runs.  The L1 cache in the PPro and PII is split between an 
instruction cache and a dual-ported data cache.  Thus, the L1 cache can 
transfer up to three sets of data per cycle.  This means the processor 
can simultaneously read code from the code cache, read data from the 
data cache, and write data to the data cache.  

The L2 cache, on the other hand, is a unified instruction and data cache
with a 64-bit data bus.  This L2 cache is much improved over the Pentium 
(P5) architecture because it has a dedicated bus.  The dedicated L2 cache 
bus prevents L2 cache accesses from competing for bandwidth with the 
external CPU data bus, which may be busy with ordinary CPU traffic,
traffic from PCI master cycles and traffic from other processors.  

Even though the built-in L2 cache is very fast, it is not as fast as
the more tightly integrated L1 cache.

Some benchmark data is included below.

First the benchmark pseudocode:
(If you want the DOS x86 assembly source code, ask me.)

loop              (for a variety of sizes)
{
	wbinvd     (write back and invalidate the L1 and L2 caches)
	rep movsd  (move, in place, the test memory; this pass fills the caches)
	rdtsc      (read time stamp counter -> start time)
	rep movsd  (move, in place, the test memory; this is the timed pass)
	rdtsc      (read time stamp counter -> end time)
}

This simple little benchmark shows:

1. The PPro L1 data cache is 8KB.
2. The PII L1 data cache is 16KB.
3. The PII's L2 cache runs at about half the speed of the PPro's L2 cache
   (about 1088 vs. 513 clocks/KB).
4. My PPro's L2 cache is 256KB.
5. My PII's L2 cache is 512KB.
6. DRAM is much slower than the L2 cache (of course).
7. The PPro's L2 cache is a bit more than two times slower than its L1
   cache (about 513 vs. 217 clocks/KB).
8. The PII's L2 cache is about five times slower than its L1 cache
   (about 1088 vs. 207 clocks/KB).
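
The ratios behind points 3, 7 and 8 can be checked with a few lines of 
Python.  This is only a sketch: the four plateau values are read by eye 
from the result tables in this message (the 8KB and 256KB rows for the 
PPro, the 16KB and 512KB rows for the PII), and the names PLATEAUS and 
l2_slowdown are made up for the illustration.

```python
# Plateau values (clocks per KB moved) read from the result tables
# in this message. Which rows count as "the plateau" is a judgment call.
PLATEAUS = {
    "PPro": {"l1": 217, "l2": 513},    # 8KB row and 256KB row
    "PII":  {"l1": 207, "l2": 1088},   # 16KB row and 512KB row
}


def l2_slowdown(cpu):
    """Return how many times more clocks per KB the L2 plateau costs than L1."""
    p = PLATEAUS[cpu]
    return p["l2"] / p["l1"]


for cpu in PLATEAUS:
    print(f"{cpu}: L2 costs {l2_slowdown(cpu):.2f}x the L1 clocks per KB")

# PII L2 compared with PPro L2, both measured in their own CPU clocks.
pii_vs_ppro = PLATEAUS["PII"]["l2"] / PLATEAUS["PPro"]["l2"]
print(f"PII L2 vs PPro L2: {pii_vs_ppro:.2f}x the clocks per KB")
```

The ratios are in CPU clocks, not wall-clock time, so they say nothing 
by themselves about which machine finishes a given copy first.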

The results:

PPro 200/256

Moving    2KB  -  Clocks: 0x0000023A  Clocks/KB moved:   285
Moving    4KB  -  Clocks: 0x000003B9  Clocks/KB moved:   238
Moving    8KB  -  Clocks: 0x000006C9  Clocks/KB moved:   217
Moving   12KB  -  Clocks: 0x0000186E  Clocks/KB moved:   521
Moving   16KB  -  Clocks: 0x0000206E  Clocks/KB moved:   518
Moving   24KB  -  Clocks: 0x0000306E  Clocks/KB moved:   516
Moving   32KB  -  Clocks: 0x0000406E  Clocks/KB moved:   515
Moving   64KB  -  Clocks: 0x0000806E  Clocks/KB moved:   513
Moving  128KB  -  Clocks: 0x0001006E  Clocks/KB moved:   512
Moving  256KB  -  Clocks: 0x00020127  Clocks/KB moved:   513
Moving  384KB  -  Clocks: 0x000DBA60  Clocks/KB moved:  2342
Moving  512KB  -  Clocks: 0x00124D1D  Clocks/KB moved:  2342
Moving  768KB  -  Clocks: 0x001B72BE  Clocks/KB moved:  2342
Moving 1024KB  -  Clocks: 0x00249796  Clocks/KB moved:  2341

PII 233/512

Moving    2KB  -  Clocks: 0x0000023A  Clocks/KB moved:   285
Moving    4KB  -  Clocks: 0x000003BA  Clocks/KB moved:   238
Moving    8KB  -  Clocks: 0x000006BA  Clocks/KB moved:   215
Moving   12KB  -  Clocks: 0x000009BA  Clocks/KB moved:   207
Moving   16KB  -  Clocks: 0x00000CF8  Clocks/KB moved:   207
Moving   24KB  -  Clocks: 0x0000661B  Clocks/KB moved:  1089
Moving   32KB  -  Clocks: 0x0000881E  Clocks/KB moved:  1088
Moving   64KB  -  Clocks: 0x0001101E  Clocks/KB moved:  1088
Moving  128KB  -  Clocks: 0x00022024  Clocks/KB moved:  1088
Moving  256KB  -  Clocks: 0x0004401A  Clocks/KB moved:  1088
Moving  384KB  -  Clocks: 0x00066029  Clocks/KB moved:  1088
Moving  512KB  -  Clocks: 0x000880C6  Clocks/KB moved:  1088
Moving  768KB  -  Clocks: 0x0016A600  Clocks/KB moved:  1932
Moving 1024KB  -  Clocks: 0x0026AC66  Clocks/KB moved:  2475
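
For anyone who wants to check the tables, here is a rough Python sketch 
that recomputes the Clocks/KB column from the hex clock counts above and 
reports the last size before each large jump in cost.  The function 
names and the 1.5x jump threshold are arbitrary choices for this 
illustration, not part of the original benchmark.

```python
# (size in KB, hex clock count) pairs copied from the two tables above.
PPRO = [
    (2, "0x0000023A"), (4, "0x000003B9"), (8, "0x000006C9"),
    (12, "0x0000186E"), (16, "0x0000206E"), (24, "0x0000306E"),
    (32, "0x0000406E"), (64, "0x0000806E"), (128, "0x0001006E"),
    (256, "0x00020127"), (384, "0x000DBA60"), (512, "0x00124D1D"),
    (768, "0x001B72BE"), (1024, "0x00249796"),
]
PII = [
    (2, "0x0000023A"), (4, "0x000003BA"), (8, "0x000006BA"),
    (12, "0x000009BA"), (16, "0x00000CF8"), (24, "0x0000661B"),
    (32, "0x0000881E"), (64, "0x0001101E"), (128, "0x00022024"),
    (256, "0x0004401A"), (384, "0x00066029"), (512, "0x000880C6"),
    (768, "0x0016A600"), (1024, "0x0026AC66"),
]


def clocks_per_kb(rows):
    """Convert hex clock counts to whole clocks per KB (truncated)."""
    return [(kb, int(hex_clocks, 16) // kb) for kb, hex_clocks in rows]


def boundaries(rows, factor=1.5):
    """Return the last size (KB) before each jump of more than `factor`
    in clocks per KB. The default threshold is an arbitrary assumption."""
    per_kb = clocks_per_kb(rows)
    return [
        prev_kb
        for (prev_kb, prev), (_, cur) in zip(per_kb, per_kb[1:])
        if cur > factor * prev
    ]


print("PPro boundaries (KB):", boundaries(PPRO))
print("PII boundaries (KB): ", boundaries(PII))
```

The sizes it reports line up with the L1 data cache and L2 cache sizes 
listed in points 1, 2, 4 and 5.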

Times are measured in actual CPU clock cycles, so these numbers 
don't change much when the CPU clock speed is changed, except 
for those which are affected by DRAM accesses, since DRAM 
speed doesn't scale with CPU speed.
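
To turn a clocks/KB figure into a transfer rate, divide the core clock 
by it.  The sketch below assumes that the "200" and "233" in the table 
headings are nominal core clocks in MHz, and that a KB is 1024 bytes; 
mb_per_second is a made-up helper name.

```python
def mb_per_second(clocks_per_kb, cpu_mhz):
    """Convert clocks per KB moved into MB/s at a given core clock.

    cpu_mhz * 1e6 clocks/s divided by clocks/KB gives KB/s;
    dividing by 1024 gives MB/s.
    """
    kb_per_second = cpu_mhz * 1_000_000 / clocks_per_kb
    return kb_per_second / 1024


# L2 plateau values taken from the tables above; the MHz values are
# assumed nominal clocks, not measured ones.
print(f"PPro L2: {mb_per_second(513, 200):.0f} MB/s")
print(f"PII  L2: {mb_per_second(1088, 233):.0f} MB/s")

# The same clocks/KB at twice the core clock is twice the MB/s, which
# holds only for the cache-bound rows, not the DRAM-bound ones.
print(mb_per_second(513, 400) / mb_per_second(513, 200))
```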

-
Tony
