Date: Fri, 20 Mar 1998 17:39:57 +1100
From: Bruce Evans <bde@zeta.org.au>
To: hasty@rah.star-gate.com, reilly@zeta.org.au
Cc: freebsd-current@FreeBSD.ORG, lamaster@george.arc.nasa.gov
Subject: Re: Stream_d benchmark... Wow, there really are differences in hardware
Message-ID: <199803200639.RAA08667@godzilla.zeta.org.au>

>>> > > Box 1 is a SuperMicro P6DNE:
>>> > > Function     Rate (MB/s)   RMS time   Min time   Max time
>>> > > Copy:          60.7395      0.2704     0.2634     0.2832
>>> > > Triad:         71.1647      0.3494     0.3372     0.3565
>>>
>>> Typical for Natoma with FP DRAM I would guess.
>
>I have to say that these are all really terrible numbers! Does anyone
>know what the DRAM controller on these motherboards is doing?
Pentium Pros and K6's normally use write allocation. This reduces
write bandwidth by a factor of 2 and thrashes the L2 cache to get no
benefits in the stream benchmark.
60MB/sec for `copy' is still very bad. Assuming a typical main memory
bandwidth of about 180MB/sec and no other penalties, 60MB/sec would
be used for reading, 60MB/sec for write allocation and 60MB/sec for
writing. The stream benchmark would report this as 120MB/sec.
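The arithmetic in the paragraph above can be written out as a small model. This is only a sketch under the stated assumptions (the bus is the sole bottleneck and is shared equally between the transfers); the function name and parameters are made up for illustration and are not part of the stream benchmark.

```python
def stream_copy_rate(bus_mb_s, write_allocate):
    """Rate the stream `copy' test would report for a given bus bandwidth.

    Assumption: the memory bus is the only bottleneck and is shared
    equally between the transfers needed per byte copied.
    """
    # Per byte copied: one read and one write, plus one extra line fill
    # when the CPU allocates a cache line on a write miss.
    transfers = 3 if write_allocate else 2
    per_stream = bus_mb_s / transfers
    # stream counts bytes read plus bytes written; it does not count
    # the write-allocation traffic.
    return 2 * per_stream


print(stream_copy_rate(180, write_allocate=True))   # 120.0
print(stream_copy_rate(180, write_allocate=False))  # 180.0
```

Under this model a reported 60MB/sec corresponds to only about 90MB/sec of bus traffic, which is why the number still looks very bad.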
Ignore half of what I said in other mail about -malign-double being
important. It is important for P5's but not for CPUs with write
allocation.
I get the following speeds for copying:
P5/133 ASUS Triton 1 non-EDO: stream, misaligned doubles: 85MB/s (1)
P5/133 ASUS Triton 1 non-EDO: stream, aligned doubles: 78MB/s (2)
P5/133 ASUS Triton 1 non-EDO: kernel bcopy (3): 156MB/s (4)
K6/233 FIC PA2007 SDRAM: stream, misaligned doubles: 99MB/s
K6/233 FIC PA2007 SDRAM: stream, aligned doubles: 98MB/s
K6/233 FIC PA2007 SDRAM: best bcopy (5): 98MB/s (4) (6)
(1) 1MB is 1000000 bytes.
(2) Yes, aligned copying is slower. Alignment makes all the other stream
benchmarks significantly faster. The slowdown is probably caused by
the penalty for accessing a cache line that is being loaded.
(3) A slightly optimized version of FreeBSD's kernel bcopy, running in
user space. It copies through the FPU in a similar way to the
stream benchmark but is careful to avoid the P5 cache access and
memory system penalties. The "non-EDO" RAM is FastPage IIRC. It
has access cycles of x-3-3-3 (read) and x-2-2-2 (write) at 66 MHz.
4K at a time is first read into the L1 cache at 3-3-3-3; then it is
written at not quite 3-3-3-3. 2-2-2-2 is not possible on a P5/133,
since the very slow `fistpq' instruction must be used for writing,
and it takes 6 cycles (6 * 66 / 133 = 3 bus cycles). Speeds of
more than 160MB/sec have been reported for slightly faster systems
with EDO RAM.
On i386's, gcc generates `fstl' for the corresponding part of the
copy in the stream benchmark. `fstl' is much faster than `fistpq',
so it is possible for a stream-like benchmark to saturate the bus
with this h/w configuration. Going back and forth like the stream
copy benchmark actually does is the second worst reasonable way to
copy on P5's. The h/w `rep movsl' is the worst :-).
(4) Translated to stream benchmark units (2 * bytes/sec copied).
(5) `rep movsl' is the best.
(6) After multiplying by 3/2 to allow for write allocation, the K6/233
system is still slightly slower at copying than the P5/133 system,
although it has SDRAM instead of nondescript RAM. The SDRAM seems
to be only 1 bus cycle per burst faster in practice (11 instead of
12). K6's apparently have worse cache access penalties than P5.
The P5 trick of reading ahead doesn't help on K6's.
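The unit conversions used in notes (3), (4) and (6) can be checked with a few lines. This is a sketch only; the helper names are hypothetical, and the 3/2 factor assumes that write allocation adds exactly one extra transfer for every read/write pair.

```python
def to_stream_units(copied_mb_s):
    # Note (4): stream counts every copied byte twice (read + write).
    return 2 * copied_mb_s


def bus_traffic_with_write_allocate(stream_mb_s):
    # Note (6): 3 bus transfers happen for every 2 that stream counts.
    return stream_mb_s * 3 / 2


def cpu_to_bus_cycles(cpu_cycles, bus_mhz=66, cpu_mhz=133):
    # Note (3): convert CPU clock cycles to bus clock cycles.
    return cpu_cycles * bus_mhz / cpu_mhz


print(to_stream_units(78))                   # 156 (P5/133 kernel bcopy)
print(bus_traffic_with_write_allocate(98))   # 147.0 (K6/233, still < 156)
print(round(cpu_to_bus_cycles(6), 2))        # 2.98, i.e. about 3 bus cycles
```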
>Posit:
>
>A Pentium or Pentium pro memory system is 64 bits wide (8 bytes),
>clocked at 66MHz, or 15ns/cycle. EDO dram shouldn't have trouble doing
>four cycle bursts as 4-1-1-1, or perhaps 5-1-1-1: say 120ns/cache line
>of 32 bytes. That's 265M/s in my book. I assume that the benchmark
Most memory systems are not that fast. I believe 5-2-2-2 is the best
possible for EDO (except the `5' in it can probably be reduced for
sequential accesses). SDRAM can do 5-1-1-1, but I haven't seen that.
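For comparing the burst timings mentioned in this thread, the peak rate for back-to-back bursts can be computed as below. This is a rough sketch: it assumes a 64-bit bus at exactly 66 MHz (the nominal clock is slightly higher), 1MB = 1000000 bytes, and no gaps between bursts. The function name is hypothetical.

```python
def burst_bandwidth_mb_s(timing, bus_mhz=66.0, bus_bytes=8):
    """Peak bandwidth for back-to-back bursts with the given cycle counts.

    timing is e.g. (5, 2, 2, 2): bus cycles for each 8-byte transfer of
    one 32-byte cache line.
    """
    cycles = sum(timing)
    bytes_per_burst = bus_bytes * len(timing)
    # bytes per burst * (bursts per microsecond) gives MB/s.
    return bytes_per_burst * bus_mhz / cycles


print(burst_bandwidth_mb_s((5, 1, 1, 1)))  # 264.0 (8 cycles per line)
print(burst_bandwidth_mb_s((5, 2, 2, 2)))  # 192.0 (11 cycles per line)
print(burst_bandwidth_mb_s((3, 3, 3, 3)))  # 176.0 (12 cycles per line)
```

The 12-cycle case lands close to the "typical" 180MB/sec assumed earlier, and the 11-cycle case corresponds to the "11 instead of 12" bus cycles per burst seen with SDRAM in note (6).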
>code for stream is small, sits in the internal cache, and just thrashes
>through long vectors, which should result in back-to-back cache reads
>(and writes?) Does anyone know where that factor of two is going?
>Maybe PC's only get EDO to do -2-2-2?
It's a factor of 4 for the system that gets only 60MB/sec for the stream
copy benchmark :-].
>Do any PC chipsets notice sequential address blocks and avoid the
>unnecessary row address cycles? Seemingly not...
Even Triton 1 does something good for sequential accesses, but the
stream benchmark defeats sequentiality by going back and forth to read
and write.
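To illustrate what "going back and forth" means, the sketch below counts how often a copy loop switches between its read stream and its write stream, for an element-at-a-time copy and for one that reads a 4K block before writing it (as in note (3)). It models program-level access order only; it says nothing about what the cache and DRAM controller actually do with cache-line fills and write-backs. The sizes and function names are assumptions chosen for illustration.

```python
def interleaved(n):
    """Element-at-a-time copy: read a[i], then write c[i]."""
    for i in range(n):
        yield ("r", i)
        yield ("w", i)


def blocked(n, block):
    """Read a whole block first, then write that block out."""
    for start in range(0, n, block):
        end = min(start + block, n)
        for i in range(start, end):
            yield ("r", i)
        for i in range(start, end):
            yield ("w", i)


def direction_switches(accesses):
    """Count changes between read and write in consecutive accesses."""
    switches = 0
    prev = None
    for kind, _ in accesses:
        if prev is not None and kind != prev:
            switches += 1
        prev = kind
    return switches


n = 1024      # doubles to copy (hypothetical size)
block = 512   # 4K of 8-byte doubles
print(direction_switches(interleaved(n)))     # 2047
print(direction_switches(blocked(n, block)))  # 3
```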
Bruce
