Date:      Fri, 20 Mar 1998 17:39:57 +1100
From:      Bruce Evans <bde@zeta.org.au>
To:        hasty@rah.star-gate.com, reilly@zeta.org.au
Cc:        freebsd-current@FreeBSD.ORG, lamaster@george.arc.nasa.gov
Subject:   Re: Stream_d benchmark... Wow, there really are differences in hardware
Message-ID:  <199803200639.RAA08667@godzilla.zeta.org.au>

>>> > > Box 1 is a SuperMicro P6DNE:
>>> > > Function      Rate (MB/s)   RMS time     Min time     Max time
>>> > > Copy:          60.7395       0.2704       0.2634       0.2832
>>> > > Triad:         71.1647       0.3494       0.3372       0.3565
>>> 
>>> Typical for Natoma with FP DRAM I would guess.
>
>I have to say that these are all really terrible numbers!  Does anyone
>know what the DRAM controller on these motherboards is doing?

Pentium Pros and K6's normally use write allocation.  This reduces
write bandwidth by a factor of 2 and thrashes the L2 cache for no
benefit in the stream benchmark.

60MB/sec for `copy' is still very bad.  Assuming a typical main memory
bandwidth of about 180MB/sec and no other penalties, 60MB/sec would
be used for reading, 60MB/sec for write allocation and 60MB/sec for
writing.  The stream benchmark would report this as 120MB/sec.
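
The arithmetic above can be sketched as a small model.  This is only an
illustration under the same assumptions (a 180MB/sec bus and no other
penalties); the function name and parameters are made up for the example.

```python
def stream_copy_rate(bus_mb_per_sec, write_allocate):
    """Rate the stream copy test would report, in MB/s (1MB = 1000000 bytes).

    Idealized model: the bus is shared evenly between the transfers needed
    per byte copied, and there are no other penalties.
    """
    # One read of the source and one write of the destination, plus one
    # extra read of the destination line when the cache allocates on write.
    transfers_per_byte = 3 if write_allocate else 2
    per_transfer = bus_mb_per_sec / transfers_per_byte
    # stream counts bytes read plus bytes written, so it reports 2x.
    return 2 * per_transfer


print(stream_copy_rate(180, write_allocate=True))   # 120.0
print(stream_copy_rate(180, write_allocate=False))  # 180.0
```

With write allocation the model gives the 120MB/sec quoted above, which
is still twice the 60MB/sec that was measured.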

Ignore half of what I said in other mail about -malign-double being
important.  It is important for P5's but not for CPUs with write
allocation.

I get the following speeds for copying:

P5/133 ASUS Triton 1 non-EDO: stream, misaligned doubles:  85MB/s (1)
P5/133 ASUS Triton 1 non-EDO: stream,    aligned doubles:  78MB/s (2)
P5/133 ASUS Triton 1 non-EDO: kernel bcopy (3):           156MB/s (4)
K6/233 FIC PA2007    SDRAM:   stream, misaligned doubles:  99MB/s
K6/233 FIC PA2007    SDRAM:   stream,    aligned doubles:  98MB/s
K6/233 FIC PA2007    SDRAM:   best bcopy (5):              98MB/s (4) (6)

(1) 1MB is 1000000 bytes.
(2) Yes, aligned copying is slower.  Alignment makes all the other stream
    benchmarks significantly faster.  The slowdown is probably caused by
    the penalty for accessing a cache line that is being loaded.
(3) A slightly optimized version of FreeBSD's kernel bcopy, running in
    user space.  It copies through the FPU in a similar way to the
    stream benchmark but is careful to avoid the P5 cache access and
    memory system penalties.  The "non-EDO" RAM is FastPage IIRC.  It
    has access cycles of x-3-3-3 (read) and x-2-2-2 (write) at 66 MHz.
    4K at a time is first read into the L1 cache at 3-3-3-3; then it is
    written at not quite 3-3-3-3 (2-2-2-2 is not possible on a P5/133,
    since the very slow `fistpq' instruction must be used for writing,
    and it takes 6 cycles (6 * 66 / 133 = 3 bus cycles)).  Speeds of
    more than 160MB/sec have been reported for slightly faster systems
    with EDO RAM.
    On i386's, gcc generates `fstl' for the corresponding part of the
    copy in the stream benchmark.  `fstl' is much faster than `fistpq',
    so it is possible for a stream-like benchmark to saturate the bus
    with this h/w configuration.  Going back and forth like the stream
    copy benchmark actually does is the second worst reasonable way to
    copy on P5's.  The h/w `rep movsl' is the worst :-).
(4) Translated to stream benchmark units (2 * bytes/sec copied).
(5) `rep movsl' is the best.
(6) After multiplying by 3/2 to allow for write allocation, the K6/233
    system is still slightly slower at copying than the P5/133 system,
    although it has SDRAM instead of nondescript RAM.  The SDRAM seems
    to be only 1 bus cycle per burst faster in practice (11 instead of
    12).  K6's apparently have worse cache access penalties than P5.
    The P5 trick of reading ahead doesn't help on K6's.
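
The numbers in notes (3), (4) and (6) can be checked with a rough
calculation.  This is a sketch, not a measurement: it assumes a 66MHz
bus, 32-byte cache lines (4 transfers on a 64-bit bus) and a write burst
of exactly 3-3-3-3, which the note says is not quite reached.  The
function names are invented for the example.

```python
BUS_MHZ = 66       # bus clock used in note (3)
LINE_BYTES = 32    # assumed cache line: 4 transfers of 8 bytes


def burst_cycles(pattern):
    """Total bus cycles for one burst written as e.g. '3-3-3-3'."""
    return sum(int(n) for n in pattern.split("-"))


def copy_rate_stream_units(read_pattern, write_pattern):
    """Upper bound on the copy rate in stream units (2 * MB/s copied)."""
    cycles = burst_cycles(read_pattern) + burst_cycles(write_pattern)
    # bytes per line * million cycles per second / cycles per line = MB/s
    copied_mb_per_sec = LINE_BYTES * BUS_MHZ / cycles
    return 2 * copied_mb_per_sec


# P5/133: read at 3-3-3-3, write assumed to be exactly 3-3-3-3.
print(copy_rate_stream_units("3-3-3-3", "3-3-3-3"))  # 176.0

# fistpq: 6 CPU cycles at 133MHz expressed in 66MHz bus cycles.
print(6 * 66 / 133)  # just under 3

# K6/233: measured 98MB/s scaled by 3/2 for write allocation.
print(98 * 3 / 2)  # 147.0
```

The 176MB/sec bound is consistent with the measured 156MB/sec, and the
scaled K6 figure of 147MB/sec is still below it, as note (6) says.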

>Posit:
>
>A Pentium or Pentium pro memory system is 64 bits wide (8 bytes),
>clocked at 66MHz, or 15ns/cycle.  EDO dram shouldn't have trouble doing
>four cycle bursts as 4-1-1-1, or perhaps 5-1-1-1: say 120ns/cache line
>of 32 bytes.  That's 265M/s in my book.  I assume that the benchmark

Most memory systems are not that fast.  I believe 5-2-2-2 is the best
possible for EDO (except the `5' in it can probably be reduced for
sequential accesses).  SDRAM can do 5-1-1-1, but I haven't seen that.
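
For comparison, the peak rate implied by a burst pattern can be worked
out directly.  This sketch uses the 15ns cycle and 32-byte line from the
quoted text; the helper names are made up, and the result is a ceiling
for back-to-back bursts with no other delays.

```python
CYCLE_NS = 15      # one 66MHz bus cycle, rounded as in the quoted text
LINE_BYTES = 32    # one cache line: 4 transfers on a 64-bit bus


def line_time_ns(pattern):
    """Time to transfer one cache line for a burst like '5-2-2-2'."""
    return CYCLE_NS * sum(int(n) for n in pattern.split("-"))


def peak_mb_per_sec(pattern):
    """Peak rate in MB/s (1MB = 1000000 bytes) for back-to-back bursts."""
    # bytes per nanosecond times 1000 gives MB/s
    return 1000 * LINE_BYTES / line_time_ns(pattern)


for pattern in ("5-1-1-1", "5-2-2-2"):
    print(pattern, line_time_ns(pattern), round(peak_mb_per_sec(pattern)))
```

5-1-1-1 works out to 120ns per line, or about 267MB/sec, which is the
figure in the quoted text.  5-2-2-2 takes 165ns per line, or about
194MB/sec.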

>code for stream is small, sits in the internal cache, and just thrashes
>through long vectors, which should result in back-to-back cache reads
>(and writes?)  Does anyone know where that factor of two is going? 
>Maybe PC's only get EDO to do -2-2-2?

It's a factor of 4 for the system that gets only 60MB/sec for the stream
copy benchmark :-].

>Do any PC chipsets notice sequential address blocks and avoid the
>unnecessary row address cycles?  Seemingly not...

Even Triton 1 does something good for sequential accesses, but the
stream benchmark defeats sequentiality by going back and forth to read
and write.
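
A toy model shows how much the copy order changes the access pattern.
It only counts how often a copy switches between reading and writing
when it works in blocks of cache lines; it ignores cache write-back
buffering and everything else a real memory system does.  The function
name and the 32KB size are arbitrary choices for the example.

```python
def turnarounds(total_lines, block_lines):
    """Count read<->write switches when copying total_lines cache lines
    by reading block_lines lines, writing them, and repeating."""
    ops = []
    done = 0
    while done < total_lines:
        n = min(block_lines, total_lines - done)
        ops.extend("R" * n)   # read a block of source lines
        ops.extend("W" * n)   # then write the same number of lines
        done += n
    # A turnaround is any adjacent pair of operations that differ.
    return sum(1 for a, b in zip(ops, ops[1:]) if a != b)


lines = 1024                    # 32KB of 32-byte lines
print(turnarounds(lines, 1))    # 2047: alternate on every line
print(turnarounds(lines, 128))  # 15: read 4K, then write 4K
```

Copying 4K at a time, as the bcopy described in note (3) does, leaves
long sequential runs in each direction.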

Bruce
