Date: Fri, 20 Mar 1998 17:39:57 +1100
From: Bruce Evans <bde@zeta.org.au>
To: hasty@rah.star-gate.com, reilly@zeta.org.au
Cc: freebsd-current@FreeBSD.ORG, lamaster@george.arc.nasa.gov
Subject: Re: Stream_d benchmark... Wow, there really are differences in hardware
Message-ID: <199803200639.RAA08667@godzilla.zeta.org.au>

>>> > > Box 1 is a SuperMicro P6DNE:
>>> > > Function      Rate (MB/s)   RMS time   Min time   Max time
>>> > > Copy:           60.7395      0.2704     0.2634     0.2832
>>> > > Triad:          71.1647      0.3494     0.3372     0.3565
>>>
>>> Typical for Natoma with FP DRAM I would guess.
>
>I have to say that these are all really terrible numbers!  Does anyone
>know what the DRAM controller on these motherboards is doing?

Pentium Pros and K6's normally use write allocation.  This reduces write
bandwidth by a factor of 2 and thrashes the L2 cache to get no benefits in
the stream benchmark.

60MB/sec for `copy' is still very bad.  Assuming a typical main memory
bandwidth of about 180MB/sec and no other penalties, 60MB/sec would be
used for reading, 60MB/sec for write allocation and 60MB/sec for writing.
The stream benchmark would report this as 120MB/sec.

Ignore half of what I said in other mail about -malign-double being
important.  It is important for P5's but not for CPUs with write
allocation.  I get the following speeds for copying:

P5/133 ASUS Triton 1 non-EDO: stream, misaligned doubles:  85MB/s (1)
P5/133 ASUS Triton 1 non-EDO: stream, aligned doubles:     78MB/s (2)
P5/133 ASUS Triton 1 non-EDO: kernel bcopy (3):           156MB/s (4)
K6/233 FIC PA2007 SDRAM:      stream, misaligned doubles:  99MB/s
K6/233 FIC PA2007 SDRAM:      stream, aligned doubles:     98MB/s
K6/233 FIC PA2007 SDRAM:      best bcopy (5):              98MB/s (4) (6)

(1) 1MB is 1000000 bytes.

(2) Yes, aligned copying is slower.  Alignment makes all the other stream
    benchmarks significantly faster.  The slowdown is probably caused by
    the penalty for accessing a cache line that is being loaded.

(3) A slightly optimized version of FreeBSD's kernel bcopy, running in
    user space.  It copies through the FPU in a similar way to the stream
    benchmark but is careful to avoid the P5 cache access and memory
    system penalties.  The "non-EDO" RAM is FastPage IIRC.  It has access
    cycles of x-3-3-3 (read) and x-2-2-2 (write) at 66 MHz.  4K at a time
    is first read into the L1 cache at 3-3-3-3; then it is written at not
    quite 3-3-3-3 (2-2-2-2 is not possible on a P5/133, since the very
    slow `fistpq' instruction must be used for writing, and it takes 6
    cycles (6 * 66 / 133 = 3 bus cycles)).  Speeds of more than 160MB/sec
    have been reported for slightly faster systems with EDO RAM.

    On i386's, gcc generates `fstl' for the corresponding part of the
    copy in the stream benchmark.  `fstl' is much faster than `fistpq',
    so it is possible for a stream-like benchmark to saturate the bus
    with this h/w configuration.  Going back and forth like the stream
    copy benchmark actually does is the second worst reasonable way to
    copy on P5's.  The h/w `rep movsl' is the worst :-).

(4) Translated to stream benchmark units (2 * bytes/sec copied).

(5) `rep movsl' is the best.

(6) After multiplying by 3/2 to allow for write allocation, the K6/233
    system is still slightly slower at copying than the P5/133 system,
    although it has SDRAM instead of nondescript RAM.  The SDRAM seems to
    be only 1 bus cycle per burst faster in practice (11 instead of 12).
    K6's apparently have worse cache access penalties than P5's.  The P5
    trick of reading ahead doesn't help on K6's.

>Posit:
>
>A Pentium or Pentium pro memory system is 64 bits wide (8 bytes),
>clocked at 66MHz, or 15ns/cycle.  EDO dram shouldn't have trouble doing
>four cycle bursts as 4-1-1-1, or perhaps 5-1-1-1: say 120ns/cache line
>of 32 bytes.  That's 265M/s in my book.  I assume that the benchmark

Most memory systems are not that fast.  I believe 5-2-2-2 is the best
possible for EDO (except the `5' in it can probably be reduced for
sequential accesses).  SDRAM can do 5-1-1-1, but I haven't seen that.

>code for stream is small, sits in the internal cache, and just thrashes
>through long vectors, which should result in back-to-back cache reads
>(and writes?)  Does anyone know where that factor of two is going?
>Maybe PC's only get EDO to do -2-2-2?

It's a factor of 4 for the system that gets only 60MB/sec for the stream
copy benchmark :-].

>Do any PC chipsets notice sequential address blocks and avoid the
>unnecessary row address cycles?  Seemingly not...

Even Triton 1 does something good for sequential accesses, but the stream
benchmark defeats sequentiality by going back and forth to read and write.

Bruce

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message
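
The write-allocation accounting in the message (about 180MB/sec of bus
bandwidth split three ways, reported by stream as 120MB/sec, and the 3/2
correction in footnote (6)) can be checked with a small sketch.  The
function names are made up for illustration, and the model assumes the
only extra cost of write allocation is one additional line read per line
written, with no other penalties.

```python
def stream_copy_rate(bus_mb_s, write_allocate):
    """Rate the stream `copy' test would report for a given bus bandwidth.

    Stream counts 2 bytes of traffic per byte copied (one read, one write).
    Assumption: with write allocation the destination line is also read
    before it is written, so the bus moves 3 bytes per byte copied.
    """
    transfers_per_byte = 3 if write_allocate else 2
    bytes_copied = bus_mb_s / transfers_per_byte
    return 2 * bytes_copied


def bus_traffic(stream_mb_s, write_allocate):
    """Inverse view: bus traffic implied by a rate in stream units."""
    bytes_copied = stream_mb_s / 2
    return bytes_copied * (3 if write_allocate else 2)


print(stream_copy_rate(180, write_allocate=True))   # 180MB/sec bus, PPro/K6 style
print(stream_copy_rate(180, write_allocate=False))  # same bus, no write allocation
print(bus_traffic(98, write_allocate=True))         # K6/233 figure scaled by 3/2
```

Under these assumptions the first call gives the 120MB/sec quoted in the
message, and the K6/233's 98MB/s scales to 147MB/s of bus traffic, which
is still below the 156MB/s measured on the P5/133.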
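
The burst timings discussed in the message (5-1-1-1, 5-2-2-2, 3-3-3-3)
translate into peak bandwidth as sketched below.  This assumes a nominal
66 MHz bus, 32-byte cache lines as in the quoted posit, back-to-back
bursts with no gaps between them, and 1MB = 1000000 bytes as in footnote
(1).  The constant and function names are illustrative only.

```python
LINE_BYTES = 32   # cache line size, from the quoted posit
BUS_HZ = 66e6     # nominal 66 MHz memory bus (assumption: exactly 66e6)


def burst_bandwidth(pattern, bus_hz=BUS_HZ, line_bytes=LINE_BYTES):
    """Peak MB/s for back-to-back bursts described as e.g. "5-2-2-2".

    Each number is the bus cycles needed for one 8-byte transfer; the sum
    is the bus cycles needed to move one whole cache line.
    """
    cycles = sum(int(n) for n in pattern.split("-"))
    return line_bytes * bus_hz / cycles / 1e6


for pattern in ("5-1-1-1", "5-2-2-2", "3-3-3-3"):
    print(pattern, round(burst_bandwidth(pattern)), "MB/s")
```

With these assumptions 5-1-1-1 comes out near the 265M/s of the posit,
while the 11-cycle and 12-cycle bursts mentioned in footnote (6) land at
roughly 192 and 176 MB/s, on either side of the "about 180MB/sec" figure
used earlier in the message.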