Date: Sat, 26 Oct 1996 15:53:20 +1000
From: Bruce Evans <bde@zeta.org.au>
To: asami@freebsd.org, mark@quickweb.com
Cc: current@freebsd.org, ejs@bfd.com, michaelv@MindBender.serv.net, rgrimes@gndrsh.aac.dev.com, scrappy@ki.net, smp@freebsd.org
Subject: Re: Recommendations...
Message-ID: <199610260553.PAA14727@godzilla.zeta.org.au>
>> * What low memory bandwidth on the Natoma???  That thing smokes when compared
>> * to a 430HX chipset.
>>
>> That contradicts our findings.  A P5-133 with Triton or Triton II can
>> move 70-80MB/s (depending on EDO or non-EDO), but I can't get more
>> than 45MB/s out of a P6-200 with Natoma/server (at least that's what
>> Intel told us).
>
>That's odd, here are my speeds on a P6-200 with Natoma (440fx)/server
>board straight from intel:
>
>Function      Rate (MB/s)   RMS time     Min time     Max time
>Copy:          76.1639       0.0633       0.0630       0.0648
>Scale:         75.5894       0.0636       0.0635       0.0638
>Add:           81.3670       0.0886       0.0885       0.0887
>Triad:         80.6036       0.0894       0.0893       0.0896

This is because the 4 Rates reported by the STREAM benchmark are scaled
by factors of 2, 2, 3 and 3, respectively, and Natoma is very slow :-).

On a P5-133 with Triton 1 (ASUS P55TP4XE) with non-EDO RAM (66 MHz memory
clock):

Function      Rate (MB/s)   RMS time     Min time     Max time
Copy:          88.7256       0.1446       0.1443       0.1471
Scale:         80.4207       0.1608       0.1592       0.1624
Add:           89.6191       0.2222       0.2142       0.2318
Triad:         88.3433       0.2232       0.2173       0.2318

This is still slow.  This machine can copy at > 75MB/s throughput or
150 MB/s on the same scale as the STREAM tests.  Getting this throughput
involves prefetching the source bytes a few K at a time and then using FP
operations to store them (and perforce FP operations to load them).  gcc
"optimizes" the Copy benchmark to not use FP at all.  This is why the more
complicated Add and Triad benchmarks can be faster.  I guess the more
complicated benchmarks would be speeded up to only about 120MB/s by the
same method.

The full memory bandwidth of 176MB/sec (on this system) isn't quite
reachable even for copying because the FPU is too slow (fistpq takes 6
cycles, which is more than the minimum memory cycle time and leaves no
time for loop overheads).

Bruce
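
To compare the STREAM figures above with plain copy throughput, each reported rate can be divided by the scale factor mentioned in the message (2, 2, 3 and 3 for Copy, Scale, Add and Triad). The following is a minimal sketch of that arithmetic only; the names `STREAM_SCALE` and `unscaled` are illustrative and not part of STREAM, and the input numbers are simply the two tables quoted above.

```python
# Scale factors stated in the message for the four STREAM kernels.
# (Assumed meaning: the reported rate counts every array touched, so
# Copy/Scale count 2 arrays per iteration and Add/Triad count 3.)
STREAM_SCALE = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def unscaled(rates):
    """Divide each reported STREAM rate (MB/s) by its scale factor."""
    return {name: rate / STREAM_SCALE[name] for name, rate in rates.items()}


# Reported rates copied from the two tables in the message.
natoma_p6_200 = {"Copy": 76.1639, "Scale": 75.5894, "Add": 81.3670, "Triad": 80.6036}
triton_p5_133 = {"Copy": 88.7256, "Scale": 80.4207, "Add": 89.6191, "Triad": 88.3433}

for label, rates in (("P6-200 Natoma", natoma_p6_200),
                     ("P5-133 Triton 1", triton_p5_133)):
    for name, rate in unscaled(rates).items():
        print(f"{label:16s} {name:6s} {rate:6.1f} MB/s")
```

On this scale the P6-200 Copy result comes out at roughly 38 MB/s, which is in line with the "can't get more than 45MB/s" figure quoted earlier in the thread.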