Date: Sat, 6 Apr 1996 06:55:42 +1000
From: Bruce Evans <bde@zeta.org.au>
To: asami@cs.berkeley.edu, current@FreeBSD.org
Cc: hasty@rah.star-gate.com, nisha@cs.berkeley.edu, tege@matematik.su.se
Subject: Re: fast memory copy for large data sizes
Message-ID: <199604052055.GAA23015@godzilla.zeta.org.au>

>We've put together a fast memory copy that uses floating point
>registers to speed up large transfers.  The original idea was taken

Oops.  I put together 5 fast memory copies that don't use floating point
registers.  Speeds range from 40K/sec to 340K/sec. on a 133MHz Pentium
(ASUS), Triton chipset, 512KB PB cache, 60ns non-EDO main memory.  This
is after attempting to minimize the differences caused by the cache
state.  Details in other mail.

The speed differences are so large and the cache state is so variable
that it is easy to create benchmarks showing that all methods are the
best :-).  We seem to have fooled ourselves with the optimized kernel
bzeros already.  On the above i586 system, the i586-optimized bzero is
the slowest for compiling the kernel; for fork-exec of small processes
it is significantly the slowest.

>from Amancio Hasty's old post to use floating point registers to move
>8 bytes at a time.  (We tried using integer registers too but with our
>wits we could only get 10MB/s less than the FP case.)

This seemed like a bad idea.  I added a test using it (just 8 fldl's
followed by 8 fstpl's, storing in reverse order - this works for at
least all-zero data) and got good results, but I still think it is a
bad idea.  Perhaps it can be duplicated by copying via integer registers
through the L1 cache.
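A minimal C sketch of the "8 loads, then 8 stores" block structure described above, using 64-bit integer temporaries instead of the FPU stack. The name `copy_unrolled64` is hypothetical and this is not the code that was benchmarked in this thread. It assumes the buffers do not overlap, and on a 32-bit x86 target each `uint64_t` would itself be moved as two 32-bit halves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the original mail: copy len bytes in
 * 64-byte blocks, doing 8 eight-byte loads and then 8 eight-byte stores
 * per block.  dst and src must not overlap. */
void copy_unrolled64(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t r[8];
    int i;

    assert(dst != NULL && src != NULL);

    while (len >= sizeof r) {
        /* Load phase: fetch the whole block into temporaries first. */
        for (i = 0; i < 8; i++)
            memcpy(&r[i], s + 8 * i, 8);
        /* Store phase: only now write the block out.  An x87 version
         * would pop its register stack here, hence the reverse store
         * order mentioned above; integer temporaries can go in order. */
        for (i = 0; i < 8; i++)
            memcpy(d + 8 * i, &r[i], 8);
        s += sizeof r;
        d += sizeof r;
        len -= sizeof r;
    }

    /* Tail of fewer than 64 bytes: fall back to a plain copy. */
    if (len > 0)
        memcpy(d, s, len);
}
```

Because the temporaries are integers, every bit pattern is copied unchanged, so the trap and rounding questions raised for fld/fst do not arise in this variant. Whether it comes close to the FP version's speed would have to be measured.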
>133MHz Pentium (sunrise), Triton chipset, 512KB (pipeline burst) cache:

                              new columns  vvvvvvvvv  vvvvvvvvvvvvvv  vvvvvvvv
>    size  libc            ours            mine-libc  mine-best(int)   mine-fp
>      32  N/A             30.517578 MB/s   51493147        98069887
>      64  61.035156 MB/s  30.517578 MB/s   65049070       196997754
>     128  40.690104 MB/s  40.690104 MB/s   74971005       254666769
>     256  40.690104 MB/s  40.690104 MB/s   80998485       327390112
>     512  40.690104 MB/s  48.828125 MB/s   84416182       376524453
>    1024  40.690104 MB/s  51.398026 MB/s   85214370       379715593
>    2048  39.859694 MB/s  51.398026 MB/s   86936111       350385424
>    4096  39.859694 MB/s  52.083333 MB/s   87266431       326943762
>    8192  39.457071 MB/s  52.787162 MB/s   84805486        97567163
>   16384  39.556962 MB/s  52.966102 MB/s   65103489        97472157
>   32768  39.506953 MB/s  53.146259 MB/s   66593990        99217964  93604474
>   65536  39.457071 MB/s  53.282182 MB/s   61407673        79866591  93721503
>  131072  39.457071 MB/s  53.327645 MB/s   65457449        68011573  79960595
>  262144  39.345294 MB/s  53.350405 MB/s   51273532        53702491  75576993
>  524288  39.044198 MB/s  53.430220 MB/s   49370136        50029142  67400433
> 1048576  38.086533 MB/s  53.447354 MB/s   44054746        44095308  58624791
> 2097152  37.706680 MB/s  53.387433 MB/s   42742240        42770154  56946700
> 4194304  37.628643 MB/s  53.280763 MB/s   43381238        43381238  57727588

>As you can see, from a certain size and onwards, it is much faster
>than the libc version.  ("size" is in bytes.)

>The program allocates two 4MB buffers and calls libc's bcopy (which is
>essentially a string move using rep/movsl; see below for more on this)

My tests are obviously not equivalent for small copies - the libc times
are about twice as high.  This is because I keep copying the same data.
I want to do this to test in-cache copies.  Not-in-cache copies get
tested as a side effect when the buffer is much larger than the cache
(L1 or L2).

Your test gives similar times on my system.  It tests the speed of
copying data that isn't in the cache.  This seems to be the usual case
for kernel bzeros - that's why the i586 optimizations are pessimizations.
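The "keep copying the same data" method can be sketched roughly as follows. This is an illustration only, not the test program from either mail: `copy_rate` is a hypothetical name, `memcpy` stands in for whichever copy routine is under test, and `clock()` is assumed to be fine-grained enough for the chosen total.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not from the original mail: copy the same
 * size-byte buffer repeatedly until about total bytes have moved and
 * return the rate in bytes per second of CPU time.  Returns 0.0 if the
 * run was too short for clock() to measure, and -1.0 on allocation
 * failure or if the destination does not match the source. */
double copy_rate(size_t size, size_t total)
{
    /* Calling through a volatile pointer keeps the compiler from
     * dropping the repeated, identical copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    unsigned char *src, *dst;
    size_t reps, i;
    clock_t start, ticks;
    int ok;

    assert(size > 0);
    src = malloc(size);
    dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers so page faults are not part of the timing. */
    memset(src, 0xA5, size);
    memset(dst, 0, size);

    reps = total / size;
    if (reps == 0)
        reps = 1;

    start = clock();
    for (i = 0; i < reps; i++)
        copy(dst, src, size);
    ticks = clock() - start;

    ok = memcmp(dst, src, size) == 0;
    free(src);
    free(dst);
    if (!ok)
        return -1.0;
    if (ticks <= 0)
        return 0.0;
    return (double)reps * (double)size / ((double)ticks / CLOCKS_PER_SEC);
}
```

Calling it with a size well below the L1 cache size measures in-cache copies. Calling it with a size much larger than the L2 cache (512KB on the machines above) means each pass evicts the data the next pass needs, so the same loop measures not-in-cache copies.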
>operation.  (You can't use fld and fst because they will trap on
>illegal (as a floating point number) bit patterns -- by the way, the

Only if traps are enabled.  Rounding may be a problem.

>Pentium FP regs are 80 bits with a 64-bit mantissa so there's no loss
>of data by using the integer load/store.)

Using 64-bit precision may be enough to avoid rounding problems.  fldl
is much faster than fildl if the data is in the cache.

>Please type "make" and it will compile & run the tests.  The output

It didn't :-).  It assumes that "." is in the $PATH.

Bruce