From owner-freebsd-hackers Sun Dec 24 02:06:21 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id CAA17431
          for hackers-outgoing; Sun, 24 Dec 1995 02:06:21 -0800 (PST)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id CAA17413
          for ; Sun, 24 Dec 1995 02:06:09 -0800 (PST)
Received: (from bde@localhost)
          by godzilla.zeta.org.au (8.6.9/8.6.9) id VAA25049;
          Sun, 24 Dec 1995 21:06:13 +1100
Date: Sun, 24 Dec 1995 21:06:13 +1100
From: Bruce Evans
Message-Id: <199512241006.VAA25049@godzilla.zeta.org.au>
To: imb@scgt.oz.au, tege@matematik.su.se
Subject: Re: Pentium bcopy
Cc: freebsd-hackers@freebsd.org
Sender: owner-hackers@freebsd.org
Precedence: bulk

> > The reason that this is so much faster is that it uses the dual-ported
> > cache in a near-optimal way.

> Does this approach demonstrate any significant penalties with less
> sophisticated cache architectures, for example 386DX or non-pipelined?

>The approach has a significant penalty on a 386 (3x slower).
>I suspect it might be a tad bit slower on a 486 with a write-through L1
>cache.  But the approach should help on 486 systems with write-back cache.
>I don't have any 486 systems, so I cannot tell for sure.  Here is a simple
>test program that you can use for timing tests:

On my 486DX2/66 with an unknown cache write strategy, copy() is about 20%
faster than memcpy() (*) but can be improved another 20% by changing the
cache line allocation strategy slightly: replace the load of 28(%edi) by
a load of 12(%edi) and add a load of 28(%edi) in the middle of the loop.
The pairing stuff and the nops make little difference.  Cache-line
alignment of the source and target also made little difference.

(*) When memcpy() is run a second time, it is as fast as the fastest
version of copy()!

On my 486DX/33 with a "write buffer" (which is faster than "write back"
on the same machine), the fancy copies are all much the same speed.  The
speed of memcpy() is independent of the cache state and is 30% faster
than the speed of the fancy copies.

>unsigned long
>cputime ()
>{
>  struct rusage rus;
>  getrusage (0, &rus);
>  return rus.ru_utime.tv_sec * 1000 + rus.ru_utime.tv_usec / 1000;
                                                            ^^^^^^
>}

Not accurate enough: this truncates the result to milliseconds.  Use
weights of 1000000 and 1 instead of 1000 and 1/1000, or double precision.

Actual results:

function   486DX2/66   486DX/33
--------   ---------   --------
memcpy      11353454    9242061
copy         9389321   12595028
copy1        6841713   12888324
copy2        7055773   12823391
memcpy       6952372    9219855

copy1() is copy() with the above changes.  copy2() is copy1() with half
as much unrolling and only one word copied at a time.

Bruce
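
For illustration, here is a minimal sketch of the two cputime() fixes suggested above: weights of 1000000 and 1 for a microsecond result, or double precision.  The names cputime_usec() and cputime_sec() are hypothetical and do not come from the original test program.  With a 32-bit unsigned long the microsecond count wraps after roughly 4294 seconds of CPU time, which is one reason the double-precision variant may be preferable.

```c
#include <sys/time.h>
#include <sys/resource.h>

/*
 * User CPU time of the calling process in microseconds.
 * Weights of 1000000 and 1 keep the full resolution of ru_utime.
 * (Hypothetical name; not part of the original test program.)
 */
unsigned long
cputime_usec(void)
{
	struct rusage rus;

	getrusage(RUSAGE_SELF, &rus);
	return (unsigned long)rus.ru_utime.tv_sec * 1000000UL
	    + (unsigned long)rus.ru_utime.tv_usec;
}

/*
 * Same measurement in seconds as a double, which avoids both the
 * truncation to milliseconds and the integer wrap-around.
 * (Hypothetical name; not part of the original test program.)
 */
double
cputime_sec(void)
{
	struct rusage rus;

	getrusage(RUSAGE_SELF, &rus);
	return (double)rus.ru_utime.tv_sec
	    + (double)rus.ru_utime.tv_usec / 1e6;
}
```

Either function would be called once before and once after the copy loop being timed, with the difference of the two readings taken as the elapsed user time.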