Date: Sat, 11 May 1996 01:11:08 +1000
From: Bruce Evans
Message-Id: <199605101511.BAA18981@godzilla.zeta.org.au>
To: asami@cs.berkeley.edu, current@freebsd.org
Subject: Re: some more on fast bcopy
Cc: nisha@cs.berkeley.edu

>Index: support.s
>===================================================================
>RCS file: /usr/cvs/src/sys/i386/i386/support.s,v
>retrieving revision 1.35
>diff -u -r1.35 support.s
>--- support.s   1996/05/03 21:01:00     1.35
>+++ support.s   1996/05/10 09:59:57
>@@ -453,6 +453,16 @@
>       /* bcopy(%esi, %edi, %ebx) */
> 3:
>       movl    %ebx,%ecx
>+#ifdef I586_FAST_BCOPY
>+      cmpl    $128,%ecx
>+      jbe     slow_copyout

128 is too small.  fnsave/frstor moves 108 bytes and takes a minimum of
about 200 cycles on a 586, so the FP method can't be better for sizes of
less than about 256, because its data movement part is less than twice
as fast as movsl.  I'd make it about 4096, since executing the unrolled
FP code depletes the I-cache.

>+      movl    %esi,%eax
>+      andl    $7,%eax /* check if src addr is multiple of 8 */
>+      jnz     fastmove_tail

Checking for alignment is probably a waste of time.  Anyway, write it
as `testl $7,%esi'.

>...
>+      fildq   48(%esi)
>+      fildq   56(%esi)
>+      fxch    %st(7)
>+      fistpq  0(%edi)
>+      fxch    %st(5)
>+      fistpq  8(%edi)

Did you try `fistpq 56(%edi); fistpq 48(%edi); ...' to avoid the fxch's?
The fxch's should pair, but it's simpler without them.

>...
>+      fistpq  240(%edi)
>+      fistpq  248(%edi)
>+      addl    $-256,%ecx
>+      addl    $256,%esi
>+      addl    $256,%edi
>+      cmpl    $255,%ecx
>+      ja      fastmove_loop

>Bruce said we shouldn't try to unroll it too much but it's less than
>500 bytes and there was quite a large drop between 256 and 128 on our
>system so I tried a little aggressively.  (The latest summary is on

I still think it is over-unrolled.  500 bytes is a lot for copying only
4096 bytes (not to mention 128 bytes).  If it isn't in the L1 cache then
it is an overhead of about 500/8192.  I think only the prefetching is
important.

Bruce
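
For readers following the thread, the control flow under discussion can
be sketched in C.  This is only an illustrative model, not the patch:
FAST_THRESHOLD, misaligned8, copy_block256 and model_bcopy are
hypothetical names, and memcpy stands in for both the movsl path and the
unrolled fildq/fistpq loop, so it says nothing about speed.  It assumes
non-overlapping buffers and, following the reply, omits the alignment
branch from the copy path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cutoff: the patch uses 128, the reply suggests about 4096. */
#define FAST_THRESHOLD 4096

/* C equivalent of `testl $7,%esi`: nonzero when p is not 8-byte aligned. */
static int misaligned8(const void *p)
{
    return ((uintptr_t)p & 7) != 0;
}

/* Stand-in for one pass of the unrolled FP loop, which moves 256 bytes. */
static void copy_block256(unsigned char *dst, const unsigned char *src)
{
    memcpy(dst, src, 256);
}

/*
 * Model of the dispatch: bcopy argument order (src, dst, len).
 * Returns how many bytes went through the "fast" loop, for inspection.
 */
static size_t model_bcopy(const void *srcv, void *dstv, size_t n)
{
    const unsigned char *src = srcv;
    unsigned char *dst = dstv;
    size_t fast = 0;

    assert(src != NULL && dst != NULL);

    if (n > FAST_THRESHOLD) {       /* mirrors cmpl $...,%ecx; jbe slow path */
        while (n > 255) {           /* mirrors cmpl $255,%ecx; ja fastmove_loop */
            copy_block256(dst, src);
            src += 256;
            dst += 256;
            n -= 256;
            fast += 256;
        }
    }
    memcpy(dst, src, n);            /* small copy, or the tail after the loop */
    return fast;
}
```

With this cutoff, a 5000-byte copy sends 19 blocks (4864 bytes) through
the block loop and leaves a 136-byte tail, while a copy of exactly 4096
bytes stays entirely on the plain path.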