From owner-freebsd-current  Fri May 10 08:12:09 1996
Return-Path: owner-current
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id IAA25184
          for current-outgoing; Fri, 10 May 1996 08:12:09 -0700 (PDT)
Received: from godzilla.zeta.org.au (godzilla.zeta.org.au [203.2.228.19])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id IAA25092
          for <current@freebsd.org>; Fri, 10 May 1996 08:11:55 -0700 (PDT)
Received: (from bde@localhost) by godzilla.zeta.org.au (8.6.12/8.6.9) id BAA18981; Sat, 11 May 1996 01:11:08 +1000
Date: Sat, 11 May 1996 01:11:08 +1000
From: Bruce Evans <bde@zeta.org.au>
Message-Id: <199605101511.BAA18981@godzilla.zeta.org.au>
To: asami@cs.berkeley.edu, current@freebsd.org
Subject: Re: some more on fast bcopy
Cc: nisha@cs.berkeley.edu
Sender: owner-current@freebsd.org
X-Loop: FreeBSD.org
Precedence: bulk

>Index: support.s
>===================================================================
>RCS file: /usr/cvs/src/sys/i386/i386/support.s,v
>retrieving revision 1.35
>diff -u -r1.35 support.s
>--- support.s	1996/05/03 21:01:00	1.35
>+++ support.s	1996/05/10 09:59:57
>@@ -453,6 +453,16 @@
> 	/* bcopy(%esi, %edi, %ebx) */
> 3:
> 	movl	%ebx,%ecx
>+#ifdef I586_FAST_BCOPY
>+	cmpl	$128,%ecx
>+	jbe	slow_copyout

128 is too small.  fnsave/frstor moves 108 bytes and takes a minimum of
about 200 cycles on a 586, so the FP method can't be better for sizes of
less than about 256, because its data movement part is less than twice
as fast as movsl.  I'd make it about 4096, since executing the unrolled
FP code depletes the I-cache.
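
The break-even argument above can be put as a small model.  This is only
a sketch: the 200-cycle figure is the fnsave/frstor cost quoted above,
while `break_even_bytes' and the per-byte costs passed to it are
hypothetical placeholders, not measurements.

```c
#include <assert.h>

/*
 * Copy size in bytes at which a method with a fixed setup cost breaks
 * even with a slower method that has none, i.e. the n that solves
 *     overhead_cycles + n * fast_cpb == n * slow_cpb.
 * All parameters are illustrative assumptions (cycles, cycles/byte).
 */
static double
break_even_bytes(double overhead_cycles, double slow_cpb, double fast_cpb)
{
	assert(slow_cpb > fast_cpb);	/* otherwise the fast path never wins */
	return (overhead_cycles / (slow_cpb - fast_cpb));
}
```

With made-up costs of 1 cycle/byte for the slow path and exactly half
that for the fast path, break_even_bytes(200.0, 1.0, 0.5) gives 400.  A
speedup of less than 2x makes the denominator smaller and pushes the
threshold higher, before any I-cache effects are counted.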

>+	movl %esi,%eax
>+	andl $7,%eax	/* check if src addr is multiple of 8 */
>+	jnz fastmove_tail

Checking for alignment is probably a waste of time.  Anyway, write it
as `testl $7,%esi'.
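
The point of testl is that it computes the AND only to set the flags and
discards the result, so no scratch copy in %eax is needed.  A C analogue
of the check, as a sketch (`is_aligned' is a hypothetical helper, not
something from support.s):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p is a multiple of align, which must be a power of two. */
static int
is_aligned(const void *p, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	/* The AND result is only tested, never stored, as with testl. */
	return (((uintptr_t)p & (align - 1)) == 0);
}
```

is_aligned(src, 8) corresponds to the `multiple of 8' test in the patch.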

>...
>+	fildq 48(%esi)
>+	fildq 56(%esi)
>+	fxch %st(7)
>+	fistpq 0(%edi)
>+	fxch %st(5)
>+	fistpq 8(%edi)

Did you try `fistpq 56(%edi); fistpq 48(%edi); ...' to avoid the fxch's?
The fxch's should pair, but it's simpler without them.
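
Why the reversed store order needs no exchanges: the x87 registers form
a stack, so after eight loads the last value loaded is on top, and
popping to descending destination offsets puts each value back where it
belongs.  A sketch in C that models the register stack with an array
(`copy64_lifo' is a hypothetical name; the real code would use
fildq/fistpq):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy one 64-byte block through an 8-entry LIFO stack that stands in
 * for the x87 register stack.
 */
static void
copy64_lifo(const int64_t *src, int64_t *dst)
{
	int64_t stack[8];
	int i, top = 0;

	assert(src != 0 && dst != 0);
	for (i = 0; i < 8; i++)		/* loads: offsets 0, 8, ..., 56 */
		stack[top++] = src[i];
	for (i = 7; i >= 0; i--)	/* stores: offsets 56, 48, ..., 0 */
		dst[i] = stack[--top];
}
```

The model copies the values exactly; on the real FPU the same holds
because the extended format has a 64-bit significand, so a 64-bit
integer survives the load/store round trip unchanged.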

>...
>+	fistpq 240(%edi)
>+	fistpq 248(%edi)
>+	addl $-256,%ecx
>+	addl $256,%esi
>+	addl $256,%edi
>+	cmpl $255,%ecx
>+	ja fastmove_loop

>Bruce said we shouldn't try to unroll it too much but it's less than
>500 bytes and there was quite a large drop between 256 and 128 on our
>system so I tried a little aggressively.  (The latest summary is on

I still think it is over-unrolled.  500 bytes is a lot for copying only
4096 bytes (not to mention 128 bytes).  If the code isn't in the L1 cache,
then fetching it is an overhead of about 500/8192, or roughly 6%.
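
The 500/8192 figure can be worked through for both sizes mentioned.  The
sketch below assumes the 8192 counts the 4096 bytes read plus the 4096
bytes written, which is one reading of the text; `code_fetch_overhead'
is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Ratio of code bytes fetched to data bytes moved for one copy,
 * assuming (not stated in the original) that each copied byte is
 * counted twice: once read, once written.
 */
static double
code_fetch_overhead(size_t code_bytes, size_t copy_bytes)
{
	assert(copy_bytes > 0);
	return ((double)code_bytes / (2.0 * (double)copy_bytes));
}
```

Under that assumption code_fetch_overhead(500, 4096) is about 0.06,
while code_fetch_overhead(500, 128) is about 1.95, i.e. for a 128-byte
copy the code fetch would outweigh the data movement.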

I think only the prefetching is important.

Bruce