From owner-freebsd-hackers Sat Dec 23 17:16:09 1995
Return-Path: owner-hackers
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id RAA08356
          for hackers-outgoing; Sat, 23 Dec 1995 17:16:09 -0800 (PST)
Received: from insanus.matematik.su.se (insanus.matematik.su.se [130.237.198.12])
          by freefall.freebsd.org (8.7.3/8.7.3) with ESMTP id RAA08351
          for ; Sat, 23 Dec 1995 17:16:07 -0800 (PST)
Received: from localhost (prudens.matematik.su.se [130.237.198.5])
          by insanus.matematik.su.se (8.7.1/8.6.9) with ESMTP id CAA26645
          for ; Sun, 24 Dec 1995 02:16:00 +0100 (MET)
Message-Id: <199512240116.CAA26645@insanus.matematik.su.se>
X-Address: Department of Mathematics, Stockholm University
           S-106 91 Stockholm
           SWEDEN
X-Phone: int+46 8 162000
X-Fax: int+46 8 6126717
X-Url: http://www.matematik.su.se
To: freebsd-hackers@freebsd.org
Subject: Pentium bcopy
Date: Sun, 24 Dec 1995 02:15:58 +0100
From: Torbjorn Granlund
Sender: owner-hackers@freebsd.org
Precedence: bulk

I sent you patches to improve the support.s bcopy a few months ago.  I
have not heard anything back (sic).  Maybe I should just give up, and use
some other operating system, where bug reports and contributions from
external people are considered?  Well, I won't give up just yet!  ;-)

Now, that is a diplomatic way of starting a message...

This time I want to help improve the bcopy/memcpy/memmove functions for
the Pentium (and 486).  Here is a skeleton bcopy/memcpy that runs about
5 times faster than your current implementation on a Pentium.  This bcopy
handles up to about 350 MB/s on a Pentium 133, compared to the current
70 MB/s.

The reason that this is so much faster is that it uses the dual-ported
cache in a near-optimal way.  Your code seems to rely on rep+movsl, which
is much slower.

Well, I haven't bothered to integrate this into your infrastructure since
that might be a waste of my time, if you just keep ignoring my messages.
If you are interested in this optimization, I volunteer to do the rest of
the work.

Note that bzero can be sped up in the same way.  I have a feeling that
bcopy/bzero are used now and then by the VM system...

/* Pentium bcopy */
        .text
        .align 4
        .globl _copy
_copy:
        pushl   %edi
        pushl   %esi
        movl    12(%esp),%edi   /* destination pointer */
        movl    16(%esp),%esi   /* source pointer */
        movl    20(%esp),%ecx   /* size (in 32-bit words) */

        shrl    $3,%ecx         /* count for unrolled loop */
        jz      Lend            /* if zero, skip unrolled loop */

        movl    (%edi),%eax     /* Fetch destination cache line */

        .align  2,0x90          /* supply 0x90 for broken assemblers */
Loop:   movl    28(%edi),%eax   /* allocate cache line for destination */
        nop                     /* we want these two insn to pair! */

        movl    (%esi),%eax     /* read words pairwise */
        movl    4(%esi),%edx
        movl    %eax,(%edi)     /* store words pairwise */
        movl    %edx,4(%edi)

        movl    8(%esi),%eax
        movl    12(%esi),%edx
        movl    %eax,8(%edi)
        movl    %edx,12(%edi)

        movl    16(%esi),%eax
        movl    20(%esi),%edx
        movl    %eax,16(%edi)
        movl    %edx,20(%edi)

        movl    24(%esi),%eax
        movl    28(%esi),%edx
        movl    %eax,24(%edi)
        movl    %edx,28(%edi)

        addl    $32,%esi        /* update source pointer */
        addl    $32,%edi        /* update destination pointer */
        decl    %ecx            /* decr loop count */
        jnz     Loop

/* Copy last 0-7 words */
Lend:   movl    20(%esp),%ecx
        andl    $7,%ecx
        cld
        rep
        movsl

        popl    %esi
        popl    %edi
        ret
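
For anyone who wants to check the control flow of the skeleton without
assembling it, the following is a minimal C sketch of the same structure:
an 8-word (32-byte) unrolled loop that first reads from the destination
block, then copies words in pairs, followed by a tail loop for the last
0-7 words.  The name copy_words_sketch is hypothetical and not part of the
original message.  As in the assembly, the count is in 32-bit words and
overlapping buffers are not handled.  The C version only models what gets
copied; it says nothing about pairing or speed, since the compiler picks
the instruction schedule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical C model of the assembly skeleton.
 * Copies nwords 32-bit words from src to dst (non-overlapping buffers).
 */
static void copy_words_sketch(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    size_t blocks = nwords >> 3;  /* like shrl $3: number of 8-word blocks */
    size_t tail = nwords & 7;     /* like andl $7: leftover 0-7 words */
    uint32_t a, b;

    assert(nwords == 0 || (dst != NULL && src != NULL));

    while (blocks-- > 0) {
        /* Read the last word of the destination block and discard it,
           mirroring "movl 28(%edi),%eax" in the assembly. */
        (void)*(const volatile uint32_t *)&dst[7];

        /* Load two words, then store two words, four times over. */
        a = src[0]; b = src[1]; dst[0] = a; dst[1] = b;
        a = src[2]; b = src[3]; dst[2] = a; dst[3] = b;
        a = src[4]; b = src[5]; dst[4] = a; dst[5] = b;
        a = src[6]; b = src[7]; dst[6] = a; dst[7] = b;

        src += 8;
        dst += 8;
    }

    /* Stand-in for the "rep movsl" tail copy. */
    while (tail-- > 0)
        *dst++ = *src++;
}
```

A model like this could serve as a reference when testing an integrated
assembly version: run both over the same buffers for a range of word
counts and compare the results.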