Date: Fri, 19 Feb 1999 01:12:08 -0500 (EST)
From: Alfred Perlstein <bright@cygnus.rush.net>
To: Peter Jeremy <peter.jeremy@auss2.alcatel.com.au>
Cc: hackers@FreeBSD.ORG
Subject: Re: vm_page_zero_fill
Message-ID: <Pine.BSF.3.96.990219010251.10060R-100000@cygnus.rush.net>
In-Reply-To: <99Feb19.123711est.40325@border.alcanet.com.au>

On Fri, 19 Feb 1999, Peter Jeremy wrote:

> Alfred Perlstein <bright@cygnus.rush.net> wrote:
> >After playing with "gcc -O -S bcmp.c" on several platforms, i386,
> >sparc32, alpha. It seems to me that the function ought to be
> >replaced with this:
> [deleted]
>
> The code given is portable, but not optimal for any of these
> architectures - especially the Alpha. The original Alpha chips don't
> have character instructions so character handling is quite poor (and
> gcc2.7.x doesn't include support for the new character instructions).
>
> Optimal code for the Alpha would read 8-byte long-word aligned chunks
> from memory, then appropriately re-align and compare them. (There's
> some discussion about this, though not actual code, in the early Alpha
> white papers).
>
> A similar strategy probably holds for the SPARC (but 4-byte loads
> except on UltraSPARCs). Something similar could be done on the ix86,
> but I'm not certain about the advantages.
>
> This _is_ one area where carefully hand-crafted code is worth the
> effort (especially on the RISC architectures).

Yes, but judging by the 'gcc -O -S' output, my code reduces the number
of branches and other ops on all archs. It's really a non-issue, as the
i386 code already has this hand-done in asm.

I think it's more efficient because gcc is smart enough to use the
index instead of 2 separate pointers.

>
> >it uses the "rep cmpsl" opcode, i have heard that using "movs/lods/cmps"
> >was no longer optimal after the 486 line, but i'm unsure.
> Sort of true. In theory, an explicit loop is faster than "rep cmps".
> Lack of CPU<->RAM bandwidth tends to make this less of an issue unless
> both strings are in L1 cache.

Aren't they both forced into L1 as soon as they are first accessed,
making the rest of the code execute quicker? (at least on i586+)

Next time I'll check if it's already hand-coded asm before I pipe up
about something like this.
:)

-Alfred

>
> Peter
>
> To Unsubscribe: send mail to majordomo@FreeBSD.org
> with "unsubscribe freebsd-hackers" in the body of the message
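
For readers following the thread, the two ideas discussed in this message (walking both buffers with a single index rather than two advancing pointers, and comparing a machine word at a time instead of a byte at a time) can be sketched in portable C. This is only an illustrative sketch under stated assumptions: it is not the code from the original post (which was elided as "[deleted]"), and it is not FreeBSD's bcmp. The names bcmp_indexed and bcmp_wordwise are hypothetical. The word-wise version loads through memcpy so it stays safe on unaligned input, which is a simpler substitute for the aligned-load-and-realign scheme described for the Alpha.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical name. Byte-wise compare using one index shared by
 * both buffers. Returns 0 if the first len bytes match, nonzero
 * otherwise (the usual bcmp convention).
 */
int bcmp_indexed(const void *b1, const void *b2, size_t len)
{
    const unsigned char *p1 = b1;
    const unsigned char *p2 = b2;
    size_t i;

    for (i = 0; i < len; i++)
        if (p1[i] != p2[i])
            return 1;
    return 0;
}

/*
 * Hypothetical name. Compares 8 bytes per iteration, then finishes
 * the remaining 0..7 bytes one at a time. memcpy is used for the
 * loads so that unaligned buffers are handled without undefined
 * behavior; this is an assumption of the sketch, not the
 * aligned-load approach from the thread.
 */
int bcmp_wordwise(const void *b1, const void *b2, size_t len)
{
    const unsigned char *p1 = b1;
    const unsigned char *p2 = b2;
    size_t i = 0;

    for (; i + sizeof(uint64_t) <= len; i += sizeof(uint64_t)) {
        uint64_t w1, w2;

        memcpy(&w1, p1 + i, sizeof w1);
        memcpy(&w2, p2 + i, sizeof w2);
        if (w1 != w2)
            return 1;
    }
    for (; i < len; i++)
        if (p1[i] != p2[i])
            return 1;
    return 0;
}
```

Whether either form beats a hand-written assembly routine depends on the compiler and the CPU, so the generated code is worth inspecting with "gcc -O -S" as was done in the thread.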