From: Bruce Evans
Date: Fri, 28 Mar 2003 18:44:06 +1100 (EST)
To: Dag-Erling Smørgrav
cc: David Malone
cc: src-committers@FreeBSD.org
cc: Nate Lawson
cc: cvs-src@FreeBSD.org
cc: Mike Silbersack
cc: cvs-all@FreeBSD.org
Subject: Re: Checksum/copy
Message-ID: <20030328174850.M6165@gamplex.bde.org>
References: <20030327180247.D1825@gamplex.bde.org> <20030327212647.GA64029@walton.maths.tcd.ie>

On Thu, 27 Mar 2003, Dag-Erling Smørgrav wrote:

> David Malone writes:
> > On Thu, Mar 27, 2003 at 09:57:35AM +0100, des@ofug.org wrote:
> > > Might it be a good idea to have separate b{copy,zero} implementations
> > > for special purposes like pmap_{copy,zero}_page?
> >
> > We do have a i686_pagezero already, which seems to be used in
> > pmap_zero_page - I guess it may not be well tuned to modern processors,
> > as it is almost 5 years old.

Indeed.

> i686_pagezero uses 'rep stosl' after an initial 'rep scasl' to check
> if the page was already zero (which is a pessimization unless we zero
> a lot of pages that are already zeroed).  SSE can do far better than
> that.

Even integer instructions can do significantly better than scasl/stosl
on "686"s (PentiumPro and similar CPUs).  Implementation bugs in
i686_pagezero() include:
- scasl is one of the slowest ways to read memory, at least on old
  Celerons (I believe PPro's have similar timing for string operations).
  It is a bit slower than lodsl, which is about 3.5 times slower than a
  lightly unrolled movl loop for the L1-cached case and about 2 times
  slower for the uncached case.
- the code apparently intends to check 16 words at a time, but due to
  getting a comparison backwards it actually zeros everything else as
  soon as it finds a nonzero word, with extra obfuscations and
  pessimizations when it is within 16 words of the end.

Implementation non-bugs include using stosl to do the zeroing.  Unlike
lodsl and scasl, stosl is actually useful for its intended purpose on
"686"s.

Instead of fixing the comparison and any other logic bugs, I rewrote the
function using orl instead of scasl, and simpler logic (ignore the changes
for the previous function in the same hunk).
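[Editorial sketch, not part of the original patch.]  For readers who find
C easier to follow than the assembly in the diff that follows, the
check-then-zero logic of the rewritten routine can be approximated as
below.  The name pagezero_sketch(), the two macros and the returned count
of rewritten chunks are hypothetical and exist only for illustration; the
real function returns nothing.  This assumes a 4096-byte page viewed as
1024 32-bit words, as in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_WORDS  1024u  /* 4096-byte page as 32-bit words, as in the patch */
#define CHUNK_WORDS 8u     /* 32 bytes examined per loop iteration */

static_assert(PAGE_WORDS % CHUNK_WORDS == 0,
    "a page must be a whole number of chunks");

/*
 * Hypothetical C rendering of the rewritten i686_pagezero().
 * Each 32-byte chunk is read and OR-ed together; it is stored to only
 * if some word in it was nonzero.  Returns the number of chunks that
 * were written (an addition for illustration and testing only).
 */
unsigned
pagezero_sketch(uint32_t *page)
{
	unsigned written = 0;

	for (size_t i = 0; i < PAGE_WORDS; i += CHUNK_WORDS) {
		uint32_t *p = page + i;
		/* Corresponds to the movl followed by seven orl instructions. */
		uint32_t acc = p[0] | p[1] | p[2] | p[3] |
		    p[4] | p[5] | p[6] | p[7];

		if (acc != 0) {
			/* Corresponds to the eight "movl $0" stores at label 2. */
			for (size_t j = 0; j < CHUNK_WORDS; j++)
				p[j] = 0;
			written++;
		}
	}
	return written;
}
```

Whether skipping the stores for already-zero chunks is a win depends on
how often pages arrive pre-zeroed, which is the tradeoff discussed in the
quoted text above; the sketch takes no position on that.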

%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s	22 Sep 2002 04:45:20 -0000	1.93
+++ support.s	22 Sep 2002 09:51:27 -0000
@@ -333,70 +337,58 @@
 	movl	%edx,%edi
 	xorl	%eax,%eax
-	shrl	$2,%ecx
 	cld
+	shrl	$2,%ecx
 	rep
 	stosl
 	movl	12(%esp),%ecx
 	andl	$3,%ecx
-	jne	1f
-	popl	%edi
-	ret
-
-1:
+	je	1f
 	rep
 	stosb
+1:
 	popl	%edi
 	ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */
 
+#ifdef I686_CPU
 ENTRY(i686_pagezero)
-	pushl	%edi
-	pushl	%ebx
-
-	movl	12(%esp), %edi
+	movl	4(%esp), %edx
 	movl	$1024, %ecx
-	cld
 
 	ALIGN_TEXT
 1:
-	xorl	%eax, %eax
-	repe
-	scasl
-	jnz	2f
+	movl	(%edx), %eax
+	orl	4(%edx), %eax
+	orl	8(%edx), %eax
+	orl	12(%edx), %eax
+	orl	16(%edx), %eax
+	orl	20(%edx), %eax
+	orl	24(%edx), %eax
+	orl	28(%edx), %eax
+	jne	2f
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b
 
-	popl	%ebx
-	popl	%edi
 	ret
 
 	ALIGN_TEXT
-
 2:
-	incl	%ecx
-	subl	$4, %edi
+	movl	$0, (%edx)
+	movl	$0, 4(%edx)
+	movl	$0, 8(%edx)
+	movl	$0, 12(%edx)
+	movl	$0, 16(%edx)
+	movl	$0, 20(%edx)
+	movl	$0, 24(%edx)
+	movl	$0, 28(%edx)
+
+	addl	$32, %edx
+	subl	$32/4, %ecx
+	jne	1b
 
-	movl	%ecx, %edx
-	cmpl	$16, %ecx
-
-	jge	3f
-
-	movl	%edi, %ebx
-	andl	$0x3f, %ebx
-	shrl	%ebx
-	shrl	%ebx
-	movl	$16, %ecx
-	subl	%ebx, %ecx
-
-3:
-	subl	%ecx, %edx
-	rep
-	stosl
-
-	movl	%edx, %ecx
-	testl	%edx, %edx
-	jnz	1b
-
-	popl	%ebx
-	popl	%edi
 	ret
+#endif /* I686_CPU */
 
 /* fillw(pat, base, cnt) */
%%%

Implementation
notes: using orl might not be best (due to pipelining issues).
Using movl instead of stosl might not be best (I used it to simplify
the logic and reduce initialization overheads).

This hasn't been tested recently.  I've had it disabled in pmap.c for
as long as I can remember, to prepare for complete testing (my pmap.c just
uses bzero()).

The importance of optimizing this function can be gauged by the number of
people who have noticed that it never worked right and the number of commits
to make it work right.

Zeroing pages is not completely unimportant, however.  The pagezero task
takes about 5% of the time for a makeworld here.  Most of this time is
"free" here since pagezero can run while the system is waiting for disks,
and I don't run much else while doing makeworld benchmarks.  However, it
is not free time under different/heavier loads.

Bruce