Date: Mon, 31 Mar 2003 16:31:36 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Peter Jeremy
cc: src-committers@FreeBSD.org, Nate Lawson, Greg 'groggy' Lehey,
    cvs-src@FreeBSD.org, Mike Silbersack, cvs-all@FreeBSD.org
Subject: Re: Checksum/copy (was: Re: cvs commit: src/sys/netinet ip_output.c)
Message-ID: <20030331151946.X17526@gamplex.bde.org>
In-Reply-To: <20030328073513.GA20464@cirb503493.alcatel.com.au>
References: <20030327232742.GA80113@wantadilla.lemis.com>
    <20030328073513.GA20464@cirb503493.alcatel.com.au>
List-Id: CVS commit messages for the src tree

On Fri, 28 Mar 2003, Peter Jeremy wrote:

> On Fri, Mar 28, 2003 at 05:04:21PM +1100, Bruce Evans wrote:
> >"i686" basically means "second generation Pentium" (PentiumPro/PII/Celeron)
> >(later x86's are mostly handled better using CPU features instead of
> >a 1-dimensional class number).  Hand-"optimized" bzero's are especially
> >pessimal for this class of CPU.
>
> That matches my memory of my test results as well.
> The increasing clock multipliers mean that it doesn't matter how slow
> "rep stosl" is in clock cycle terms - main memory is always going to
> be slower.

There are still some surprising differences (see attached timings for
some examples), but I think they are more a matter of how the code
affects caches and write buffers.  The exact behaviour is very
machine-dependent, so it is hard to optimize in general-purpose
production code.

> >Benefits from SSE for bzeroing and bcopying, if any, would probably
> >come more from bypassing caches and/or not doing read-before-write
> >(SSE instructions give control over this) than from operating on wider
> >data.  I'm dubious about practical benefits.  Obviously it is not useful
> >to bust the cache when bzeroing 8MB of data, but real programs and OS's
> >mostly operate on smaller buffers.  It is negatively useful not to put
> >bzero'ed data in the (L[1-2]) cache if the data will be used soon, and
> >generally hard to predict if it will be used soon.
>
> Unless Intel have fixed the P4 caches, you definitely don't want to
> use the L1 cache for page sized bzero/bcopy.

Athlons have many similarities to Celerons here.

> Avoiding read-before-write should roughly double bzero speed and give
> you about 50% speedup on bcopy - this should be worthwhile.  Caching

It actually gives a 66% speedup for bzero on my AthlonXP.  For some
reason, at least for very large buffers, read accesses through the
cache can use only 1/2 of the memory bandwidth, and write accesses can
use only 1/3 of it (and this is after tuning for bank organization --
I get a 33% speedup for the write benchmark and 0% for real work by
including a bit for the bank number in the page color in a
deterministic way, and almost as much for including the bit in a
random way).
Using SSE instructions (mainly movntps) gives the full bandwidth for
at least bzero for large buffers (3x better), but it reduces bandwidth
for small already cached buffers (more than 3x worse):

%%%
Times on an AthlonXP-1600 overclocked by 146/133, with 1024MB of
PC2700 memory and all memory timings tuned as low as possible (CAS2,
but 2T cmds):

4K buffer (almost always cached):
zero0: 5885206293 B/s ( 6959824 us) (stosl)
zero1: 7842053086 B/s ( 5223122 us) (unroll 16)
zero2: 7049051312 B/s ( 5810711 us) (unroll 16 preallocate)
zero3: 9377720907 B/s ( 4367799 us) (unroll 32)
zero4: 7803040290 B/s ( 5249236 us) (unroll 32 preallocate)
zero5: 9802682719 B/s ( 4178448 us) (unroll 64)
zero6: 8432350664 B/s ( 4857483 us) (unroll 64 preallocate)
zero7: 5957318200 B/s ( 6875577 us) (fstl)
zero8: 3007928933 B/s (13617343 us) (movl)
zero9: 4011348905 B/s (10211029 us) (unroll 8)
zeroA: 5835984056 B/s ( 7018525 us) (generic_bzero)
zeroB: 8334888325 B/s ( 4914283 us) (i486_bzero)
zeroC: 2545022700 B/s (16094159 us) (i586_bzero)
zeroD: 7650723550 B/s ( 5353742 us) (i686_pagezero)
zeroE: 5755535593 B/s ( 7116627 us) (bzero (stosl))
zeroF: 2282741753 B/s (17943335 us) (movntps)

movntps is the SSE method.  It's significantly slower for this case.
400MB buffer (never cached):
zero0:  714045391 B/s ( 573633 us) (stosl)
zero1:  705180737 B/s ( 580844 us) (unroll 16)
zero2:  670897998 B/s ( 610525 us) (unroll 16 preallocate)
zero3:  690538809 B/s ( 593160 us) (unroll 32)
zero4:  661854647 B/s ( 618867 us) (unroll 32 preallocate)
zero5:  670525682 B/s ( 610864 us) (unroll 64)
zero6:  663334877 B/s ( 617486 us) (unroll 64 preallocate)
zero7:  781025057 B/s ( 524439 us) (fstl)
zero8:  608491547 B/s ( 673140 us) (movl)
zero9:  696489665 B/s ( 588092 us) (unroll 8)
zeroA:  713958268 B/s ( 573703 us) (generic_bzero)
zeroB:  689875870 B/s ( 593730 us) (i486_bzero)
zeroC:  721477338 B/s ( 567724 us) (i586_bzero)
zeroD:  746453616 B/s ( 548728 us) (i686_pagezero)
zeroE:  714016763 B/s ( 573656 us) (bzero (stosl))
zeroF: 2240602162 B/s ( 182808 us) (movntps)

Now movntps is about 3 times faster than everything else.  This is
the first time I've seen a measured bandwidth near the magic number of
2100 from memory named with a magic number near 2100.  This machine
used to use PC2100 with the same timing, but it developed errors
(burnt out?).  Now it has PC2700 memory, so it is within spec and can
reasonably be expected to run a little faster than PC2100 should.
%%%

> is more dubious - placing a slow-zeroed page in L1 cache is very
> probably a waste of time.  At least part of an on-demand zeroed page
> is likely to be used in the near future - but probably not all of it.
> Copying is even harder to predict - at least one word of a COW page is
> going to be used immediately, but bcopy() won't be able to tell which
> word.

For makeworld, using movntps in i686_pagezero() gives a whole 14
seconds (0.7%) improvement:

%%%
Before: bde-current with ...
+ KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600, 1024MB
make catches SIGCHLD
i686_bzero not used
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:10:47 EST 2003
        (started on Mon Mar 31 01:38:24 EST 2003)
--------------------------------------------------------------
      1943.14 real      1575.25 user       218.88 sys
     40204  maximum resident set size
      2166  average shared memory size
      1988  average unshared data size
       128  average unshared stack size
  13039568  page reclaims
     11639  page faults
         0  swaps
     20008  block input operations
      6265  block output operations
         0  messages sent
         0  messages received
     33037  signals received
    207588  voluntary context switches
    518358  involuntary context switches

After: bde-current with ...
+ KSEIII + idlezero_enable + pmap - even coloring
async mounted /c
my-Makefile after perl removal and new gcc and ufs2 and aout utilities removal
with 2 fairly new drives
1532 MHz AthlonXP 1600, 1024MB
make catches SIGCHLD
i686_bzero used and replaced by one that uses SSE (movntps)
--------------------------------------------------------------
>>> elf make world completed on Mon Mar 31 02:46:43 EST 2003
        (started on Mon Mar 31 02:14:35 EST 2003)
--------------------------------------------------------------
      1929.02 real      1576.67 user       205.30 sys
     40204  maximum resident set size
      2166  average shared memory size
      1990  average unshared data size
       128  average unshared stack size
  13039590  page reclaims
     11645  page faults
         0  swaps
     20014  block input operations
      6416  block output operations
         0  messages sent
         0  messages received
     33037  signals received
    208376  voluntary context switches
    512820  involuntary context switches
%%%

Whether 14 seconds is a lot depends on your viewpoint.
It is a lot out of the kernel time of 218 seconds, considering that
only one function was optimized, and some of the optimization doesn't
affect the real time since it is done at idle priority in pagezero.
pagezero's time was reduced from 57 seconds to 28 seconds.

Code for the above (no warranties; it only works for !SMP and I didn't
check that the FP context switching is safe...):

%%%
Index: support.s
===================================================================
RCS file: /home/ncvs/src/sys/i386/i386/support.s,v
retrieving revision 1.93
diff -u -2 -r1.93 support.s
--- support.s	22 Sep 2002 04:45:20 -0000	1.93
+++ support.s	31 Mar 2003 02:37:02 -0000
@@ -66,4 +68,9 @@
 	.space	3
 #endif
+#define HACKISH_SSE_PAGEZERO
+#ifdef HACKISH_SSE_PAGEZERO
+zero:
+	.long	0, 0, 0, 0
+#endif
 
 	.text
@@ -333,70 +342,101 @@
 	movl	%edx,%edi
 	xorl	%eax,%eax
-	shrl	$2,%ecx
 	cld
+	shrl	$2,%ecx
 	rep
 	stosl
 	movl	12(%esp),%ecx
 	andl	$3,%ecx
-	jne	1f
-	popl	%edi
-	ret
-
-1:
+	je	1f
 	rep
 	stosb
+1:
 	popl	%edi
 	ret
-#endif /* I586_CPU && defined(DEV_NPX) */
+#endif /* I586_CPU && DEV_NPX */
 
+#ifdef I686_CPU
 ENTRY(i686_pagezero)
-	pushl	%edi
-	pushl	%ebx
+	movl	4(%esp),%edx
+	movl	$PAGE_SIZE, %ecx
 
-	movl	12(%esp), %edi
-	movl	$1024, %ecx
-	cld
+#ifdef HACKISH_SSE_PAGEZERO
+	pushfl
+	cli
+	movl	%cr0,%eax
+	clts
+	subl	$16,%esp
+	movups	%xmm0,(%esp)
+	movups	zero,%xmm0
+	ALIGN_TEXT
+1:
+	movntps	%xmm0,(%edx)
+	movntps	%xmm0,16(%edx)
+	movntps	%xmm0,32(%edx)
+	movntps	%xmm0,48(%edx)
+	addl	$64,%edx
+	subl	$64,%ecx
+	jne	1b
+	movups	(%esp),%xmm0
+	addl	$16,%esp
+	movl	%eax,%cr0
+	popfl
+	ret
+2:
+#endif /* HACKISH_SSE_PAGEZERO */
 	ALIGN_TEXT
 1:
-	xorl	%eax, %eax
-	repe
-	scasl
-	jnz	2f
+	movl	(%edx), %eax
+	orl	4(%edx), %eax
+	orl	8(%edx), %eax
+	orl	12(%edx), %eax
+	orl	16(%edx), %eax
+	orl	20(%edx), %eax
+	orl	24(%edx), %eax
+	orl	28(%edx), %eax
+	jne	2f
+	movl	32(%edx), %eax
+	orl	36(%edx), %eax
+	orl	40(%edx), %eax
+	orl	44(%edx), %eax
+	orl	48(%edx), %eax
+	orl	52(%edx), %eax
+	orl	56(%edx), %eax
+	orl	60(%edx), %eax
+	jne	3f
+
+	addl	$64, %edx
+	subl	$64, %ecx
+	jne	1b
 
-	popl	%ebx
-	popl	%edi
 	ret
 
 	ALIGN_TEXT
-
 2:
-	incl	%ecx
-	subl	$4, %edi
-
-	movl	%ecx, %edx
-	cmpl	$16, %ecx
-
-	jge	3f
-
-	movl	%edi, %ebx
-	andl	$0x3f, %ebx
-	shrl	%ebx
-	shrl	%ebx
-	movl	$16, %ecx
-	subl	%ebx, %ecx
-
+	movl	$0, (%edx)
+	movl	$0, 4(%edx)
+	movl	$0, 8(%edx)
+	movl	$0, 12(%edx)
+	movl	$0, 16(%edx)
+	movl	$0, 20(%edx)
+	movl	$0, 24(%edx)
+	movl	$0, 28(%edx)
 3:
-	subl	%ecx, %edx
-	rep
-	stosl
-
-	movl	%edx, %ecx
-	testl	%edx, %edx
-	jnz	1b
+	movl	$0, 32(%edx)
+	movl	$0, 36(%edx)
+	movl	$0, 40(%edx)
+	movl	$0, 44(%edx)
+	movl	$0, 48(%edx)
+	movl	$0, 52(%edx)
+	movl	$0, 56(%edx)
+	movl	$0, 60(%edx)
+
+	addl	$64, %edx
+	subl	$64, %ecx
+	jne	1b
 
-	popl	%ebx
-	popl	%edi
 	ret
+#endif /* I686_CPU */
 
 /* fillw(pat, base, cnt) */
%%%

> I don't know how much control SSE gives you over caching - is it just
> cache/no-cache, or can you control L1+L2/L2-only/none?  In the latter
> case, telling bzero and bcopy destination to use L2-only is probably a
> reasonable compromise.  The bcopy source should probably not evict
> cache data - if data is cached, use it, otherwise fetch from main
> memory and bypass caches.

There seems to be control in individual instructions for reads, but
only a complete bypass for writes (movntps from an SSE register to
memory).  Writing can still be tuned with explicit reads or prefetches
after the writes.  I've only looked briefly at 3-year-old Intel
manuals.

> Finally, how many different bcopy/bzero variants do we want?  A

I don't want many :-).

Bruce