Date: Thu, 25 Feb 1999 10:35:36 +1100
From: Peter Jeremy <peter.jeremy@auss2.alcatel.com.au>
To: hackers@FreeBSD.ORG
Cc: bright@cygnus.rush.net
Subject: bcmp() [was Re: vm_page_zero_fill]
Message-ID: <99Feb25.102442est.40372@border.alcanet.com.au>
On Fri, 19 Feb 1999, Alfred Perlstein <bright@cygnus.rush.net> wrote:
> I think it's more effecient because gcc is smart enough to
>use the index instead of 2 seperate pointers.

On the i[34]86 (I'm not sure about the Pentium and later), having
base+index slows down the operation.  In theory, there's no difference
in speed between the following two lines:

	movl	(%esi),%eax; incl %esi
	movl	(%esi,%ebx),%eax

>> Sort of true.  In theory, an explicit loop is faster than "rep cmps".
>> Lack of CPU<->RAM bandwidth tends to make this less of an issue unless
>> both strings are in L1 cache.
>
>Aren't they both forced into L1 as soon as they are first accessed, making
>the rest of the code execute quicker? (at least on i586+)

This is true on all cached architectures.  For a large or infrequent
compare, the strings are unlikely to already be in the cache, so it
comes back to CPU (or cache) <-> RAM bandwidth.  The following partially
unrolled code shows the i486 behaviour (assuming %esi and %edi both
point at addresses not in L1 cache):

	movb	(%esi),%al	Initiate line[1] load.
				stall[a] until (%esi) available
	cmpb	(%edi),%al	stall[b] until [1] complete,
				Initiate line[2] load.
				stall[c] until (%edi) available
	decl	%ecx
	jne	...
	movb	1(%esi),%al	line[1] loaded - no stall
	cmpb	1(%edi),%al	stall[d] until [2] complete
	decl	%ecx
	jne	...

Note the large number of stalls.  In particular, stall[d] occurs because
the 486 can't use partially-filled cache lines - it will short-cut the
initial reference, but the next reference will wait until the load is
complete.  (I don't know if this behaviour is also true of newer ix86
processors.)  There's not much that can be done about stall[b] since
there's only one memory bus - but the Alpha implements a packetized
memory bus, so the processor can initiate the line[2] load before
line[1] is loaded - this partially overlaps stall[b] and stall[c].

I did some real testing on a variety of processors (i386SX-25 (no
cache), i486DX2-50 (256k L2), Pentium-133 (512k L2), PII-266 (512k L2),
AlphaServer 4100 5/466 (4M L2)), comparing a number of different bcmp()
implementations - with various string alignments.  These tests were all
on fairly long strings, so algorithm overheads don't affect results.
The code I used was as follows:

	cmc1	C byte-compare using pointers (ie libkern/bcmp.c)
	cmc2	C byte-compare using index register (ie Alfred's code)
	cmc3	C long compare using aligned loads and shift/mask compares
	cmi1	i386 repe cmpsb
	cmi2	i386 repe cmpsl (ie i386/support.s)
	cmi3	i386 long compare using aligned loads and shift/mask
		compares (shld)
	bcmp	Digital UNIX AXP libc.a bcmp()

The actual code (except for the DU bcmp) is at the end of this article.
Note that the code for both cmc3 and cmi3 accesses the longword beyond
the end of the strings.  This would need to be corrected for a standard
implementation (though the impact on speed would be negligible).  The
code was compiled with gcc-2.7.2.1 except on the 486 and Alpha - where
gcc-2.8.1 was used.  In all cases, -O2 was specified.

The absolute speed of the best implementation was (as expected) in the
above order, though the implementation that performed best did vary.
The long compares (cmc3, cmi2, cmi3, bcmp) are sensitive to operand
alignment - cmi2 does best when both operands are longword aligned, the
rest when the alignment is the same for both operands - there's around
a 2:1 variation between optimal and worst case for cmc3, somewhat less
for cmi3 and bcmp.  In all cases, cmc3 was the fastest of the C
algorithms.
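For anyone who wants to repeat the measurements, a driver along the
following lines is enough (a minimal sketch only - BUFSIZE and PASSES
are arbitrary values here, and the b + 1 offset just picks one relative
alignment; vary that offset and the alignment of a to cover the other
cases):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

int cmc1(const void *b1, const void *b2, size_t len);	/* from the listing below */

#define	BUFSIZE	(256 * 1024)	/* arbitrary - bigger than the L2 caches above */
#define	PASSES	100		/* arbitrary */

int
main(void)
{
	char *a = malloc(BUFSIZE + sizeof(long));
	char *b = malloc(BUFSIZE + sizeof(long));
	struct timeval t0, t1;
	double secs;
	int i;

	memset(a, 0x5a, BUFSIZE + sizeof(long));
	memset(b, 0x5a, BUFSIZE + sizeof(long));

	gettimeofday(&t0, NULL);
	for (i = 0; i < PASSES; i++)
		if (cmc1(a, b + 1, BUFSIZE))	/* equal buffers - must return 0 */
			abort();
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	printf("%.1f MB/s\n", BUFSIZE * (double)PASSES / secs / 1e6);
	return (0);
}

Substituting the other implementations (and making the static ones
visible) gives the rest of the numbers.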
For the 386, the best code was cmi2 - 5-8 times as fast as cmc1 (which
was faster than cmc2).

For the 486DX2 and P-133, the best code depends on the operand
alignment: cmi3 was faster when both strings had the same non-longword
alignment, cmi2 otherwise.  As the strings get bigger (ie as cache
effects reduce), the differences become smaller.  cmc1 is faster than
cmc2.  The PII-266 results were similar, but weighted more towards
cmi3.

The Alpha results most clearly show the advantages of well-written
assembler: for aligned operands, DEC's bcmp was 27 times as fast as
cmc1 (which was itself about 50% faster than cmc2) - when both operands
were in L1 cache.  cmc3 was several times as fast as cmc1 - though
nowhere near as good as DEC's code.  This order remains unchanged
(though the ratios reduce) as the strings get larger.  When the DEC cc
was used (and the optimisation level cranked up), the C code was
several times faster - gcc is _very_ poor at generating Alpha code - in
fact cmc3 outperformed DEC's bcmp.  (But unfortunately this isn't an
option for FreeBSD/Alpha.)

Overall: On the i386, it's probably not worthwhile moving away from
what we currently have.  On the Alpha, we can make bcmp (and presumably
bcopy and bzero) >30 times faster by hand-coding them in assembler.
[If one of the Alpha developers wants it, I can probably send them the
output from 'cc -O5 -S' on cmc3 - I don't think this infringes on any
DEC copyrights].
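One more note before the listing: the tail handling in cmc3 and cmi3
works by masking the final longword compare with
((1L << (len << 3)) - 1), ie keeping only the low len bytes of the XOR
of the last words.  This assumes little-endian longs (true for both the
i386 and the Alpha) and 1 <= len <= sizeof(long)-1 at that point (the
len == 0 case returns earlier).  As a stand-alone illustration
(tail_cmp is just a name for this sketch - it doesn't appear in the
code below):

/*
 * Sketch of the final masked compare used by cmc3/cmi3.  Assumes
 * little-endian longs and 1 <= len <= sizeof(long) - 1.  With 32-bit
 * longs and len = 3:
 *
 *	len << 3	= 24
 *	1L << 24	= 0x01000000
 *	(1L << 24) - 1	= 0x00ffffff
 *
 * so only the three remaining (low-order) bytes of the XOR can make
 * the result non-zero; the over-read fourth byte is masked away.
 */
static unsigned long
tail_cmp(unsigned long w1, unsigned long w2, int len)
{
	return ((w1 ^ w2) & ((1L << (len << 3)) - 1));
}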
----------------------------------------------------------------
typedef const void *cvp;
typedef const char *string;
typedef unsigned long ul;
typedef const unsigned long *culp;

int cmc1(cvp b1, cvp b2, size_t len)
{
	register string p1, p2;

	if (len == 0)
		return(0);
	p1 = (string)b1;
	p2 = (string)b2;
	do {
		if (*p1++ != *p2++)
			break;
	} while (--len);
	return(len);
}

int cmc2(cvp b1, cvp b2, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++) {
		if (i[(string)b1] != i[(string)b2])
			break;
	}
	return (i < len);
}

static int cmc3(cvp b1, cvp b2, size_t _len)
{
	int shl, shr, len = _len;
	string p1 = b1, p2 = b2;
	ul va, vb;

	if (len == 0)
		return(0);

	/*
	 * align p1 to a longword boundary
	 */
	while (((long)p1) & (sizeof(long) - 1)) {
		if (*p1++ != *p2++)
			return (1);
		if (--len <= 0)
			return (0);
	}

	/*
	 * align p2 to longword boundary and calculate the shift required to
	 * align p1 and p2
	 */
	shr = ((long)p2) & (sizeof(long) - 1);
	if (shr != 0) {
		p2 -= shr;
		shr <<= 3;
		shl = (sizeof(long) << 3) - shr;

		va = *(culp)p2;
		p2 += sizeof(long);

		while ((len -= sizeof(long)) >= 0) {
			vb = *(culp)p2;
			p2 += sizeof(long);
			if (*(culp)p1 != ((va >> shr) | (vb << shl)))
				return (1);
			p1 += sizeof(long);
			va = vb;
		}
		/*
		 * At this point, len is between -sizeof(long) and -1,
		 * representing 0 .. sizeof(long)-1 bytes remaining.
		 * do the final masked compare.
		 */
		if (!(len += sizeof(long)))
			return (0);
		return ((*(culp)p1 ^ ((va >> shr) | (*(culp)p2 << shl))) &
		    ((1L << (len << 3)) - 1));
	} else {
		while ((len -= sizeof(long)) >= 0) {
			if (*(culp)p1 != *(culp)p2)
				return (1);
			p1 += sizeof(long);
			p2 += sizeof(long);
		}
		/*
		 * At this point, len is between -sizeof(long) and -1,
		 * representing 0 .. sizeof(long)-1 bytes remaining.
		 * do the final masked compare.
		 */
		if (!(len += sizeof(long)))
			return (0);
		return ((*(culp)p1 ^ *(culp)p2) & ((1L << (len << 3)) - 1));
	}
}

static int cmi1(cvp b1, cvp b2, size_t len)
{
	int ret;

	__asm("cld\n"			/* set compare direction forward */
	    " repe; cmpsb\n"
	    " je 2f\n"
	    "1: incl %0\n"
	    "2:"
	    : "=a" (ret)
	    : "0" (0), "S" (b1), "D" (b2), "c" (len));
	return (ret);
}

static int cmi2(cvp b1, cvp b2, size_t len)
{
	int ret;

	__asm("cld\n"			/* set compare direction forward */
	    " movl %4,%%ecx\n"		/* compare by words */
	    " shrl $2,%%ecx\n"
	    " repe; cmpsl\n"
	    " jne 1f\n"
	    " movl %4,%%ecx\n"		/* compare remainder by bytes */
	    " andl $3,%%ecx\n"
	    " repe; cmpsb\n"
	    " je 2f\n"
	    "1: incl %0\n"
	    "2:"
	    : "=a" (ret)
	    : "0" (0), "S" (b1), "D" (b2), "rm" (len)
	    : "%ecx");
	return (ret);
}

static int cmi3(cvp b1, cvp b2, size_t len)
{
	int ret;

	if (len == 0)
		return(0);
	__asm("cld\n"			/* set compare direction forward */
	    "1: testl $3,%1\n"		/* align b1 to word boundary */
	    " je 2f\n"
	    " movb (%1),%b0\n"		/* if (*b1 != *b2) */
	    " incl %1\n"
	    " xorb (%2),%b0\n"
	    " jne 8f\n"			/* return (non-zero) */
	    " incl %2\n"
	    " decl %3\n"		/* if (--len <= 0) return (0) */
	    " jg 1b\n"
	    " jmp 9f\n"			/* return (0) */
	    "2: movl %2,%%ecx\n"	/* offs = (((int)b2) & 3) << 3 */
	    " andl $3,%%ecx\n"
	    " je 5f\n"			/* b1 & b2 are aligned */
	    " subl %%ecx,%2\n"		/* align b2 to word */
	    " shl $3,%%ecx\n"
	    " movl (%2),%0; addl $4,%2\n"
	    " subl $4,%3\n"
	    " jl 4f\n"			/* while (len >= 4) */
	    ".align 4,0x90\n"
	    "3: movl (%2),%%edx; addl $4,%2\n"
	    " shrd %%cl,%%edx,%0\n"
	    " xorl (%1),%0\n"		/* aligned compare */
	    " jne 8f\n"			/* return (non-zero) */
	    " addl $4,%1\n"
	    " movl %%edx,%0\n"
	    " subl $4,%3\n"
	    " jge 3b\n"
	    /*
	     * At this point, len (in %3) is between -4 and -1,
	     * representing 0..3 bytes remaining
	     */
	    "4: addl $4,%3\n"
	    " je 9f\n"			/* if (!(len += 4)) return (0); */
	    " movl (%2),%%edx\n"
	    " shrd %%cl,%%edx,%0\n"
	    " xorl (%1),%0\n"		/* eax = (*(long *)b1 ^ *(long *)b2) */
	    " shll $3,%3\n"
	    " movl %3,%%ecx\n"
	    " movl $1,%%edx\n"
	    " shll %%cl,%%edx\n"
	    " decl %%edx\n"
	    " andl %%edx,%0\n"		/* eax & ((1 << (len << 3)) - 1) */
	    " jmp 8f\n"
	    "5: subl $4,%3\n"		/* aligned operands */
	    " jl 7f\n"			/* while (len >= 4) */
	    " .align 4,0x90\n"
	    "6: movl (%2),%0; addl $4,%2\n"
	    " xorl (%1),%0\n"		/* aligned compare */
	    " jne 8f\n"			/* return (non-zero) */
	    " addl $4,%1\n"
	    " subl $4,%3\n"
	    " jge 6b\n"
	    /*
	     * At this point, len (in %3) is between -4 and -1,
	     * representing 0..3 bytes remaining
	     */
	    "7: addl $4,%3\n"
	    " je 9f\n"			/* if (!(len += 4)) return (0); */
	    " movl (%2),%0\n"
	    " xorl (%1),%0\n"		/* eax = (*(long *)b1 ^ *(long *)b2) */
	    " shll $3,%3\n"
	    " movl %3,%%ecx\n"
	    " movl $1,%%edx\n"
	    " shll %%cl,%%edx\n"
	    " decl %%edx\n"
	    " andl %%edx,%0\n"		/* eax & ((1 << (len << 3)) - 1) */
	    " jmp 8f\n"
	    "9: xor %0,%0\n"
	    "8:"
	    : "=a" (ret)
	    : "S" (b1), "D" (b2), "b" (len)
	    : "%ecx", "%edx");
	return (ret);
}
----------------------------------------------------------------

Peter

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message