From owner-freebsd-hackers  Wed Feb 24 15:36:32 1999
Delivered-To: freebsd-hackers@freebsd.org
Received: from alcanet.com.au (border.alcanet.com.au [203.62.196.10])
	by hub.freebsd.org (Postfix) with ESMTP id 5763114E0A
	for <hackers@FreeBSD.ORG>; Wed, 24 Feb 1999 15:36:08 -0800 (PST)
	(envelope-from peter.jeremy@auss2.alcatel.com.au)
Received: by border.alcanet.com.au id <40372>; Thu, 25 Feb 1999 10:24:42 +1100
Date: Thu, 25 Feb 1999 10:35:36 +1100
From: Peter Jeremy <peter.jeremy@auss2.alcatel.com.au>
Subject: bcmp() [was Re: vm_page_zero_fill]
To: hackers@FreeBSD.ORG
Cc: bright@cygnus.rush.net
Message-Id: <99Feb25.102442est.40372@border.alcanet.com.au>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Fri, 19 Feb 1999, Alfred Perlstein <bright@cygnus.rush.net> wrote:
>  I think it's more effecient because gcc is smart enough to
>use the index instead of 2 seperate pointers.

On the i[34]86 (I'm not sure about the Pentium and later), having
base+index slows down the operation.  In theory, there's no difference
in speed between the following two lines:
	movl (%esi),%eax; incl %esi
	movl (%esi,%ebx),%eax

>> Sort of true.  In theory, an explicit loop is faster than "rep cmps".
>> Lack of CPU<->RAM bandwidth tends to make this less of an issue unless
>> both strings are in L1 cache.
>
>Aren't they both forced into L1 as soon as they are first accessed, making
>the rest of the code execute quicker?  (at least on i586+)

This is true on all cached architectures.  For a large or infrequent
compare, the strings are unlikely to already be in the case, so it
comes back to CPU (or cache) <-> RAM bandwidth.  The following partially
unrolled code shows the i486 behaviour (assuming %esi and %edi both point
at addresses not in L1 cache):

  movb	(%esi),%al   Initiate line[1] load. stall[a] until (%esi) available
  cmpb  (%edi),%al   stall[b] until [1] complete, Initiate line[2] load.
                     stall[c] until (%edi) available
  decl  %ecx
  jne  ...
  movb  1(%esi),%al  line[1] loaded - no stall
  cmpb  1(%edi),%al  stall[d] until [2] complete, 
  decl  %ecx
  jne  ...

Note the large number of stalls.  In particular, stall[d] occurs
because the 486 can't use partially-filled cache lines - it will
short-cut the initial reference, but the next reference will wait
until the load is complete.  (I don't know if this behaviour is also
true of newer ix86 processors).  There's not much that can be done
about stall[b] since there's only one memory bus - but the Alpha
implements a packetized memory bus, so the processor can initiate the
line[2] load before line[1] is loaded - this partially overlaps
stall[b] and stall[c].

I did some real testing on a variety of processors (i386SX-25 (no
cache), i486DX2-50 (256k L2), pentium-133 (512k L2), PII-266 (512k
L2), AlphaServer 4100 5/466 (4M L2)), comparing a number of different
bcmp() implementations - with various string alignments.  These
tests were all on fairly long strings, so algorithm overheads don't
affect results.

The code I used was as follows:
cmc1	C byte-compare using pointers (ie libkern/bcmp.c)
cmc2	C byte-compare using index register (ie Alfred's code)
cmc3	C long compare using aligned loads and shift/mask compares
cmi1	i386 repe cmpsb
cmi2	i386 repe cmpsl (ie i386/support.s)
cmi3	i386 long compare using aligned loads and shift/mask compares (shld)
bcmp	Digital UNIX AXP libc.a bcmp()

The actual code (except for the DU bcmp) is at the end of this article.
Note that the code for both cmc3 and cmi3 access the longword beyond the
end of the strings.  This would need to be corrected for a standard
implementation (though the impact on speed would be negligible).

The code was compiled with gcc-2.7.2.1 except on the 486 and Alpha -
where gcc-2.8.1 was used.  In all cases, -O2 was specified.

The absolute speed of the best implementation was (as expected) in the
above order, though the implementation that performed best did vary.
The long compares (cmc3, cmi2, cmi3, bcmp) are sensitive to operand
alignment - cmi2 does best with both operands are longword aligned,
the rest when the alignment is the same for both operands - there's
around a 2:1 variation between optimal and worst case for cmc3,
somewhat less for cmi3 and bcmp.  In all cases, cmc3 was the fastest
of the C algorithms.

For the 386, the best code was cmi2 - 5-8 times as fast as cmc1 (which
was faster then cmc2).

For the 486DX2 and P-133, the best code depends on the operand
alignment: cmi3 faster when both strings had the same non-longword
alignment, cmi2 otherwise.  As the strings get bigger (ie as cache
effects reduce), the differences become smaller.  cmc1 is faster than
cmc2.

The PII-266 results were similar, but weighted more towards cmi3.

The Alpha results most clearly show the advantages of well-written
assembler: for aligned operands, DEC's bcmp was 27 times as fast as
cmc1 (which was itself about 50% faster than cmc2) - when both
operands were in L1 cache.  cmc3 was several times as fast as cmc1 -
though nowhere as good as DEC's code.  This order remains unchanged
(though the ratios reduce) as the strings get larger.

When the DEC cc was used (and the optimisation level cranked up), the
C code was several times faster - gcc is _very_ poor at generating
Alpha code - in fact cmc3 outperformed DEC's bcmp.  (But unfortunately
this isn't an option for FreeBSD/Alpha).

Overall: On the i386, it's probably not worthwhile moving away from
what we currently have.  On the Alpha, we can make bcmp (and
presumable bcopy and bzero) >30 times faster by hand-coding it in
assembler.  [If one of the Alpha developers wants it, I can probably
send them the output from 'cc -O5 -S' on cmc3 - I don't think this
infringes on any DEC copyrights].

----------------------------------------------------------------


typedef	const void	*cvp;
typedef	const char	*string;
typedef unsigned long	ul;
typedef const unsigned long	*culp;

int cmc1(cvp b1, cvp b2, size_t len)
{
	register string p1, p2;

	if (len == 0)
		return(0);
	p1 = (string)b1;
	p2 = (string)b2;
	do {
		if (*p1++ != *p2++)
			break;
	} while (--len);

	return(len);
}

int cmc2(cvp b1, cvp b2, size_t len)
{
	size_t	i;

	for (i = 0; i < len; i++) {
		if (i[(string)b1] != i[(string)b2])
			break;
	}
	return (i < len);
}

static int cmc3(cvp b1, cvp b2, size_t _len)
{
	int	shl, shr, len = _len;
	string	p1 = b1, p2 = b2;
	ul	va, vb;

	if (len == 0)
		return(0);

	/*
	 * align p1 to a longword boundary
	 */
	while (((long)p1) & (sizeof(long) - 1)) {
		if (*p1++ != *p2++)
			return (1);
		if (--len <= 0)
			return (0);
	}

	/*
	 * align p2 to longword boundary and calculate the shift required to
	 * align p1 and p2
	 */
	shr = ((long)p2) & (sizeof(long) - 1);
	if (shr != 0) {
		p2 -= shr;
		shr <<= 3;
		shl = (sizeof(long) << 3) - shr;

		va = *(culp)p2;
		p2 += sizeof(long);

		while ((len -= sizeof(long)) >= 0) {
			vb = *(culp)p2;
			p2 += sizeof(long);
			if (*(culp)p1 != ((va >> shr) | (vb << shl)))
				return (1);
			p1 += sizeof(long);
			va = vb;
		}
		/*
		 * At this point, len is between -sizeof(long) and -1,
		 * representing 0 .. sizeof(long)-1 bytes remaining.
		 * do the final masked compare.
		 */
		if (!(len += sizeof(long)))
			return (0);
		return ((*(culp)p1 ^ ((va >> shr) | (*(culp)p2 << shl))) &
				((1L << (len << 3)) - 1));
	} else {
		while ((len -= sizeof(long)) >= 0) {
			if (*(culp)p1 != *(culp)p2)
				return (1);
			p1 += sizeof(long);
			p2 += sizeof(long);
		}

		/*
		 * At this point, len is between -sizeof(long) and -1,
		 * representing 0 .. sizeof(long)-1 bytes remaining.
		 * do the final masked compare.
		 */
		if (!(len += sizeof(long)))
			return (0);
		return ((*(culp)p1 ^ *(culp)p2) & ((1L << (len << 3)) - 1));
	}

}

static int cmi1(cvp b1, cvp b2, size_t len)
{
	int	ret;

	__asm("cld\n"			/* set compare direction forward */
	"      repe; cmpsb\n"
	"      je 2f\n"
	"1:    incl %0\n"
	"2:"
		: "=a" (ret)
		: "0" (0), "S" (b1), "D" (b2), "c" (len));

	return (ret);
}

static int cmi2(cvp b1, cvp b2, size_t len)
{
	int	ret;

	__asm("cld\n"			/* set compare direction forward */
	"      movl %4,%%ecx\n"		/* compare by words */
	"      shrl $2,%%ecx\n"
	"      repe; cmpsl\n"
	"      jne 1f\n"
	"      movl %4,%%ecx\n"		/* compare remainder by bytes */
	"      andl $3,%%ecx\n"
	"      repe; cmpsb\n"
	"      je 2f\n"
	"1:    incl %0\n"
	"2:"
		: "=a" (ret)
		: "0" (0), "S" (b1), "D" (b2), "rm" (len)
		: "%ecx");
	return (ret);
}

static int cmi3(cvp b1, cvp b2, size_t len)
{
	int	ret;

	if (len == 0)
		return(0);

	__asm("cld\n"			/* set compare direction forward */
	"1:    testl $3,%1\n"		/* align b1 to word boundary */
	"      je 2f\n"
	"      movb (%1),%b0\n"		/* if (*b1 != *b2) */
	"      incl %1\n"
	"      xorb (%2),%b0\n"
	"      jne 8f\n"		/* return (non-zero) */
	"      incl %2\n"
	"      decl %3\n"		/* if (--len <= 0) return (0) */
	"      jg 1b\n"
	"      jmp 9f\n"		/* return (0) */

	"2:    movl %2,%%ecx\n"		/* offs = (((int)b2) & 3) << 3 */
	"      andl $3,%%ecx\n"
	"      je 5f\n"			/* b1 & b2 are aligned */
	"      subl %%ecx,%2\n"		/* align b2 to word */
	"      shl $3,%%ecx\n"

	"      movl (%2),%0; addl $4,%2\n"
	"      subl $4,%3\n"
	"      jl 4f\n"			/* while (len >= 4) */
		  ".align 4,0x90\n"
	"3:    movl (%2),%%edx; addl $4,%2\n"
	"      shrd %%cl,%%edx,%0\n"
	"      xorl (%1),%0\n"		/* aligned compare */
	"      jne 8f\n"		/* return (non-zero) */
	"      addl $4,%1\n"
	"      movl %%edx,%0\n"
	"      subl $4,%3\n"
	"      jge 3b\n"

		/*
		 * At this point, len (in %3) is between -4 and -1,
		 * representing 0..3 bytes remaining
		 */
	"4:    addl $4,%3\n"
	"      je 9f\n"			/* if (!(len += 4)) return (0); */
	"      movl (%2),%%edx\n"
	"      shrd %%cl,%%edx,%0\n"
	"      xorl (%1),%0\n"		/* eax = (*(long *)b1 ^ *(long *)b2) */
	"      shll $3,%3\n"
	"      movl %3,%%ecx\n"
	"      movl $1,%%edx\n"
	"      shll %%cl,%%edx\n"
	"      decl %%edx\n"
	"      andl %%edx,%0\n"		/* eax & ((1 << (len << 3)) - 1) */
	"      jmp 8f\n"

	"5:    subl $4,%3\n"		/* aligned operands */
	"      jl 7f\n"			/* while (len >= 4) */
	"      .align 4,0x90\n"
	"6:    movl (%2),%0; addl $4,%2\n"
	"      xorl (%1),%0\n"		/* aligned compare */
	"      jne 8f\n"		/* return (non-zero) */
	"      addl $4,%1\n"
	"      subl $4,%3\n"
	"      jge 6b\n"

		  /*
		   * At this point, len (in %3) is between -4 and -1,
		   * representing 0..3 bytes remaining
		   */
	"7:    addl $4,%3\n"
	"      je 9f\n"			/* if (!(len += 4)) return (0); */
	"      movl (%2),%0\n"
	"      xorl (%1),%0\n"		/* eax = (*(long *)b1 ^ *(long *)b2) */
	"      shll $3,%3\n"
	"      movl %3,%%ecx\n"
	"      movl $1,%%edx\n"
	"      shll %%cl,%%edx\n"
	"      decl %%edx\n"
	"      andl %%edx,%0\n"		/* eax & ((1 << (len << 3)) - 1) */
	"      jmp 8f\n"
	"9:    xor %0,%0\n"
	"8:"
		: "=a" (ret)
		: "S" (b1), "D" (b2), "b" (len)
		: "%ecx", "%edx");

	return (ret);
}

----------------------------------------------------------------
Peter


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message