Date:      Wed, 22 Aug 2012 02:32:21 GMT
From:      Bruce Evans <bde@FreeBSD.org>
To:        marius@alchemy.franken.de, rizzo@iet.unipi.it
Cc:        freebsd-hackers@FreeBSD.org, mitya@cabletv.dp.ua, freebsd-net@FreeBSD.org
Subject:   Re: Replace bcopy() to update ether_addr
Message-ID:  <201208220232.q7M2WLCL020204@ref10-i386.freebsd.org>
In-Reply-To: <20120821112415.GA50078@onelab2.iet.unipi.it>

luigi wrote:

> even more orthogonal:
> 
> I found that copying 8n + (5, 6 or 7) bytes was much much slower than
> copying a multiple of 8 bytes. For n=0, 1,2,4,8 bytes are efficient,
> other cases are slow (turned into 2 or 3 different writes).
> 
> The netmap code uses a pkt_copy routine that does exactly this
> rounding, gaining some 10-20ns per packet for small sizes.

I don't believe 10-20ns for just the extra bytes.  memcpy() ends up
with a movsb to copy the extra bytes.  This can be slow, but I don't
believe 10-20ns (except on machines running at i486 speeds of course).

% ENTRY(memcpy)
% 	pushl	%edi
% 	pushl	%esi
% 	movl	12(%esp),%edi
% 	movl	16(%esp),%esi
% 	movl	20(%esp),%ecx
% 	movl	%edi,%eax
% 	shrl	$2,%ecx				/* copy by 32-bit words */
% 	cld					/* nope, copy forwards */
% 	rep
% 	movsl
% 	movl	20(%esp),%ecx
% 	andl	$3,%ecx				/* any bytes left? */

This avoids a branch.  Some optimization manuals say that the branch is
actually better for some machines.

The above 2 instructions have a throughput of 1 per cycle each on
modern x86.  Latency might be 6 cycles.

% 	rep

Maybe 5-15 cycles throughput.

% 	movsb

Now hopefully at most 1 cycle/byte.  Some hardware might combine the
bytes as much as possible, so the whole function should use a single
"rep movsb" and let the hardware do it all.

% 	popl	%esi
% 	popl	%edi
% 	ret

Well, it's easy to get a latency of 20 cycles (5-10 ns) and maybe even
a throughput of that.  But all of this is out of order on modern x86.
The extra cycles for the movsb might cost nothing at all if nothing soon
accesses the part of the target that they wrote to.
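
For readers who don't read i386 assembly, the two-phase shape of the
listing above can be sketched in C roughly as follows.  This is only an
illustration, not the FreeBSD source: the name copy_words_then_bytes is
made up, and the loops merely stand in for "rep movsl" and "rep movsb".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not FreeBSD code): copy len bytes as 32-bit
 * words first, then copy the 0-3 leftover bytes one at a time.
 */
static void *
copy_words_then_bytes(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t nwords = len >> 2;	/* like: shrl $2,%ecx */
	size_t ntail = len & 3;		/* like: andl $3,%ecx */

	while (nwords-- > 0) {		/* stands in for: rep movsl */
		uint32_t w;

		/* Fixed-size memcpy avoids alignment and aliasing problems. */
		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
	}
	while (ntail-- > 0)		/* stands in for: rep movsb */
		*d++ = *s++;
	return (dst);
}
```

Like the assembly, the tail loop has no special case for ntail == 0; it
simply runs zero times.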

With builtin memcpy, 6 bytes would be done using load/store of 4+2 bytes
and thus take the same time as 8 bytes on i386, but on amd64 8 bytes
would be faster.
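
A minimal C sketch of the two fixed-size variants being compared, for a
6-byte Ethernet address.  The function names are made up for this
example.  Whether the compiler really turns each fixed-size memcpy()
into a single load or store depends on the compiler and flags, and the
rounded variant assumes both buffers have 2 spare bytes after the
address that may be read and overwritten.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical: copy 6 bytes as one 32-bit and one 16-bit load/store. */
static void
copy6_split(void *dst, const void *src)
{
	uint32_t lo;
	uint16_t hi;

	memcpy(&lo, src, sizeof(lo));
	memcpy(&hi, (const unsigned char *)src + 4, sizeof(hi));
	memcpy(dst, &lo, sizeof(lo));
	memcpy((unsigned char *)dst + 4, &hi, sizeof(hi));
}

/*
 * Hypothetical: round the copy up to 8 bytes, one 64-bit load/store
 * on amd64.  ASSUMES 2 bytes of padding after the address in both
 * buffers; it reads and writes those 2 extra bytes.
 */
static void
copy6_rounded(void *dst, const void *src)
{
	uint64_t w;

	memcpy(&w, src, sizeof(w));
	memcpy(dst, &w, sizeof(w));
}
```

The rounded version is only usable where clobbering the 2 bytes after
the destination address is harmless.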

Bruce


