From owner-freebsd-net@FreeBSD.ORG Wed Aug 22 02:32:21 2012
Date: Wed, 22 Aug 2012 02:32:21 GMT
From: Bruce Evans <bde@FreeBSD.org>
To: marius@alchemy.franken.de, rizzo@iet.unipi.it
Cc: freebsd-hackers@FreeBSD.org, mitya@cabletv.dp.ua, freebsd-net@FreeBSD.org
Message-Id: <201208220232.q7M2WLCL020204@ref10-i386.freebsd.org>
In-Reply-To: <20120821112415.GA50078@onelab2.iet.unipi.it>
Subject: Re: Replace bcopy() to update ether_addr

luigi wrote:

> even more orthogonal:
>
> I found that copying 8n + (5, 6 or 7) bytes was much much slower than
> copying a multiple of 8 bytes.  For n=0, 1, 2, 4, 8 bytes are efficient,
> other cases are slow (turned into 2 or 3 different writes).
>
> The netmap code uses a pkt_copy routine that does exactly this
> rounding, gaining some 10-20ns per packet for small sizes.

I don't believe 10-20ns for just the extra bytes.  memcpy() ends up with
a movsb to copy the extra bytes.  This can be slow, but I don't believe
10-20ns (except on machines running at i486 speeds of course).

% ENTRY(memcpy)
% 	pushl	%edi
% 	pushl	%esi
% 	movl	12(%esp),%edi
% 	movl	16(%esp),%esi
% 	movl	20(%esp),%ecx
% 	movl	%edi,%eax
% 	shrl	$2,%ecx		/* copy by 32-bit words */
% 	cld			/* nope, copy forwards */
% 	rep
% 	movsl
% 	movl	20(%esp),%ecx
% 	andl	$3,%ecx		/* any bytes left? */

This avoids a branch.  Some optimization manuals say that the branch is
actually better for some machines.  The above 2 instructions have a
throughput of 1 per cycle each on modern x86.  Latency might be 6 cycles.

% 	rep

Maybe 5-15 cycles throughput.

% 	movsb

Now hopefully at most 1 cycle/byte.  Some hardware might combine the
bytes as much as possible, so the whole function should use a single
"rep movsb" and let the hardware do it all.

% 	popl	%esi
% 	popl	%edi
% 	ret

Well, it's easy to get a latency of 20 cycles (5-10 ns) and maybe even a
throughput of that.  But all of this is out of order on modern x86.  The
extra cycles for the movsb might not cost at all if nothing accesses the
part of the target that they were written to soon.

With builtin memcpy, 6 bytes would be done using a load/store of 4+2
bytes and thus take the same time as 8 bytes on i386, but on amd64
8 bytes would be faster.

Bruce
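
For reference, a minimal C sketch of the rounding luigi describes: copy in
whole 64-bit words, so a length of 8n + (5, 6 or 7) rounds up to the next
multiple of 8 and the tail never turns into 2 or 3 partial stores.  The
name pkt_copy_sketch and the loop below are illustrative only, not
netmap's actual pkt_copy(); it also assumes the buffers are 8-byte
aligned and that the destination has slack for the overrun, as packet
buffers normally do.

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Sketch of a rounded-up copy (not the real netmap pkt_copy()).
	 * Copies roundup(len, 8) bytes as 64-bit words, so the caller
	 * must guarantee both buffers are 8-byte aligned and have that
	 * much room.
	 */
	static inline void
	pkt_copy_sketch(const void *src, void *dst, size_t len)
	{
		const uint64_t *s = src;
		uint64_t *d = dst;
		size_t words = (len + 7) / 8;	/* round up to whole words */

		while (words-- > 0)
			*d++ = *s++;
	}

Whether this actually beats a plain memcpy() for small sizes depends on
the machine, which is exactly the point being argued above.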