From owner-freebsd-net@FreeBSD.ORG Wed Aug 22 02:32:21 2012
Date: Wed, 22 Aug 2012 02:32:21 GMT
From: Bruce Evans <bde@FreeBSD.org>
To: marius@alchemy.franken.de, rizzo@iet.unipi.it
Cc: freebsd-hackers@FreeBSD.org, mitya@cabletv.dp.ua, freebsd-net@FreeBSD.org
Message-Id: <201208220232.q7M2WLCL020204@ref10-i386.freebsd.org>
In-Reply-To: <20120821112415.GA50078@onelab2.iet.unipi.it>
Subject: Re: Replace bcopy() to update ether_addr

luigi wrote:

> even more orthogonal:
>
> I found that copying 8n + (5, 6 or 7) bytes was much much slower than
> copying a multiple of 8 bytes.  For n=0, 1, 2, 4, 8 bytes are efficient,
> other cases are slow (turned into 2 or 3 different writes).
>
> The netmap code uses a pkt_copy routine that does exactly this
> rounding, gaining some 10-20ns per packet for small sizes.

I don't believe 10-20ns for just the extra bytes.  memcpy() ends up with
a movsb to copy the extra bytes.  This can be slow, but I don't believe
10-20ns (except on machines running at i486 speeds of course).

% ENTRY(memcpy)
% 	pushl	%edi
% 	pushl	%esi
% 	movl	12(%esp),%edi
% 	movl	16(%esp),%esi
% 	movl	20(%esp),%ecx
% 	movl	%edi,%eax
% 	shrl	$2,%ecx		/* copy by 32-bit words */
% 	cld			/* nope, copy forwards */
% 	rep
% 	movsl
% 	movl	20(%esp),%ecx
% 	andl	$3,%ecx		/* any bytes left? */

This avoids a branch.  Some optimization manuals say that the branch is
actually better for some machines.  The above 2 instructions have a
throughput of 1 per cycle each on modern x86.  Latency might be 6 cycles.

% 	rep

Maybe 5-15 cycles throughput.

% 	movsb

Now hopefully at most 1 cycle/byte.  Some hardware might combine the
bytes as much as possible, so the whole function should use a single
"rep movsb" and let the hardware do it all.

% 	popl	%esi
% 	popl	%edi
% 	ret

Well, it's easy to get a latency of 20 cycles (5-10 ns) and maybe even a
throughput of that.  But all of this is out of order on modern x86.  The
extra cycles for the movsb might not cost at all if nothing accesses the
part of the target that they were written to soon.

With builtin memcpy, 6 bytes would be done using a load/store of 4+2
bytes and thus take the same time as 8 bytes on i386, but on amd64
8 bytes would be faster.

Bruce
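
For reference, a minimal C sketch of the rounding luigi describes: copy in
whole 64-bit words, so a length of 8n + (5, 6 or 7) rounds up to the next
multiple of 8 and the tail never turns into 2 or 3 partial stores.  The
name pkt_copy_sketch and the loop below are illustrative only, not
netmap's actual pkt_copy(); it also assumes the buffers are 8-byte
aligned and that the destination has slack for the overrun, as packet
buffers normally do.

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Sketch of a rounded-up copy (not the real netmap pkt_copy()).
	 * Copies roundup(len, 8) bytes as 64-bit words, so the caller
	 * must guarantee both buffers are 8-byte aligned and have that
	 * much room.
	 */
	static inline void
	pkt_copy_sketch(const void *src, void *dst, size_t len)
	{
		const uint64_t *s = src;
		uint64_t *d = dst;
		size_t words = (len + 7) / 8;	/* round up to whole words */

		while (words-- > 0)
			*d++ = *s++;
	}

Whether this actually beats a plain memcpy() for small sizes depends on
the machine, which is exactly the point being argued above.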