From owner-freebsd-current@FreeBSD.ORG Wed May  2 18:06:13 2012
Date: Wed, 2 May 2012 20:25:57 +0200
From: Luigi Rizzo <luigi@onelab2.iet.unipi.it>
To: current@freebsd.org, net@freebsd.org
Message-ID: <20120502182557.GA93838@onelab2.iet.unipi.it>
Subject: fast bcopy...

As part of my netmap investigations, I was looking at how expensive
memory copies are, and here are a couple of findings (the first one
is obvious, the second one less so).

1. Especially on 64-bit machines, always use multiples of at least
   8 bytes (possibly even larger units). The bcopy code on amd64
   seems to waste an extra 20 ns (on a 3.4 GHz machine) when
   processing blocks of size 8n + {4,5,6,7}. The difference is
   relevant; on that machine I measured:

	bcopy(src, dst,  1)	~12.9ns	(data in L1 cache)
	bcopy(src, dst,  3)	~12.9ns	(data in L1 cache)
	bcopy(src, dst,  4)	~33.4ns	(data in L1 cache)	<--- NOTE
	bcopy(src, dst, 32)	~12.9ns	(data in L1 cache)
	bcopy(src, dst, 63)	~33.4ns	(data in L1 cache)	<--- NOTE
	bcopy(src, dst, 64)	~12.9ns	(data in L1 cache)

   Note how the two marked lines are much slower than the others.
   The same thing happens with data not in L1:

	bcopy(src, dst, 64)	~22ns	(not in L1)
	bcopy(src, dst, 63)	~44ns	(not in L1)
	...

   Continuing the tests on larger sizes, for the next item:

	bcopy(src, dst, 256)	~19.8ns	(data in L1 cache)
	bcopy(src, dst, 512)	~28.8ns	(data in L1 cache)
	bcopy(src, dst,  1K)	~39.6ns	(data in L1 cache)
	bcopy(src, dst,  4K)	~95.2ns	(data in L1 cache)

   An older P4 running FreeBSD 4/32-bit seems less sensitive to odd
   operand sizes.

2. Apparently, bcopy is not the fastest way to copy memory. For small
   blocks whose sizes are multiples of 32-64 bytes, I noticed that the
   following is a lot faster (breaking even at about 1 KByte):

    /* XXX only for lengths that are multiples of 32 bytes;
     * non-overlapping buffers. */
    static inline void
    fast_bcopy(void *_src, void *_dst, int l)
    {
	uint64_t *src = _src;
	uint64_t *dst = _dst;

	for (; l > 0; l -= 32) {
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	}
    }

	fast_bcopy(src, dst,  32)	~ 1.8ns	(data in L1 cache)
	fast_bcopy(src, dst,  64)	~ 2.9ns	(data in L1 cache)
	fast_bcopy(src, dst, 256)	~10.1ns	(data in L1 cache)
	fast_bcopy(src, dst, 512)	~19.5ns	(data in L1 cache)
	fast_bcopy(src, dst,  1K)	~38.4ns	(data in L1 cache)
	fast_bcopy(src, dst,  4K)	~152.0ns (data in L1 cache)

	fast_bcopy(src, dst,  32)	~15.3ns	(not in L1)
	fast_bcopy(src, dst, 256)	~38.7ns	(not in L1)
	...

   The old P4/32-bit exhibits similar results.

Conclusion: if you have to copy packets, you might be better off
padding the length to a multiple of 32 and using the following
function to get the best of both worlds. Sprinkle in some prefetch()
for better taste.

    /* XXX only for multiples of 32 bytes, non-overlapping. */
    static inline void
    good_bcopy(void *_src, void *_dst, int l)
    {
	uint64_t *src = _src;
	uint64_t *dst = _dst;

    #define likely(x)	__builtin_expect(!!(x), 1)
    #define unlikely(x)	__builtin_expect(!!(x), 0)
	if (unlikely(l >= 1024)) {
	    bcopy(src, dst, l);
	    return;
	}
	for (; l > 0; l -= 32) {
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	}
    }

cheers
luigi