From owner-freebsd-current@FreeBSD.ORG Wed May  2 18:06:13 2012
Date: Wed, 2 May 2012 20:25:57 +0200
From: Luigi Rizzo <luigi@onelab2.iet.unipi.it>
To: current@freebsd.org, net@freebsd.org
Message-ID: <20120502182557.GA93838@onelab2.iet.unipi.it>
Subject: fast bcopy...

As part of my netmap investigations, I was looking at how expensive
memory copies are, and here are a couple of findings (the first one
is obvious, the second one less so).

1. Especially on 64-bit machines, always use multiples of at least
   8 bytes (possibly even larger units). The bcopy code on amd64
   seems to waste an extra 20 ns (on a 3.4 GHz machine) when
   processing blocks of size 8n + {4,5,6,7}. The difference is
   relevant; on that machine I measured:

	bcopy(src, dst,  1)	~12.9ns	(data in L1 cache)
	bcopy(src, dst,  3)	~12.9ns	(data in L1 cache)
	bcopy(src, dst,  4)	~33.4ns	(data in L1 cache)	<--- NOTE
	bcopy(src, dst, 32)	~12.9ns	(data in L1 cache)
	bcopy(src, dst, 63)	~33.4ns	(data in L1 cache)	<--- NOTE
	bcopy(src, dst, 64)	~12.9ns	(data in L1 cache)

   Note how the two marked lines are much slower than the others.
   The same thing happens with data not in L1:

	bcopy(src, dst, 64)	~22ns	(not in L1)
	bcopy(src, dst, 63)	~44ns	(not in L1)
	...

   Continuing the tests on larger sizes, for the next item:

	bcopy(src, dst, 256)	~19.8ns	(data in L1 cache)
	bcopy(src, dst, 512)	~28.8ns	(data in L1 cache)
	bcopy(src, dst,  1K)	~39.6ns	(data in L1 cache)
	bcopy(src, dst,  4K)	~95.2ns	(data in L1 cache)

   An older P4 running FreeBSD 4/32-bit seems less sensitive to odd
   operand sizes.

2. Apparently, bcopy is not the fastest way to copy memory. For small
   blocks whose sizes are multiples of 32-64 bytes, I noticed that the
   following is a lot faster (breaking even at about 1 KByte):

    /* XXX only for lengths that are multiples of 32 bytes;
     * non-overlapping buffers. */
    static inline void
    fast_bcopy(void *_src, void *_dst, int l)
    {
	uint64_t *src = _src;
	uint64_t *dst = _dst;

	for (; l > 0; l -= 32) {
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	}
    }

	fast_bcopy(src, dst,  32)	~ 1.8ns	(data in L1 cache)
	fast_bcopy(src, dst,  64)	~ 2.9ns	(data in L1 cache)
	fast_bcopy(src, dst, 256)	~10.1ns	(data in L1 cache)
	fast_bcopy(src, dst, 512)	~19.5ns	(data in L1 cache)
	fast_bcopy(src, dst,  1K)	~38.4ns	(data in L1 cache)
	fast_bcopy(src, dst,  4K)	~152.0ns (data in L1 cache)

	fast_bcopy(src, dst,  32)	~15.3ns	(not in L1)
	fast_bcopy(src, dst, 256)	~38.7ns	(not in L1)
	...

   The old P4/32-bit exhibits similar results.

Conclusion: if you have to copy packets, you might be better off
padding the length to a multiple of 32 and using the following
function to get the best of both worlds. Sprinkle in some prefetch()
for better taste.

    /* XXX only for multiples of 32 bytes, non-overlapping. */
    static inline void
    good_bcopy(void *_src, void *_dst, int l)
    {
	uint64_t *src = _src;
	uint64_t *dst = _dst;

    #define likely(x)	__builtin_expect(!!(x), 1)
    #define unlikely(x)	__builtin_expect(!!(x), 0)
	if (unlikely(l >= 1024)) {
	    bcopy(src, dst, l);
	    return;
	}
	for (; l > 0; l -= 32) {
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	    *dst++ = *src++;
	}
    }

cheers
luigi