Date: Wed, 22 Aug 2012 02:07:53 GMT
From: Bruce Evans <bde@FreeBSD.org>
To: marius@alchemy.franken.de, mitya@cabletv.dp.ua
Cc: freebsd-hackers@FreeBSD.org, freebsd-net@FreeBSD.org
Message-Id: <201208220207.q7M27r3A020138@ref10-i386.freebsd.org>
In-Reply-To: <20120821102630.GA89551@alchemy.franken.de>
Subject: Re: Replace bcopy() to update ether_addr

> On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
> > Hi.
> > I found some overhead code in /src/sys/net/if_ethersubr.c and
> > /src/sys/netgraph/ng_ether.c
> >
> > It contains strings like bcopy(src, dst, ETHER_ADDR_LEN);
> > where src and dst are "struct ether_addr *" and ETHER_ADDR_LEN equals 6.

Only ng_ether.c contains such strings.  if_ethersubr.c doesn't exist.
if_ether.c exists, but was converted to use memcpy() 17.25 years ago.

> > This code is called every time we send an Ethernet packet.
> > For example, on my machine it is invoked nearly 20K times per second.
> >
> > Why do we use bcopy() to copy only 6 bytes?
> > Answer: on some architectures we cannot directly copy unaligned data.
> >
> > I propose this solution.
> >
> > In the file /usr/src/include/net/ethernet.h add these lines:
> >
> > static inline void
> > ether_addr_copy(struct ether_addr *src, struct ether_addr *dst)
> > {
> > #if defined(__i386__) || defined(__amd64__)
> > 	*dst = *src;
> > #else
> > 	bcopy(src, dst, ETHER_ADDR_LEN);
> > #endif
> > }
> >
> > On i386, gcc produces code like this:
> > 	leal	-30(%ebp), %eax
> > 	leal	6(%eax), %ecx
> > 	leal	-44(%ebp), %edx
> > 	movl	(%edx), %eax
> > 	movl	%eax, (%ecx)
> > 	movzwl	4(%edx), %eax
> > 	movw	%ax, 4(%ecx)
> > and clang produces this:
> > 	movl	-48(%ebp), %ecx
> > 	movl	%ecx, -26(%ebp)
> > 	movw	-44(%ebp), %si
> > 	movw	%si, -22(%ebp)
> >
> > All of these variants are much faster than bcopy().

You mean "as much as 5 nanoseconds faster".  Possibly even 10
nanoseconds faster.

> A bit orthogonal to this, but also related to the performance impact
> of these bcopy() calls: for !__NO_STRICT_ALIGNMENT architectures these
> places probably should use memcpy() instead, as bcopy() additionally
> has to check for overlap while the former does not.  Overlaps are
> unlikely to be an issue in these cases, and at least NetBSD apparently
> made the switch to memcpy() 5.5 years ago.

This is essentially just a style bug.  FreeBSD switched to memcpy()
17.25 years ago for selected networking copying.  memcpy() is supposed
to be used if and only if compilers can optimize it.  This means that
the size must be fixed and small, and of course that the copies don't
overlap.
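The difference is easy to inspect directly.  Here is a minimal
standalone sketch (not from Mitya's patch; the function names are made
up) that can be compiled with "cc -O -S" to compare the code generated
for a fixed-size memcpy() against a bcopy() call on a 6-byte Ethernet
address:

% #include <sys/types.h>
% #include <string.h>
% #include <strings.h>
% #include <net/ethernet.h>	/* struct ether_addr, ETHER_ADDR_LEN */
%
% /* Fixed, small, non-overlapping: the compiler may inline this copy. */
% void
% copy_with_memcpy(struct ether_addr *dst, const struct ether_addr *src)
% {
% 	memcpy(dst, src, ETHER_ADDR_LEN);
% }
%
% /* bcopy() must allow overlap, so it normally remains an extern call. */
% void
% copy_with_bcopy(struct ether_addr *dst, const struct ether_addr *src)
% {
% 	bcopy(src, dst, ETHER_ADDR_LEN);
% }

On i386 and amd64 the memcpy() version typically compiles to a 4-byte
plus a 2-byte load/store pair, much like the assembly quoted above,
while the bcopy() version stays a function call.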
When the size isn't a small compile-time constant (or the copies may
overlap), compilers can't do any better than call an extern copying
routine, which is memcpy() in practice.  memcpy() was intentionally
left out of FreeBSD until it was added 17.25 years ago to satisfy the
changes to use memcpy() in networking code (since with -O0, memcpy()
won't be inlined and the extern memcpy() gets used).  Other uses are
style bugs, but there are many now :-(.  bcopy() is still the primary
interface, and might be faster than memcpy(), especially for misaligned
cases, but in practice it isn't, except in the kernel on Pentium1 in
1996-1998, where I only optimized (large) bcopy()s.  Since it has to
support overlapping copies, it is inherently slower.

Although compilers can't do better for large copying than call an
extern routine, some compilers bogusly inline it to something like
"rep movsd" on i386 (or worse, to a very large number of loads and
stores).  gcc used to have a very large threshold for inlining
moderately-sized copies and/or for switching between "rep movsd" and
load/store.  It now understands better that it doesn't understand
memory, and has more reasonable thresholds.  Or rather, the thresholds
are more MD.  gcc still makes a mess with some CFLAGS:

% struct foo
% {
% 	short x;
% 	struct bar {
% 		short yy[31];
% 	} y;
% } s, t;
%
% foo()
% {
% 	s.y = t.y;
% }

With just -O, gcc-4.2.1 on i386 handles this very badly, generating 15
misaligned 32-bit load/stores followed by 1 aligned 16-bit load/store.
With -march=, it generates 1 aligned 16-bit load/store followed by an
aligned "rep movsd" with a count of 15.  The latter is not too bad.
Similarly for yy[32].  But for yy[33] it switches to a call to memcpy()
even with plain -O.

However, improvements in memory systems and branch prediction since
Pentium1 in 1996-98 mean that optimizing copying mostly gets nowhere.
Copying is either from the cache(s), in which case it is fast (~1
nanosecond per 128 bits for L1), or it is not from the caches, in which
case it is slow (~50-200 nanoseconds per cache miss).  With 6-byte
ethernet addresses, using bcopy() does slow the copying to well below
that 128-bits-per-nanosecond rate (a sloppy estimate gives 5-10
ns/call), but it's hard for a single call to be made often enough to
make a significant difference.  Someone mentioned 20000 calls per
second.  That's the same as 0 calls: 20000 * 10 nsec = 200 usec per
second = 0.02% of 1 CPU.

If anyone cared about this, then they would use __builtin_memcpy()
instead of memcpy().  (Note that the point of the 17.25-year-old
optimization has been defeated for ~10 years by turning off _all_
builtins, which was initially done mainly to kill builtin putc().
(-ffreestanding should have done that.)  So gcc inlines struct copying
for small structs, but never inlines memcpy(), or strlen(), or other
unimportant things.)
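To illustrate, a hypothetical helper along those lines (nothing in this
thread defines it; the name ether_addr_copy_builtin is made up) calls
the builtin explicitly, so the copy is inlined even though the normal
builtins are turned off:

% #include <sys/types.h>
% #include <net/ethernet.h>	/* struct ether_addr, ETHER_ADDR_LEN */
%
% /*
%  * Hypothetical helper: an explicit __builtin_memcpy() is expanded by
%  * gcc and clang even under -fno-builtin, and the constant 6-byte size
%  * lets the compiler emit a few loads and stores instead of a call.
%  */
% static __inline void
% ether_addr_copy_builtin(struct ether_addr *dst,
%     const struct ether_addr *src)
% {
% 	__builtin_memcpy(dst, src, ETHER_ADDR_LEN);
% }

Whether the resulting (possibly misaligned) loads and stores actually
beat the bcopy() call is exactly the 5-10 ns question above.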
I once used the following to partially fix this, but stopped using it
because it made no observable difference:

% diff -c2 ./sys/systm.h~ ./sys/systm.h
% *** ./sys/systm.h~	Sat Oct 13 12:43:54 2007
% --- ./sys/systm.h	Sat Oct 13 12:43:56 2007
% ***************
% *** 188,193 ****
% --- 188,204 ----
%   void	bcopy(const void *from, void *to, size_t len) __nonnull(1) __nonnull(2);
%   void	bzero(void *buf, size_t len) __nonnull(1);
% + #if 1
% + #define	bzero(p, n) ({						\
% + 	if (__builtin_constant_p(n) && (n) <= 64)		\
% + 		__builtin_memset((p), 0, (n));			\
% + 	else							\
% + 		(bzero)((p), (n));				\
% + })
% + #endif
%
%   void	*memcpy(void *to, const void *from, size_t len) __nonnull(1) __nonnull(2);
% + #if 1
% + #define	memcpy(to, from, n)	__builtin_memcpy((to), (from), (n))
% + #endif
%
%   int	copystr(const void * __restrict kfaddr, void * __restrict kdaddr,

In another set of dead patches, I changed lots of memcpy()s in
networking code to __builtin_memcpy()'s.

There is another set of style bugs and micro-pessimizations involving
other mem* functions starting with bzero().  The above avoids the
micro-pessimizations for the usual case of memset() to 0, at a cost of
rewarding the style bug.

Summary: use builtin memcpy() for small copies, and don't try hard to
otherwise optimize this.

Bruce