Date: Wed, 22 Aug 2012 09:31:07 +0300
From: Mitya <mitya@cabletv.dp.ua>
To: Bruce Evans <bde@FreeBSD.org>, freebsd-net@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Replace bcopy() to update ether_addr
Message-ID: <50347CAB.30309@cabletv.dp.ua>
In-Reply-To: <201208220207.q7M27r3A020138@ref10-i386.freebsd.org>
References: <201208220207.q7M27r3A020138@ref10-i386.freebsd.org>
22.08.2012 05:07, Bruce Evans wrote:
>> On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
>>> Hi.
>>> I found some overhead code in /src/sys/net/if_ethersubr.c and
>>> /src/sys/netgraph/ng_ether.c.
>>>
>>> It contains strings like bcopy(src, dst, ETHER_ADDR_LEN);
>>> where src and dst are "struct ether_addr *", and ETHER_ADDR_LEN equals 6.
>
> Only ng_ether.c contains such strings.  if_ethersubr.c doesn't exist.
> if_ether.c exists, but was converted to use memcpy() 17.25 years ago.
>
>>> This code is called every time we send an Ethernet packet.
>>> For example, on my machine it is invoked nearly 20K times per second.
>>>
>>> Why do we use bcopy() to copy only 6 bytes?
>>> Answer - on some architectures we cannot directly copy unaligned data.
>>>
>>> I propose this solution.
>>>
>>> In the file /usr/src/include/net/ethernet.h, add these lines:
>>>
>>> static inline void ether_addr_copy(struct ether_addr *src, struct ether_addr *dst) {
>>> #if defined(__i386__) || defined(__amd64__)
>>>         *dst = *src;
>>> #else
>>>         bcopy(src, dst, ETHER_ADDR_LEN);
>>> #endif
>>> }
>>>
>>> On i386, gcc produces code like this:
>>>         leal    -30(%ebp), %eax
>>>         leal    6(%eax), %ecx
>>>         leal    -44(%ebp), %edx
>>>         movl    (%edx), %eax
>>>         movl    %eax, (%ecx)
>>>         movzwl  4(%edx), %eax
>>>         movw    %ax, 4(%ecx)
>>> And clang produces this:
>>>         movl    -48(%ebp), %ecx
>>>         movl    %ecx, -26(%ebp)
>>>         movw    -44(%ebp), %si
>>>         movw    %si, -22(%ebp)
>>>
>>> All these variants are much faster than bcopy().
>
> You mean "as much as 5 nanoseconds faster".  Possibly even 10 nanoseconds
> faster.
>
>> A bit orthogonal to this, but also related to the performance
>> impact of these bcopy() calls: for !__NO_STRICT_ALIGNMENT
>> architectures these places probably should use memcpy()
>> instead, as bcopy() additionally has to check for overlap
>> while the former does not.  Overlaps are unlikely to be an issue
>> in these cases, and at least NetBSD apparently made the
>> switch to memcpy() 5.5 years ago.
>
> This is essentially just a style bug.  FreeBSD switched to memcpy()
> 17.25 years ago for selected networking copying.  memcpy() is supposed
> to be used if and only if compilers can optimize it.  This means that
> the size must be fixed and small, and of course that the copies don't
> overlap.  Otherwise, compilers can't do any better than call an extern
> copying routine, which is memcpy() in practice.  memcpy() was
> intentionally left out in FreeBSD until it was added 17.25 years
> ago to satisfy the changes to use memcpy() in networking code (since
> with -O0, memcpy() won't be inlined and the extern memcpy() gets
> used).  Other uses are style bugs, but there are many now :-(.
>
> bcopy() is still the primary interface, and might be faster than
> memcpy(), especially for misaligned cases, but in practice it isn't,
> except in the kernel on Pentium1 in 1996-1998 where I only optimized
> (large) bcopy()s.  Since it has to support overlapping copies it is
> inherently slower.
>
> Although compilers can't do better for large copying than call an
> extern routine, some compilers bogusly inline it to something like
> "rep movsd" on i386 (or worse, to a very large number of loads and
> stores).  gcc used to have a very large threshold for inlining
> moderately-sized copies and/or for switching between "rep movsd" and
> load/store.  It now understands better that it doesn't understand
> memory, and has more reasonable thresholds.  Or rather the thresholds
> are more MD.
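As a concrete illustration of the "fixed and small size" point, here is a
minimal, self-contained sketch.  The helper name follows Mitya's proposal
and the struct definition simply mirrors the one in <net/ethernet.h>; this
is an illustration of the idea, not either of the patches in this thread.

    #include <string.h>

    #define ETHER_ADDR_LEN 6                /* as in <net/ethernet.h> */

    struct ether_addr {
            unsigned char octet[ETHER_ADDR_LEN];
    };

    /*
     * The length is a small compile-time constant, so compilers are free
     * to expand the memcpy() inline (e.g. a 4-byte plus a 2-byte
     * load/store pair on i386/amd64) instead of emitting an extern call
     * with the overlap check that bcopy() implies.
     */
    static inline void
    ether_addr_copy(const struct ether_addr *src, struct ether_addr *dst)
    {
            memcpy(dst, src, ETHER_ADDR_LEN);
    }

If builtins are disabled (e.g. by -ffreestanding/-fno-builtin), the plain
memcpy() stays an extern call and the same effect needs __builtin_memcpy()
spelled out explicitly, which is the point Bruce makes below.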
> gcc still makes a mess with some CFLAGS:
>
> % struct foo
> % {
> %     short x;
> %     struct bar {
> %         short yy[31];
> %     } y;
> % } s, t;
> %
> % foo()
> % {
> %     s.y = t.y;
> % }
>
> With just -O, gcc-4.2.1 on i386 handles this very badly, by
> generating 15 misaligned 32-bit load/stores followed by 1 aligned
> 16-bit load/store.  With -march=<almost anything>, it generates 1
> 16-bit aligned load/store followed by an aligned "rep movsd" with a
> count of 15.  The latter is not too bad.  Similarly for yy[32].  But
> for yy[33] it switches to a call to memcpy() even with plain -O.
>
> However, improvements in memory systems and branch prediction since
> Pentium1 in 1996-98 mean that optimizing copying mostly gets
> nowhere.  Copying is either from the cache[s], in which case it is
> fast (~1 nanosecond per 128 bits for L1), or it is not from the
> caches, in which case it is slow (~50-200 nanoseconds per cache miss).
> With 6-byte ethernet addresses, using bcopy() does slow the copying
> rate to considerably below 128 bits per nanosecond (a sloppy estimate
> gives 5-10 ns/call), but it's hard for a single call to be made often
> enough to make a significant difference.  Someone mentioned 20000
> calls.  That's the same as 0 calls: 20000 * 10 nsec = 200 usec =
> 0.02% of 1 CPU.
>
> If anyone cared about this, then they would use __builtin_memcpy()
> instead of memcpy().  (Note that the point of the 17.25-year-old
> optimization has been defeated for ~10 years by turning off _all_
> builtins, which was initially done mainly to kill builtin putc().
> (-ffreestanding should have done that.)  So gcc inlines struct
> copying for small structs, but never inlines memcpy(), or strlen(),
> or other unimportant things.)  I once used the following to
> partially fix this, but stopped using it because it made no
> observable difference:
>
> % diff -c2 ./sys/systm.h~ ./sys/systm.h
> % *** ./sys/systm.h~    Sat Oct 13 12:43:54 2007
> % --- ./sys/systm.h     Sat Oct 13 12:43:56 2007
> % ***************
> % *** 188,193 ****
> % --- 188,204 ----
> %   void    bcopy(const void *from, void *to, size_t len) __nonnull(1) __nonnull(2);
> %   void    bzero(void *buf, size_t len) __nonnull(1);
> % + #if 1
> % + #define bzero(p, n) ({                                  \
> % +         if (__builtin_constant_p(n) && (n) <= 64)       \
> % +                 __builtin_memset((p), 0, (n));          \
> % +         else                                            \
> % +                 (bzero)((p), (n));                      \
> % + })
> % + #endif
> %
> %   void    *memcpy(void *to, const void *from, size_t len) __nonnull(1) __nonnull(2);
> % + #if 1
> % + #define memcpy(to, from, n) __builtin_memcpy((to), (from), (n))
> % + #endif
> %
> %   int     copystr(const void * __restrict kfaddr, void * __restrict kdaddr,
>
> In another set of dead patches, I changed lots of memcpy()s in networking
> code to __builtin_memcpy()s.
>
> There is another set of style bugs and micro-pessimizations involving
> other mem* functions, starting with bzero().  The above avoids the
> micro-pessimizations for the usual case of memset() to 0, at the cost
> of rewarding the style bug.
>
> Summary: use builtin memcpy() for small copies, and don't try hard to
> otherwise optimize this.

Would you be surprised to learn that bcopy() and memcpy() are also used to
copy "struct in_addr", whose length is only 4 bytes?  I found this in
sys/netinet/if_ether.c, and I think there is a chance of finding more in
other files.  I also found bzero(mtod(m, void *), sizeof(struct in_addr));
in ip_options.c.

> Bruce
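The same reasoning applies to those 4-byte struct in_addr copies.  A
minimal sketch of what such spots could use instead of bcopy()/bzero()
follows; the helper names are hypothetical and not taken from any patch
in this thread.

    #include <sys/types.h>
    #include <string.h>
    #include <netinet/in.h>         /* struct in_addr */

    /*
     * Constant 4-byte sizes let the compiler emit a single 32-bit
     * load/store (or a store of zero) instead of an extern call with an
     * overlap check.
     */
    static inline void
    in_addr_copy(struct in_addr *dst, const struct in_addr *src)
    {
            memcpy(dst, src, sizeof(*dst));
    }

    static inline void
    in_addr_zero(struct in_addr *ia)
    {
            /* e.g. for the bzero(mtod(m, void *), ...) in ip_options.c */
            memset(ia, 0, sizeof(*ia));
    }

Whether these actually get inlined in the kernel again depends on builtins
not being turned off, per the discussion above.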