Date:      Wed, 22 Aug 2012 09:31:07 +0300
From:      Mitya <mitya@cabletv.dp.ua>
To:        Bruce Evans <bde@FreeBSD.org>, freebsd-net@freebsd.org,  freebsd-hackers@freebsd.org
Subject:   Re: Replace bcopy() to update ether_addr
Message-ID:  <50347CAB.30309@cabletv.dp.ua>
In-Reply-To: <201208220207.q7M27r3A020138@ref10-i386.freebsd.org>
References:  <201208220207.q7M27r3A020138@ref10-i386.freebsd.org>

On 22.08.2012 05:07, Bruce Evans wrote:
>> On Mon, Aug 20, 2012 at 05:46:12PM +0300, Mitya wrote:
>>> Hi.
>>> I found some overhead code in /src/sys/net/if_ethersubr.c and
>>> /src/sys/netgraph/ng_ether.c
>>>
>>> It contains lines like bcopy(src, dst, ETHER_ADDR_LEN);
>>> where src and dst are "struct ether_addr *" and ETHER_ADDR_LEN equals 6.
> Only ng_ether.c contains such lines.  if_ethersubr.c doesn't exist.
> if_ether.c exists, but was converted to use memcpy() 17.25 years ago.
>
>>> This code is called every time we send an Ethernet packet.
>>> For example, on my machine it is invoked nearly 20K times per second.
>>>
>>> Why do we use bcopy() to copy only 6 bytes?
>>> Answer: on some architectures we cannot directly copy unaligned data.
>>>
>>> I propose this solution.
>>>
>>> In the file /usr/src/include/net/ethernet.h, add these lines:
>>>
>>> static inline void ether_addr_copy(struct ether_addr *src, struct ether_addr *dst) {
>>> #if defined(__i386__) || defined(__amd64__)
>>>      *dst = *src;
>>> #else
>>>      bcopy(src, dst, ETHER_ADDR_LEN);
>>> #endif
>>> }
>>>
>>> On i386, gcc produces code like this:
>>>      leal    -30(%ebp), %eax
>>>      leal    6(%eax), %ecx
>>>      leal    -44(%ebp), %edx
>>>      movl    (%edx), %eax
>>>      movl    %eax, (%ecx)
>>>      movzwl  4(%edx), %eax
>>>      movw    %ax, 4(%ecx)
>>> And clang produces this:
>>>      movl    -48(%ebp), %ecx
>>>      movl    %ecx, -26(%ebp)
>>>      movw    -44(%ebp), %si
>>>      movw    %si, -22(%ebp)
>>>
>>> All of these variants are much faster than bcopy().
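
For illustration only, a variant that keys off alignment support rather
than naming architectures might look like the sketch below. It assumes
the __NO_STRICT_ALIGNMENT macro mentioned later in this thread is
visible at this point (in the kernel, bcopy() would come from
<sys/systm.h> rather than <strings.h>):

    #include <strings.h>            /* bcopy(); <sys/systm.h> in the kernel */
    #include <net/ethernet.h>       /* struct ether_addr, ETHER_ADDR_LEN */

    static inline void
    ether_addr_copy_sketch(const struct ether_addr *src, struct ether_addr *dst)
    {
    #ifdef __NO_STRICT_ALIGNMENT
            /* Unaligned accesses are safe: a plain struct copy. */
            *dst = *src;
    #else
            /* Strict-alignment machines: fall back to byte copying. */
            bcopy(src, dst, ETHER_ADDR_LEN);
    #endif
    }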
> You mean "as much as 5 nanoseconds faster".  Possibly even 10 nanoseconds
> faster.
>
>> A bit orthogonal to this, but also related to the performance
>> impact of these bcopy() calls: for !__NO_STRICT_ALIGNMENT
>> architectures these places probably should use memcpy()
>> instead, as bcopy() additionally has to check for overlap
>> while memcpy() does not. Overlaps are unlikely to be an issue
>> in these cases, and at least NetBSD apparently made the
>> switch to memcpy() 5.5 years ago.
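
A rough model of the difference (not the real libc code, just an
illustration of the extra direction test an overlap-safe copy pays for):

    #include <stddef.h>

    /* Overlap-safe, like bcopy()/memmove(): must pick a copy direction. */
    static void
    copy_overlap_safe(char *dst, const char *src, size_t len)
    {
            if (dst < src) {
                    while (len--)
                            *dst++ = *src++;        /* copy forward */
            } else if (dst > src) {
                    dst += len;
                    src += len;
                    while (len--)
                            *--dst = *--src;        /* copy backward */
            }
    }

    /* memcpy()-style: may assume no overlap, so no test is needed. */
    static void
    copy_no_overlap(char *dst, const char *src, size_t len)
    {
            while (len--)
                    *dst++ = *src++;
    }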
> This is essentially just a style bug.  FreeBSD switched to memcpy()
> 17.25 years ago for selected networking copying.  memcpy() is supposed
> to be used if and only if compilers can optimize it.  This means that
> the size must be fixed and small, and of course that the copies don't
> overlap.  Otherwise, compilers can't do any better than call an extern
> copying routine, which is memcpy() in practice.  memcpy() was
> intentionally left out in FreeBSD until it was added 17.25 years
> ago to satisfy the changes to use memcpy() in networking code (since
> with -O0, memcpy() won't be inlined and the extern memcpy() gets
> used).  Other uses are style bugs, but there are many now :-(.
>
> bcopy() is still the primary interface, and might be faster than
> memcpy(), especially for misaligned cases, but in practice it isn't,
> except in the kernel on Pentium1 in 1996-1998 where I only optimized
> (large) bcopy()s.  Since it has to support overlapping copies it is
> inherently slower.
>
> Although compilers can't do better for large copying than call an
> extern routine, some compilers bogusly inline it to something like
> "rep movsd" on i386, (or worse, to a very large number of loads and
> stores).  gcc used to have a very large threshold for inlining
> moderately-sized copies and/or for switching between "rep movsd" and
> load/store.  It now understands better that it doesn't understand
> memory, and has more reasonable thresholds.  Or rather, the thresholds
> are more MD.  gcc still makes a mess with some CFLAGS:
>
> % struct foo
> % {
> % 	short x;
> % 	struct bar {
> % 		short yy[31];
> % 	} y;
> % } s, t;
> %
> % foo()
> % {
> % 	s.y = t.y;
> % }
>
> With just -O, gcc-4.2.1 on i386 handles this very badly, by
> generating 15 misaligned 32-bit load/stores followed by 1 aligned
> 16-bit load/store.  With -march=<almost anything>, it generates 1
> 16-bit aligned load-store followed by an aligned "rep movsd" with a
> count of 15.  The latter is not too bad.  Similarly for yy[32].  But
> for yy[33] it switches to a call to memcpy() even with plain -O.
>
> However, improvements in memory systems and branch prediction since
> Pentium1 in 1996-98 mean that optimizing copying mostly gets
> nowhere.  Copying is either from the cache[s], in which case it is
> fast (~1 nanosecond per 128 bits for L1), or it is not from the
> caches, in which case it is slow (~50-200 nanoseconds per cache miss).
> With 6-byte ethernet addresses, using bcopy() does slow the copying
> to well below that ~1 nanosecond per 128 bits rate (a sloppy estimate
> gives 5-10 ns/call), but it's hard for a single call to be made often
> enough to make a significant difference.  Someone mentioned 20000
> calls.  That's the same as 0 calls: 20000 * 10 nsec = 200 usec =
> 0.02% of 1 CPU.
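
That per-call estimate can be sanity-checked in userland with a crude
loop like the sketch below (build without optimization, or the compiler
may elide the copies entirely; treat the output as a rough per-call
cost, not a real benchmark):

    #include <stdio.h>
    #include <strings.h>
    #include <time.h>

    struct eaddr { unsigned char octet[6]; };   /* stand-in for ether_addr */

    static double
    elapsed_ns(struct timespec t0, struct timespec t1)
    {
            return ((t1.tv_sec - t0.tv_sec) * 1e9 +
                (t1.tv_nsec - t0.tv_nsec));
    }

    int
    main(void)
    {
            struct eaddr src = {{1, 2, 3, 4, 5, 6}}, dst;
            struct timespec t0, t1;
            long i, n = 100000000;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < n; i++)
                    bcopy(&src, &dst, sizeof(dst));
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("bcopy:  %.1f ns/call\n", elapsed_ns(t0, t1) / n);

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (i = 0; i < n; i++)
                    dst = src;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("assign: %.1f ns/call\n", elapsed_ns(t0, t1) / n);
            return (dst.octet[0] & 1);  /* use dst so it is not dead code */
    }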
>
> If anyone cared about this, then they would use __builtin_memcpy()
> instead of memcpy().  (Note that the point of the 17.25-year old
> optimization has been defeated for ~10 years by turning off _all_
> builtins, which was initially done mainly to kill builtin putc().
> (-ffreestanding should have done that.)  So gcc inlines struct
> copying for small structs, but never inlines memcpy(), or strlen(),
> or other unimportant things.)   I once used the following to
> partially fix this, but stopped using it because it made no
> observable difference:
>
> % diff -c2 ./sys/systm.h~ ./sys/systm.h
> % *** ./sys/systm.h~	Sat Oct 13 12:43:54 2007
> % --- ./sys/systm.h	Sat Oct 13 12:43:56 2007
> % ***************
> % *** 188,193 ****
> % --- 188,204 ----
> %   void	bcopy(const void *from, void *to, size_t len) __nonnull(1) __nonnull(2);
> %   void	bzero(void *buf, size_t len) __nonnull(1);
> % + #if 1
> % + #define	bzero(p, n) ({						\
> % + 	if (__builtin_constant_p(n)&&  (n)<= 64)		\
> % + 		__builtin_memset((p), 0, (n));			\
> % + 	else							\
> % + 		(bzero)((p), (n));				\
> % + })
> % + #endif
> %
> %   void	*memcpy(void *to, const void *from, size_t len) __nonnull(1) __nonnull(2);
> % + #if 1
> % + #define	memcpy(to, from, n)	__builtin_memcpy((to), (from), (n))
> % + #endif
> %
> %   int	copystr(const void * __restrict kfaddr, void * __restrict kdaddr,
>
> In another set of dead patches, I changed lots of memcpy()s in networking
> code to __builtin_memcpy()'s.
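
The shape of such a change is simple; a self-contained illustration of
the inlinable case (a hypothetical fragment, not code quoted from the
tree):

    #define ETHER_ADDR_LEN 6                /* as in <net/ethernet.h> */

    struct eh_sketch {                      /* simplified ether_header */
            unsigned char ether_dhost[ETHER_ADDR_LEN];
            unsigned char ether_shost[ETHER_ADDR_LEN];
    };

    void
    fill_shost(struct eh_sketch *eh, const unsigned char *src)
    {
            /* Fixed, small, non-overlapping: the compiler can expand the
             * builtin to a few loads and stores instead of a call. */
            __builtin_memcpy(eh->ether_shost, src, ETHER_ADDR_LEN);
    }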
>
> There is another set of style bugs and micro-pessimizations involving
> other mem* functions starting with bzero().  The above avoids the
> micro-pessimizations for the usual case of memset() to 0, at a cost
> of rewarding the style bug.
>
> Summary: use builtin memcpy() for small copies, and don't try hard to
> otherwise optimize this.
Would you be surprised to learn that bcopy() and memcpy() are also used to 
copy "struct in_addr", whose length is only 4 bytes?
I found this in sys/netinet/if_ether.c, and I think there is a good chance 
of finding more of them in other files.

And I also found bzero(mtod(m, void *), sizeof(struct in_addr)); in ip_options.c.
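
For a 4-byte struct in_addr the same assignment trick applies; a sketch
(hypothetical code, not the actual if_ether.c or ip_options.c lines):

    #include <stdio.h>
    #include <netinet/in.h>         /* struct in_addr */

    int
    main(void)
    {
            struct in_addr a, b;
            static const struct in_addr zero = { 0 };

            a.s_addr = 0x01020304;  /* arbitrary example value */
            b = a;                  /* instead of bcopy(&a, &b, sizeof(b)) */
            a = zero;               /* instead of bzero(&a, sizeof(a)) */
            printf("%x %x\n", (unsigned)a.s_addr, (unsigned)b.s_addr);
            return (0);
    }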
> Bruce



