From owner-svn-src-head@FreeBSD.ORG Fri Aug 31 15:02:39 2012
Date: Fri, 31 Aug 2012 15:02:39 GMT
From: Bruce Evans <bde@FreeBSD.org>
Message-Id: <201208311502.q7VF2dEv098318@ref10-i386.freebsd.org>
To: glebius@freebsd.org, scottl@freebsd.org
In-Reply-To: <20120831101100.GL90597@FreeBSD.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r239940 - head/sys/dev/ixgbe
List-Id: SVN commit messages for the src tree for head/-current

glebius wrote:
> On Fri, Aug 31, 2012 at 10:07:38AM +0000, Scott Long wrote:
> S> +/*
> S> + * Optimized bcopy thanks to Luigi Rizzo's investigative work.  Assumes
> S> + * non-overlapping regions and 32-byte padding on both src and dst.
> S> + */
> S> +static __inline int
> S> +ixgbe_bcopy(void *_src, void *_dst, int l)
> S> +{
> S> +	uint64_t *src = _src;
> S> +	uint64_t *dst = _dst;
> S> +
> S> +	for (; l > 0; l -= 32) {
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +		*dst++ = *src++;
> S> +	}
> S> +	return (0);
> S> +}
> S> +
>
> Shouldn't this go to libkern?

It's bogus, so it belongs in /dev/null.

Bogusness starts with its name and its parameter names (its semantics are
closer to memcpy() than bcopy(), but its arg order is that of bcopy()...).

In a quick test, it is slightly slower than builtin memcpy when l = 64 on
i386, and when l = 128 on amd64:

% #include <stdint.h>
%
% static __inline int
% ixgbe_bcopy(void *_src, void *_dst, int l)
% {
% 	uint64_t *src = _src;
% 	uint64_t *dst = _dst;
%
% 	for (; l > 0; l -= 32) {
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 		*dst++ = *src++;
% 	}
% 	return (0);
% }
%
% int dst[512];
% int src[512];
%
% main()
% {
% 	int i;
%
% 	for (i = 0; i < 100000000; i++)
% #if 0
% 		__builtin_memcpy(dst, src, 64);
% #else
% 		ixgbe_bcopy(src, dst, 64);
% #endif
% }

Builtin memcpy generates lots of unrolling, with up to 64 bytes copied in
the builtin's inner loop on i386, and up to 128 on amd64.  This is
excessive, but is what makes the builtin slightly faster than the
hand-rolled version in bogus micro-benchmarks like this.  Times on
FreeBSD cluster machines (core2):

    amd64 l = 128: builtin 1.00 seconds, hand-rolled 1.16 seconds
    i386  l =  64: builtin 0.99 seconds, hand-rolled 1.43 seconds

Above 64 or 128 bytes, the builtin switches to calling memcpy().  Now
the hand-rolled version is faster.  This is essentially accidental.
gcc knows that it doesn't understand copying memory and switches to the
extern memcpy() above a certain threshold for the size, in the hope that
the extern memcpy() does understand.  The threshold is hard-coded but
depends on -march.
The hand-rolled version doesn't know that it doesn't understand copying
memory, and uses a hard-coded magic number of 32 related to the
threshold.  The thresholds of 64/128 are possibly a little too small on
the test hardware, given the unsmartness of the extern memcpy(), but it
is hard to do better without tuning for the CPU, its memory system, and
other things.  Perhaps ixgbe could know something about the context, but
libkern can't.  Times:

    amd64 l = 256: builtin 4.09 seconds, hand-rolled 2.28 seconds
    i386  l = 128: builtin 4.75 seconds, hand-rolled 2.67 seconds

These times show some interesting unrelated pessimizations:

- i386 is about twice as slow as amd64, on the same hardware.  It is
  handicapped by only having 32-bit integer registers.  With SSE, there
  would be little difference, but the setup overhead is probably too
  large to use SSE for such small copies, even in userland.

- amd64 goes from being about twice as fast as i386 with builtin memcpy
  below the threshold to not much faster above the threshold.  This is
  surprising, since the extern memcpy() uses "rep movsq" on amd64, and
  in other benchmarks on other (amd64) machines, using "rep movsq"
  instead of "rep movsl" retains the 2-fold speed advantage (because
  "rep movsq" goes at cache speed, but "rep movsl" can't keep up).
  String instructions tend to be slower, but are often fast enough.  I
  recently read a claim that movs* is the fastest method on SandyBridge,
  but not the fastest on any older CPU.

Anyway, the speed of copying from the L1 cache is unimportant.  If
anyone cared about it, then they would have noticed when automatic use
of builtin memcpy was broken for kernels ~20 years ago by turning off
all builtins.

BTW, NOTES still hasn't caught up with this change.  It still has
-fno-builtin in makeoptions.  I put this there to test the inversion of
the default -fbuiltin.  But the default was reversed without changing
this (-fno-builtin is not explicit, but is implied by -ffreestanding).

Bruce