Date: Sat, 5 May 2018 10:38:29 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer <cem@freebsd.org>
Cc: Brooks Davis <brooks@freebsd.org>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r333240 - in head/sys: powerpc/powerpc sys
Message-ID: <20180505090954.X1307@besplex.bde.org>
In-Reply-To: <CAG6CVpUr_2b-yT1-uExcY1Tvpg=-P3y_owNHQ0UPg604to8Y0Q@mail.gmail.com>
References: <201805040400.w4440moH025057@repo.freebsd.org> <20180504155301.GA56280@spindle.one-eyed-alien.net> <CAG6CVpUr_2b-yT1-uExcY1Tvpg=-P3y_owNHQ0UPg604to8Y0Q@mail.gmail.com>
On Fri, 4 May 2018, Conrad Meyer wrote:

> On Fri, May 4, 2018 at 8:53 AM, Brooks Davis <brooks@freebsd.org> wrote:
>> On Fri, May 04, 2018 at 04:00:48AM +0000, Mateusz Guzik wrote:
>>> Author: mjg
>>> Date: Fri May 4 04:00:48 2018
>>> New Revision: 333240
>>> URL: https://svnweb.freebsd.org/changeset/base/333240
>>>
>>> Log:
>>>   Allow __builtin_memmove instead of bcopy for small buffers of known size
>>
>> What is the justification for forcing a size rather than using the
>> compiler's built-in fallback to memmove?

It is much the same as for using 'register' -- the programmer knows more
than the compiler (for 1% of the targets supported by the 1% of compilers
known to the programmer).

>> Is the kernel compilation
>> environment disabling that?
>
> Yes.  I think the compiler can't use its builtin bcopy() in standalone
> mode.

Not quite.  Builtin bcopy() doesn't exist even for clang (5.0.1) in hosted
mode.  But the question was about using __builtin_memmove().  This exists
for both gcc and clang, but doesn't work right:

- for gcc-4.2.1, __builtin_memmove() never works right.  It always falls
  back to memmove().  This is broken in the -ffreestanding case, since
  memmove() can be anything then.  Actually, in the kernel a function
  named memmove(9) exists and has the same semantics as memmove(3), but
  is slightly slower.  So the effect of the change for gcc-4.2.1 is to
  pessimize the case that was supposed to be optimized: fixed-size moves
  of size <= 64 are pessimized to use memmove() instead of bcopy(), while
  other args still use bcopy().

- for clang-5.0.1, on x86 for most -march settings, SSE is used even with
  -ffreestanding.  The kernel uses various -m flags to prevent use of SSE
  and the FPU.  This forces the compiler to copy through integer
  registers.  This reduces the number of cases that must be analyzed to
  see how broken the hard-coded 64 is:

  - with AVX present and not killed, inlining occurs up to size 256
    (8 pairs of vmovups through 256-bit registers)
  - else with SSE present and not killed, inlining occurs up to size 128
    (8 pairs of movaps through 128-bit registers).  This also depends on
    other CPU capabilities.  E.g., this uses movaps instead of movups
    since movups is slower on the target CPU (core2) even in the aligned
    case, while vmovups doesn't have this problem.
  - the above cases don't apply in the kernel
  - on 64-bit arches, inlining occurs up to size 64 (8 pairs of movq
    through 64-bit integer registers)
  - on 32-bit arches, inlining occurs up to size 32 (8 pairs of movl
    through 32-bit integer registers).

  So the effect of the change for clang is to pessimize half of the cases
  that were supposed to be optimized on i386 (and probably similarly on
  other 32-bit arches): fixed-size moves of size > 32 and <= 64 are
  pessimized to use memmove() instead of bcopy().

Summary: this change wouldn't have passed my review.  I have used similar
changes for 15-20 years but never made them production quality, since
there are too many variations to consider and testing showed
insignificant improvements except for micro-benchmarks.

This change should be harmless if done right.  E.g., it could spell 64 as
(8 * sizeof(register_t)) so as not to pessimize i386 (and probably
similarly on other 32-bit arches).  It is still too chummy with the
implementations.

x86 compilers used to get this wrong too, and probably still do except
when they can use AVX.  gcc got it wrong 20-30 years ago on i386 by
inlining to "rep movsl" up to sizes much larger than 32.  Eventually it
learned that it didn't understand memory and left this to the library.
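
To make the (8 * sizeof(register_t)) suggestion concrete, the wrapper in
question has roughly the following shape.  This is a sketch only, not the
committed r333240 text: the small_bcopy name, the userland register_t
stand-in and the memmove() fallback are just for illustration.

/*
 * Sketch only -- not the committed r333240 text.  Send small,
 * compile-time-constant copies to __builtin_memmove() and spell the
 * cutoff as 8 * sizeof(register_t), so that 32-bit arches get 32
 * instead of a hard-coded 64.
 */
#include <string.h>

typedef long register_t;        /* userland stand-in for the kernel type */

#define SMALL_COPY_MAX  (8 * sizeof(register_t))

#define small_bcopy(from, to, len) __extension__ ({                     \
        if (__builtin_constant_p(len) && (len) <= SMALL_COPY_MAX)       \
                __builtin_memmove((to), (from), (len));                 \
        else                                                            \
                memmove((to), (from), (len));                           \
})

/* Example: a fixed-size move the compiler is expected to inline. */
struct args {
        char    buf[48];
};

void
copy_args(struct args *dst, const struct args *src)
{

        small_bcopy(src, dst, sizeof(*src));
}

With the cutoff tied to the register width, i386 keeps its existing
fallback for the 33-64 byte sizes that clang would not inline through
integer registers anyway, which is the whole point of the suggestion.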
Now on x86 CPUs that have "fast-strings", inlined "rep movsb" is better
than most libraries.  x86 kernel "libraries" still reduce to "rep movs*"
plus lots of extra instructions that more than double the overhead.

With fast-strings, the problem with using "rep movsb" is almost entirely
in its startup overhead.  On Haswell, IIRC, it takes 25 cycles to start
up.  Then it copies 32 bytes per cycle if both the source and the target
are in the L1 cache.  So the copy size must be 25 * 32 = 800 bytes just
to amortize half of the startup overhead.  Even copying through 32-bit
integer registers is much faster than that.  There are often cache
misses, so that all methods are slow, but micro-benchmarks rarely cover
the slow cases.

I don't believe the claimed speedup from using the optimized bcopy() in
cpu_fetch_syscall_args() on amd64 (r333241).  This changes the copy size
from a variable to 6*8 bytes.  It is claimed to speed up getppid() from
7.31M/sec to 10.65M/sec on Skylake.  But here it and other changes in the
last week give a small pessimization for getpid() on Haswell at 4.08GHz:
last week, 100M getpid()'s took 8.27 seconds (12.09M/sec); now they take
8.30 seconds (12.05M/sec).  (This is with very old libraries, so there is
no possibility of getpid() being done in userland.)  0.03 seconds is
122.4M cycles, so the speed difference is 1.224 cycles per call.  The
timing has a resolution of only 0.01 seconds, so most digits in this
1.224 are noise, but the difference is apparently a single cycle.  I
would have expected more like the "rep movs*" setup overhead of 25
cycles.

Another way to look at this: 12.09M syscalls/sec means 82.71 nsec or 337
cycles each.  Not too bad.  The cache latency for this system according
to lmbench-2.0 is 1 nsec for L1, 4.5 nsec for L2, and 16.3 nsec for main
memory (old lmbench doesn't understand L3...).  That is 66 cycles for
main memory.  I don't quite believe it; 100-200 cycles is more common.  I
think Skylake is slightly slower.  Anyway, there can't be many cache
misses in 337 cycles.  We can hope for improvements of ~25 cycles each by
avoiding "rep movs*" instructions.  bcopy() actually has 2 of these (it
ends up with 1 rep movsb, which is null except in misaligned cases; this
started being pessimal with the i486 and is relatively worse with
fast-strings).

However, fixes in this area rarely make as much difference as hoped for,
except possibly on in-order arches.  On Haswell and Skylake, everything
is pipelined, so extra cycles in bcopy() don't matter provided nothing
depends on the result very soon, the nearby code is equally inefficient,
and the resources used by the bcopy() don't contend too much with the
resources used by the nearby code.  Resources used by bcopy() do contend
a lot in practice, but not 100%.
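
For reference, the micro-benchmark behind numbers like the above is
nothing more than a timed loop of the following shape.  The iteration
count, the getppid() choice and the hard-coded 4.08 GHz are illustrative
assumptions, not the exact harness used here.

/*
 * Sketch of a getppid() micro-benchmark of the sort quoted above.  The
 * iteration count and the hard-coded clock frequency are illustrative
 * assumptions, not the harness actually used.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define ITERS   100000000UL     /* 100M calls, as above */
#define CPU_GHZ 4.08            /* Haswell test machine above */

int
main(void)
{
        struct timespec t0, t1;
        unsigned long i;
        double secs;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++)
                (void)getppid();        /* 1 syscall, assuming libc doesn't cache it */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.2f s, %.2fM/sec, %.1f nsec (%.0f cycles) per call\n",
            secs, ITERS / secs / 1e6, secs / ITERS * 1e9,
            secs / ITERS * CPU_GHZ * 1e9);
        return (0);
}

With a timer resolution of 0.01 seconds, such a loop can barely resolve a
difference of a cycle or two per call, which is why most of the digits in
the 1.224 above are noise.

Bruce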