Date: Fri, 16 Nov 2018 20:50:34 +1100 (EST)
From: Bruce Evans
To: Mateusz Guzik
cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r340472 - in head: lib/libc/amd64/string sys/amd64/amd64
In-Reply-To: <201811160044.wAG0iNjM011630@repo.freebsd.org>
Message-ID: <20181116184758.B1835@besplex.bde.org>

On Fri, 16 Nov 2018, Mateusz Guzik wrote:

> Log:
>   amd64: handle small memset buffers with overlapping stores
>
>   Instead of jumping to locations which store the exact number of bytes,
>   use displacement to move the destination.
>
>   In particular the following clears an area between 8-16 (inclusive)
>   branch-free:
>
>   movq %r10,(%rdi)
>   movq %r10,-8(%rdi,%rcx)
>
>   For instance for rcx of 10 the second line is rdi + 10 - 8 = rdi + 2.
>   Writing 8 bytes starting at that offset overlaps with 6 bytes written
>   previously and writes 2 new, giving 10 in total.
>
>   Provides a nice win for smaller stores.  Other ones are erratic
>   depending on the microarchitecture.
>
>   General idea taken from NetBSD (restricted use of the trick) and bionic
>   string functions (use for various ranges like in this patch).

Why not take such ideas from FreeBSD (or at least from FreeBSD committers),
where this one was used between 1996 and 2010 for the i586(npx)-optimized
bzero?
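(For readers who don't want to decode the asm: the two movq's in the quoted
log amount to roughly the following C.  This is my transcription only;
set8() is a stand-in for a single 64-bit store, not a function in the tree.)

#include <stdint.h>
#include <string.h>

/* Stand-in for a single 64-bit store (the movq); not a helper in the tree. */
static inline void
set8(char *p, uint64_t v)
{
	memcpy(p, &v, sizeof(v));	/* compiles to one unaligned store */
}

/* Clear 8..16 bytes with two possibly-overlapping stores and no branches. */
static void
memset_8_to_16(char *dst, uint64_t fill, size_t len)	/* 8 <= len <= 16 */
{
	set8(dst, fill);		/* movq %r10,(%rdi) */
	set8(dst + len - 8, fill);	/* movq %r10,-8(%rdi,%rcx) */
}

For len = 10 the second store lands at dst + 2, overlapping 6 of the bytes
already written and adding the last 2, exactly as the log describes.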
Testing showed that the i586 version wasn't a very good idea, so I didn't
use the trick anywhere else and didn't complain much when it was backed
out.  It is not very good since it pessimizes the usual case where
everything is aligned.  Now it is an even larger pessimization for the ERMS
case, at least in theory, since "rep movsb" should be able to handle the
alignment stuff.  Here it is for the version in FreeBSD-5:

XX i586_bz2:
XX 	fldz
XX
XX 	/*
XX 	 * Align to an 8 byte boundary (misalignment in the main loop would
XX 	 * cost a factor of >= 2).  Avoid jumps (at little cost if it is
XX 	 * already aligned) by always zeroing 8 bytes and using the part up
XX 	 * to the _next_ alignment position.
XX 	 */
XX 	fstl	0(%edx)
XX 	addl	%edx,%ecx		/* part of %ecx -= new_%edx - %edx */
XX 	addl	$8,%edx
XX 	andl	$~7,%edx
XX 	subl	%edx,%ecx
XX
XX 	/*
XX 	 * Similarly align `len' to a multiple of 8.
XX 	 */
XX 	fstl	-8(%edx,%ecx)
XX 	decl	%ecx
XX 	andl	$~7,%ecx

This even has comments.  The "little" cost mentioned in the comments is
just the instruction fetch cost plus an extra fstl in cases where
everything is aligned.  The first fstl in the above doesn't have much extra
cost since it replaces an fstl in the loop later.  The second one is only
needed when the alignment stuff is needed (e.g., to write 14 bytes as 8+8
with an overlap of two), but it is always done to reduce branches.

With too many instructions to fetch, code like this becomes almost as slow
as "rep movs".  IIRC, on Haswell, "rep movs[bwlq]" takes 25 cycles to start
up (most x86's take about that long to start up string instructions, and
ERMS doesn't improve this on at least Haswell), and all cases have a
throughput of 32 bytes/cycle, so in 25 cycles 800 bytes can be copied, and
for data smaller than about this size it is best not to use string
instructions, provided you don't use methods that take too many cycles to
start up.  Alignment stuff tends to take too many cycles to start up,
especially if it has lots of branches which trash the branch target caches.
I think the above takes about 3 cycles on Haswell, except for the fstls,
which are quite slow.  IIRC, they have a throughput of about 1 every 2
cycles and a latency of 4 or 8 cycles.  The above code was optimized for
Pentium-1's, where the times in cycles for fstl were not much different
from those on newer CPUs, but everything else is either wider or faster, so
going through npx registers is a pessimization.  Pentium-1's can barely
saturate their slow L1 cache using npx instructions.
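(In case the register juggling above is hard to follow, here is a C
transcription of that prologue.  It is mine, not from the tree; store8()
stands in for the fstl, and len is assumed to be large, as it is on the
path that reaches i586_bz2.)

#include <stdint.h>
#include <string.h>

/* Stand-in for the 8-byte fstl; not a real kernel helper. */
static inline void
store8(char *p)
{
	uint64_t zero = 0;

	memcpy(p, &zero, sizeof(zero));
}

static void
bzero_align8_sketch(char *p, size_t len)	/* assumes len > 16 */
{
	char *np;

	/* Zero 8 bytes unconditionally, then round p up to a multiple of 8. */
	store8(p);
	np = (char *)(((uintptr_t)p + 8) & ~(uintptr_t)7);
	len -= (size_t)(np - p);	/* the addl/subl pair in the asm */
	p = np;

	/* Zero the last 8 bytes, then round len down to a multiple of 8. */
	store8(p + len - 8);
	len = (len - 1) & ~(size_t)7;

	/* The remainder is 8-byte aligned and a multiple of 8 long. */
	while (len != 0) {
		store8(p);
		p += 8;
		len -= 8;
	}
}

The second store8() is the same overlapping-store trick as in the commit
being discussed, used here for the tail instead of the whole buffer.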
I don't like the way ERMS is used in -current on amd64:

- pagezero_erms: this is useless on at least Haswell, since it has the same
  speed as pagezero_std.  ERMS makes "rep stosq" as fast as "rep stosb",
  and everything is aligned, so there is no difference in the setup
  overhead.  (There might be a difference from dividing the count by the
  access width, but this is done at compile time in pagezero_std, and the
  overhead for doing it at runtime is in the noise.)

- memmove_erms: with a large size, this should just use "rep movsb" with
  almost no setup.  Instead, it uses the MEMMOVE macro to do lots of setup
  and to obfuscate its pessimizations.  Disassembly shows that memmove_erms
  ends up with 152 instructions, while memmove_std ends up with 164.  There
  is little difference except that memmove_std has an extra main loop doing
  "rep movsq".  This is implemented by testing the erms arg of MEMMOVE in
  just 2 places.  The non-erms case of course has to divide by the access
  width, and then has to arrange to copy any remaining bytes.

- memmove_std: this still has the pessimization of using "rep movsb" to
  finish up in some cases.  This costs ~25 cycles of setup overhead when it
  is reached.  erms as currently used gives almost no optimization except
  by bypassing this pessimization.  The finishing should be done using
  integer registers; only 1 store is needed using overlapping stores.  The
  macro makes it more difficult to remove this pessimization.

- memmove_erms again: with a small size, it is best not to use any string
  instructions.  I think the macro does this OK, except that it is hard to
  read.  But optimizing for small sizes is especially unimportant.  Small
  sizes are inlined by the compiler in many cases, and when memmove* is
  actually called, it runs fast, at least if it doesn't use string
  instructions or have too much setup overhead to optimize the even more
  unimportant misaligned case.

- ifuncs: I don't like ifuncs.  The one for pagezero_erms is just an
  obfuscation.  The one for memmove_erms is just to save 1 whole
  predictable branch near the start of a general memmove function.
  memmove_erms now has 27 branches.  Most of them are not executed for most
  calls, but I doubt that they cost less than the 1 branch avoided using
  ifuncs.  The pre-erms version of bcopy() has just 1 branch (for the
  overlap case).  Benchmarking branch costs is difficult, and
  micro-benchmarks do it especially badly.  The size of the branch target
  cache on modern x86's is a few K.  That is enough to hold all branches
  for most micro-benchmarks, but kernels and applications are so bloated
  that it is not large enough to hold all branches for a few real
  applications.  memmove() might be called enough to keep its main branches
  cached, but that keeps the cache depleted for other uses.

- memcpy*: like memmove*, except that it is simpler, but this is obfuscated
  by using the MEMMOVE macro for memcpy too.

- memset*: similar, but even simpler, since memset is inherently simpler
  and the MEMSET macro is only used to obfuscate 2 functions.

- bzero: this better API is pessimized by going through memset(), so that
  its fill word has to be constructed at runtime (sketched after my sig
  below).

- copyout*, copyin*: these use MEMMOVE for the erms control.  This requires
  further obfuscations in the macro, and duplicates the ~27 branches in
  memmove_erms many times.

I only did efficiency tests of this using the makeworld macro-benchmark.
It showed no significant differences on Haswell.

I did some micro-benchmarks related to this in userland.  clang turns
simple C memcpy() (a loop with *dst++ = *src++) into code that is sort of
much better than all the old and new "optimized" asm variants.  E.g., if
-march indicates that the CPU supports AVX*, then clang generates
unportable AVX* code, which the asm variants can't reasonably do,
especially in the kernel.  This is only sort of better since, apart from
its unportability, clang does its usual excessive unrolling and generates
too much setup code.  In practice, I couldn't get clang's AVX code to run
faster than "rep movsb" on freefall, but on my local Haswell system my
benchmarks using AVX in asm run faster than "rep movsb".  gcc-4.2 doesn't
support AVX* and mostly generates worse non-AVX code than clang, but it
sometimes generates better code.  Both compilers generated worse code for
i386 with -m32.  So for compilers as well as for asm authors, the best
code is very MD and is rarely produced or written.

Bruce
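P.S.: the "fill word constructed at runtime" in the bzero item above is
just the usual byte-broadcast, something like the following (a generic
sketch with a name of my choosing, not a quote of support.S):

#include <stdint.h>

/*
 * Broadcast the fill byte into a 64-bit word.  memset() has to do this (or
 * the compiler has to prove the byte is 0); a dedicated bzero just uses 0.
 */
static inline uint64_t
memset_fill_word(int c)
{
	return ((uint64_t)(uint8_t)c * 0x0101010101010101ULL);
}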