Date: Sat, 5 May 2018 11:46:53 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Mateusz Guzik
cc: Brooks Davis, Mateusz Guzik, src-committers, svn-src-all@freebsd.org,
    svn-src-head@freebsd.org
Subject: Re: svn commit: r333240 - in head/sys: powerpc/powerpc sys
Message-ID: <20180505103837.A1307@besplex.bde.org>
References: <201805040400.w4440moH025057@repo.freebsd.org>
    <20180504155301.GA56280@spindle.one-eyed-alien.net>

On Sat, 5 May 2018, Mateusz Guzik wrote:

> On Fri, May 4, 2018 at 5:53 PM, Brooks Davis wrote:
>
>> On Fri, May 04, 2018 at 04:00:48AM +0000, Mateusz Guzik wrote:
>>> Author: mjg
>>> Date: Fri May 4 04:00:48 2018
>>> New Revision: 333240
>>> URL: https://svnweb.freebsd.org/changeset/base/333240
>>>
>>> Log:
>>>   Allow __builtin_memmove instead of bcopy for small buffers of known
>>>   size
>>
>> What is the justification for forcing a size rather than using the
>> compiler's built-in fallback to memmove?  Is the kernel compilation
>> environment disabling that?
>
> It will fallback to memmove which is super pessimized as being wrapped
> around bcopy.

Actually, the pessimization is tiny.  A quick test in userland on freefall
gives:

- 22.81 seconds for 10**9 bcopy()s of 128 bytes (75 cycles each; bandwidth
  5.61G/sec)
- 23.43 seconds for 10**9 mymemmove()s of 128 bytes, where mymemmove() is
  the kernel wrapper of bcopy() (77.3 cycles each; bandwidth 5.46G/sec)

but that was only for the first run.  On another run, the bcopy()s took
23.11 seconds and the mymemmove()s took 22.62 seconds.  So the errors in
the measurement are much larger than the 2-cycle difference.  Nevertheless,
I expect the difference to be about 2 cycles, or 3%.  Most of the bandwidth
is wasted by both methods.
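The shape of the test is roughly this minimal sketch (a reconstruction,
not the actual harness; the buffer names are made up, and CPU_HZ is an
assumed ~3.3GHz clock, chosen only because it matches the cycle counts
quoted in this message):

/*
 * Sketch of the userland copy benchmark: time 10**9 copies of a
 * 128-byte buffer and report cycles per copy and bandwidth.  The
 * counter is volatile to stop clang from over-optimizing the counting
 * loop (see below).
 */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define	NBYTES	128
#define	NITER	1000000000ULL
#define	CPU_HZ	3.3e9		/* assumed clock for cycle conversion */

static char src[NBYTES], dst[NBYTES];

int
main(void)
{
	struct timespec t0, t1;
	volatile unsigned long long i;
	double secs;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NITER; i++)
		memcpy(dst, src, NBYTES);	/* or bcopy(src, dst, NBYTES) */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("%.2f seconds (%.1f cycles each; bandwidth %.2fG/sec)\n",
	    secs, secs * CPU_HZ / NITER,
	    (double)NBYTES * NITER / secs / 1e9);
	return (0);
}

Swapping memcpy() for bcopy(), or for a mymemmove() wrapper copied from
the kernel, gives the first two results above; changing NBYTES and NITER
gives the larger sizes below.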
gcc inlines memcpy() to 16 movq pairs, and this takes 4.38 to 5.16 seconds
(it tends to get faster with every run, which I suspect is due to SCHED_ULE
affinity not understanding HTT but getting better with training -- I have
been fixing this in SCHED_4BSD):

- 4.38 seconds for 10**9 memcpy()s of 128 bytes (14.5 cycles each;
  bandwidth 29.22G/sec)

Even without a fancy -march, clang inlines memcpy() to movaps on vanilla
amd64, since SSE is available:

- 2.84-2.89 seconds for 10**9 memcpy()s of 128 bytes (9.4 cycles each;
  bandwidth 45.07G/sec)

clang does a weird optimization for my counting loop unless its counter is
declared volatile -- it doesn't optimize the loop to a single copy, but
reduces the count by 10 on each iteration instead of the expected 1 or
everything.

Semi-semi-finally, with -march=native to get AVX instead of SSE:

- 1.65-1.88 seconds for 10**9 memcpy()s of 128 bytes (5.5 cycles each;
  bandwidth 77.57G/sec)

This is still far short of the 128 (non-disk-manufacturers') G/sec that
Haswell reaches at 4GHz using simple "rep movsb".  The size is too small to
amortize the overhead.

Semi-finally, with AVX and the size doubled to 256 bytes:

- 2.98-3.06 seconds for 10**9 memcpy()s of 256 bytes (9.8 cycles each;
  bandwidth 85.90G/sec)

Finally, with the size increased to 4K and the count reduced to 10M:

- 0.62-0.66 seconds for 10**7 memcpy()s of 4K bytes (204.6 cycles each;
  bandwidth 66.02G/sec)

Oops, that wasn't final.  It shows that freefall's Xeon behaves much like
Haswell.  The user library is clueless about all this and just uses "rep
movsq", and even 4K is too small to amortize the startup overhead of
not-so-fast-strings.  With the size increased to 8K and the count kept at
10M:

- 1.12-1.16 seconds for 10**7 memcpy()s of 8K bytes (369.6 cycles each;
  bandwidth 73.14G/sec)

This almost reaches the AVX bandwidth.  But the bandwidth is only this high
with both the source and the target in the L1 cache.  Doubling the size
again runs out of L1, so there is no size large enough to amortize the
~25-cycle startup overhead of not-so-fast-strings.

> These limits will get lifted after the fallback routines get sorted out.

They should be left out to begin with.  Any hard-coded limits are sure to
be wrong for many cases.  E.g., 32 is wrong on i386.  Using
__builtin_memmove() at all is a pessimization for gcc-4.2.1, but that can
be fixed (reduced to a null change) by sorting out memmove().

Bruce