From: Bruce Evans <brde@optusnet.com.au>
Date: Tue, 8 May 2018 09:35:14 +1000 (EST)
To: Mateusz Guzik
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r333324 - in head/sys: amd64/amd64 conf
In-Reply-To: <201805071507.w47F7SOs035073@repo.freebsd.org>
Message-ID: <20180508072206.B888@besplex.bde.org>

On Mon, 7 May 2018, Mateusz Guzik wrote:

> Log:
>   amd64: replace libkern's memset and memmove with assembly variants
>
>   memmove is repurposed bcopy (arguments swapped, return value added)
>   The libkern variant is a wrapper around bcopy, so this is a big
>   improvement.

This is a tiny improvement, perhaps negative. As previously explained,
the wrapper only costs a couple of cycles, and bcopy() is not as good as
it could be. But the simple C versions are much faster than the asm
versions. The asm versions still use 2 "rep movs*", so they take about
50 cycles to start up. The simple C loop can do about 50 loads and
stores in 50 cycles, so even with byte accesses it handles about 50
bytes before the asm version does anything. So this change is a small
pessimization if the size is usually small.

>   memset is repurposed memcpy. The librkern variant is doing fishy stuff,
>   including branching on 0 and calling bzero.

You mean the old libkern variant. It just duplicates the inline
implementation in a stupid way. Another style bug in this is that
memset() (and also index() and rindex()) is implemented as an inline in
<sys/libkern.h>, while newer mem*() are implemented as macros in
<sys/systm.h>. The inlines are harder to use and don't even support the
hack of a special case if the size is constant and small.
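
For concreteness, the old C fallback being complained about has roughly
this shape (a sketch based on the description above, not the verbatim
sys/libkern/memset.c; it assumes the kernel environment for size_t and
bzero()):

void *
memset(void *b, int c, size_t len)
{
        char *bb;

        /*
         * The "fishy" part: branch on 0 and punt to bzero() instead of
         * just storing through a single loop.
         */
        if (c == 0)
                bzero(b, len);
        else
                for (bb = (char *)b; len--;)
                        *bb++ = (char)c;
        return (b);
}

The plain byte loop in the else branch starts storing immediately, which
is the point about small sizes above: it can move about 50 bytes in the
~50 cycles that 2 "rep movs*" spend just starting up.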

libkern/memcmp.c is almost never used, and neither is your asm version,
so it shouldn't be optimized and your asm version shouldn't exist, and
the bugs in the asm version rarely matter. It is just a fallback for
cases where inlining doesn't work and for use in function pointers. I
don't see how the inlining can ever not work, but there is an ifdef
tangle that makes this non-obvious.

libkern/memcpy.c gets tangled up with the complications. It can't just
use the inline version because of namespace problems. With a macro
version, it would define itself as (memset) and then just invoke the
macro version as plain memset. With better names, the inline version
would be defined with a name like __memset() and memset() would be
defined as __memset() and the extern version would just use __memset().

Instead, libkern/memcpy.c defines LIBKERN_INLINE to tell <sys/libkern.h>
to get out of the way. This is misnamed since it kills inlines instead
of enabling them. Without it, all the inlines in libkern.h are
'static __inline'. Since they are static, they are always used, so the
extern version is never used except possibly for function pointers.
(Inlining may fail, but then a static function is used. This happens
mainly with -O0.)

The final details for memset() are that when LIBKERN_INLINE is defined,
it is used as part of the function prototype; it also prevents the
definition of LIBKERN_BODY, and this is what kills production of the
inline version. So libkern/memset.c gets only a prototype for memset().
Similarly, but bogusly, for other files that define LIBKERN_INLINE. But
normal files get the inline version.

libc also has many silly "optimizations", even in its MI parts, so
copying it to create libkern is not so good. E.g., on amd64, strcpy()
has been pessimized by turning it into a wrapper around stpcpy(). amd64
stpcpy() is in asm and uses a masking trick to reduce to word-sized
accesses. This is slower for short strings and probably should be MI if
it is done at all. OTOH, i386 strcpy() is not a wrapper, but is still
the ~25-30 year old version that does manual unrolling in asm. At least
it doesn't use any x86 string instructions. i386 used to have strlen()
in asm, but this was intentionally not done for amd64, and now both use
the MI strlen() which has the fancy masking optimization. Compilers
often inline strlen() so it is unclear how often this is used.

> Both functions are rather crude and subject to partial depessimization.

Indeed, they are fully pessimized using 2 "rep movs".

> This is a soft prerequisite to adding variants utilizing the
> 'Enhanced REP MOVSB/STOSB' bit and let the kernel patch at runtime.

Ugh. Dynamic changes also complicate debugging and profiling. Stack
traces fill up with .

I checked the effect of previous micro-optimizations for makeworld.
Makeworld for a non-bloated (~5.2) world does about 120 million calls to
copying functions (I added counters and zapped the "rep movsb" for all
such functions in amd64/support.S). Previous optimizations for inlining,
or just zapping "rep movsb", save about 50 cycles/call. That is 6
billion cycles altogether, which take about 1.5 seconds at 4GHz, but HTT
makes times about 50% longer, so scale this up to 2.25 seconds. Divide
by the number of CPUs (8) to get the improvement in real time. It is
0.28 seconds. This is in the noise. (The total sys time is about 100
seconds and the total real time is about 150 seconds, and the noise is
mostly in the idle time (standard deviation about 1% for real time but
about 4% for idle time). The estimated 0.28 seconds is about 0.2% of the
real time and could be confirmed by careful measurements, but I only did
a quick test that the sys time was reduced by a few seconds.)

It is hard to get excited about optimizations of 0.2%, but that is
exactly what I am doing for my current project of scheduler
optimizations :-).

Bruce
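
For reference, the <sys/libkern.h> arrangement described above boils
down to roughly this pattern (a simplified sketch, not the verbatim
header; the body is reduced to a plain loop since the point here is the
ifdef plumbing, and size_t is assumed from the kernel environment):

#ifndef LIBKERN_INLINE
#define LIBKERN_INLINE  static __inline
#define LIBKERN_BODY
#endif

/* The prototype always uses LIBKERN_INLINE as its qualifier. */
LIBKERN_INLINE void     *memset(void *, int, size_t);

#ifdef LIBKERN_BODY
/* Normal includers get a static __inline definition... */
LIBKERN_INLINE void *
memset(void *b, int c, size_t len)
{
        char *bb;

        for (bb = (char *)b; len--;)
                *bb++ = (char)c;
        return (b);
}
#endif

/*
 * ...but a libkern/*.c file that predefines LIBKERN_INLINE (to empty)
 * before including the header sees only the prototype, because
 * LIBKERN_BODY is then never defined, and the .c file supplies the
 * extern definition itself.
 */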