Date: Mon, 11 Jun 2018 21:16:51 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
Cc: Bruce Evans, src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-head@freebsd.org
Subject: Re: svn commit: r334928 - head/lib/libc/stdlib
In-Reply-To: <20180611092230.GB1939@kib.kiev.ua>
Message-ID: <20180611193811.M1292@besplex.bde.org>
References: <201806101754.w5AHsi4r061028@repo.freebsd.org>
    <20180611141715.W1182@besplex.bde.org>
    <20180611092230.GB1939@kib.kiev.ua>

On Mon, 11 Jun 2018, Konstantin Belousov wrote:

> On Mon, Jun 11, 2018 at 03:48:52PM +1000, Bruce Evans wrote:
>> Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to
>> assign the variables to fix this with small churn and without the
>> pessimizations in this commit.  This gives portable code back to
>> K&R1978 + memcpy(), and is as efficient as the above when the above
>> works.  On x86, even gcc-3.3 produces a load and store for
>> memcpy(p, q, sizeof(register_t)) when p and q are void *.
>>
>> Change TYPE to unsigned char[n] and use a single memcpy() without a
>> loop to assign the variables to fix this with larger churn and with
>> different (smaller?) pessimizations than in this commit.  Setup and
>> initialization for this method is also simpler.  This uses the
>> newfangled VLA feature, and since n is variable for qsort(), builtin
>> memcpy() with length n doesn't work so well.
>
> Either results in unacceptable stack use.

'unsigned char[sizeof(old TYPE)]' uses no memory and thus no stack,
provided memcpy() is optimized to register moves.  Similarly for
'unsigned char[n]', except this might be harder to optimize.  If it
uses memory, then it wasn't optimized to register moves.  n must be
split up into blocks that fit in registers, just like the reverse of
what clang does now when it combines byte-sized moves into larger
blocks that fit in registers.

Any source code may result in unacceptable stack use if the compiler
optimizations aren't quite right.  The compiler should do something
like combining assignments of small objects into assignments of
objects of size n, then reblocking to fit those in registers.  When
this is not done right, parts that don't fit in registers might be
spilled to the stack, and when it is done badly the spill size might
be nearly n or even larger.  In my example, clang and gcc-4.2.1 get
the optimization with the VLA wrong by putting the VLA on the stack
and not doing any reblocking; clang doesn't use any stack when it
generates large code to reblock the byte assignments (except when it
uses AVX, in which case signal handlers use more stack).
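Concretely, the first suggestion amounts to something like the
following sketch (names are illustrative, not the committed source):
swap through an unsigned char buffer with fixed-size memcpy() calls,
blocked on sizeof(long), with a plain byte loop for the remainder.

#include <string.h>

/*
 * Swap 2 qsort() elements of size es through an unsigned char buffer.
 * Every memcpy() here has a constant size, so a compiler can turn
 * each one into a register load and store; the buffer then never
 * reaches memory, and there is no type punning.
 */
static void
swapfunc(unsigned char *a, unsigned char *b, size_t es)
{
	unsigned char t[sizeof(long)];

	while (es >= sizeof(long)) {
		memcpy(t, a, sizeof(long));
		memcpy(a, b, sizeof(long));
		memcpy(b, t, sizeof(long));
		a += sizeof(long);
		b += sizeof(long);
		es -= sizeof(long);
	}
	while (es-- != 0) {
		unsigned char tc = *a;

		*a++ = *b;
		*b++ = tc;
	}
}

When the compiler can see that es == sizeof(long), this can collapse
to the 2 loads and 2 stores discussed below.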
> I can limit the char[es] and memcpy to some value of es, but I do
> not see a point.  I will not object if somebody decides to do it.

The old code reblocked es into smaller objects and got the type
punning for this wrong.

I checked the speed of qsort().  It is O(n*log(n)) for both swaps and
comparisons.  However, I think most uses of it have small elements,
since for large data the elements should be keys or pointers, to avoid
swapping the data.  The old version is almost optimal for swapping
pointers.  clang's sophisticated reblocking of byte swaps is really
bad for swapping pointers; on amd64:

XX 	movq	plen(%rip), %r10
XX 	testq	%r10, %r10
XX 	je	.LBB2_18

plen is 8, so don't branch.

XX # %bb.1:                                # %for.body.lr.ph.i
XX 	movq	q(%rip), %rcx
XX 	movq	p(%rip), %rdx
XX 	cmpq	$32, %r10
XX 	jb	.LBB2_12

plen is < 32, so branch.

XX ...
XX .LBB2_12:                               # %for.body.i8.preheader
XX 	leaq	-1(%r10), %rsi
XX 	movq	%r10, %rdi
XX 	andq	$3, %rdi
XX 	je	.LBB2_15

plen is 0 mod 4, so branch.

XX ...
XX .LBB2_15:                               # %for.body.i8.prol.loopexit
XX 	cmpq	$3, %rsi
XX 	jb	.LBB2_18

plen - 1 is not below 3, so don't branch.

XX # %bb.16:                               # %for.body.i8.preheader.new
XX 	xorl	%esi, %esi
XX 	.p2align	4, 0x90
XX .LBB2_17:                               # %for.body.i8
XX                                         # =>This Inner Loop Header: Depth=1
XX 	movzbl	(%rdx,%rsi), %edi
XX 	movzbl	(%rcx,%rsi), %eax
XX 	movb	%al, (%rdx,%rsi)
XX 	movb	%dil, (%rcx,%rsi)
XX 	movzbl	1(%rdx,%rsi), %edi
XX 	movzbl	1(%rcx,%rsi), %eax
XX 	movb	%al, 1(%rdx,%rsi)
XX 	movb	%dil, 1(%rcx,%rsi)
XX 	movzbl	2(%rdx,%rsi), %edi
XX 	movzbl	2(%rcx,%rsi), %eax
XX 	movb	%al, 2(%rdx,%rsi)
XX 	movb	%dil, 2(%rcx,%rsi)
XX 	movzbl	3(%rdx,%rsi), %edi
XX 	movzbl	3(%rcx,%rsi), %eax
XX 	movb	%al, 3(%rdx,%rsi)
XX 	movb	%dil, 3(%rcx,%rsi)
XX 	addq	$4, %rsi

So it ends up copying a byte at a time after all, swapping 4 bytes
per loop iteration.

XX 	cmpq	%rsi, %r10
XX 	jne	.LBB2_17

Iterate twice for the 8-byte pointer.

All this instead of the 2 loads and 2 stores that the old code might
have generated.  Since es is not always 8, the old code must generate
more than 2 loads and 2 stores -- it has to reblock using branches
similar to the above.  It only uses 2 block sizes (sizeof(long) and
1), but the compiler might pessimize this to the same as the above.
(A sketch of that old reblocking follows at the end of this message.)

clang actually generates enormous^2 code with the current version and
enormous^3 code with the previous version: for the current version,
1742 lines, with 192 lines of just movups instructions; for the
previous version, 4823 lines with 576 movups instructions.  gcc -O2
generates 619 lines for the current version and 1053 lines for the
previous version.

So I don't mind the simplification.  It would take lots of
__predict_ugly() annotations to prevent unrolling that pessimizes the
usual case.

Bruce
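PS: For reference, the old reblocking looked roughly like this.  This
is a paraphrase of the old swapcode() macro from memory, not the exact
removed source; TYPE was long when the pointers and es were suitably
aligned and char otherwise (the 2 block sizes above), and the cast is
the type punning that was wrong:

/*
 * Paraphrase of the old swap: pun the elements to TYPE and swap one
 * TYPE at a time (assumes n >= sizeof(TYPE)).  Accessing the caller's
 * elements through a punned long * violates C's effective-type
 * (aliasing) rules, which is what the commit fixed.
 */
#define	swapcode(TYPE, parmi, parmj, n) do {		\
	size_t i = (n) / sizeof(TYPE);			\
	TYPE *pi = (TYPE *)(void *)(parmi);		\
	TYPE *pj = (TYPE *)(void *)(parmj);		\
	do {						\
		TYPE t = *pi;				\
		*pi++ = *pj;				\
		*pj++ = t;				\
	} while (--i > 0);				\
} while (0)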