Date: Mon, 11 Jun 2018 21:16:51 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r334928 - head/lib/libc/stdlib
Message-ID: <20180611193811.M1292@besplex.bde.org>
In-Reply-To: <20180611092230.GB1939@kib.kiev.ua>
References: <201806101754.w5AHsi4r061028@repo.freebsd.org> <20180611141715.W1182@besplex.bde.org> <20180611092230.GB1939@kib.kiev.ua>
On Mon, 11 Jun 2018, Konstantin Belousov wrote:

> On Mon, Jun 11, 2018 at 03:48:52PM +1000, Bruce Evans wrote:
>> Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to assign
>> the variables to fix this with small churn and without the pessimizations
>> in this commit.  This gives portable code back to K&R1978 + memcpy(), and
>> is as efficient as the above when the above works.  On x86, even gcc-3.3
>> produces a load and store for memcpy(p, q, sizeof(register_t)) when p and
>> q are void *.
>>
>> Change TYPE to unsigned char[n] and use a single memcpy() without a loop
>> to assign the variables to fix this with larger churn and with different
>> (smaller?) pessimizations than in this commit.  Setup and initialization
>> for this method is also simpler.  This uses the newfangled VLA feature,
>> and since n is variable for qsort(), builtin memcpy() with length n
>> doesn't work so well.
> Either results in the unacceptable stack use.

'unsigned char[sizeof(old TYPE)]' uses no memory and thus no stack,
provided memcpy() is optimized to register moves.  Similarly for
'unsigned char[n]', except this might be harder to optimize.  If it uses
memory, then it wasn't optimized to register moves.  n must be split up
into blocks that fit in registers, just like the reverse of what clang
does now for combining byte-sized moves into larger blocks that fit in
registers.

Any source code may result in unacceptable stack usage if the compiler
optimizations aren't quite right.  The compiler should do something like
combining assignments of small objects into assignments of objects of
size n, then reblocking to fit this in registers.  When this is not done
right, parts that don't fit in registers might be spilled to the stack,
and when this is done wrong the spill size might be nearly n or even
larger.  In my example, clang and gcc-4.2.1 get the optimization with
the VLA wrong by putting the VLA on the stack and not doing any
reblocking; clang doesn't use any stack when it generates large code to
reblock the byte assignments (except that when it uses AVX, signal
handlers use more stack).

> I can limit the char[es] and memcpy to some value of es, but I do not
> see a point.  I will not object if somebody decides to do it.

The old code reblocked es into smaller objects and got the type punning
for this wrong.
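A minimal sketch of the two approaches described above, for concreteness.
It is illustrative only: neither function is the actual libc swapfunc(),
the names are made up, and "es" stands for the element size passed to
qsort().

#include <stddef.h>
#include <string.h>

/*
 * Reblock es into word-sized pieces and copy each piece through an
 * unsigned char buffer with memcpy().
 */
static void
swap_blocks(char *a, char *b, size_t es)
{
	unsigned char t[sizeof(long)];
	char c;

	/*
	 * Whole words first: each memcpy() has the compile-time constant
	 * length sizeof(long), so a decent compiler turns it into one
	 * load and one store instead of a call.
	 */
	while (es >= sizeof(long)) {
		memcpy(t, a, sizeof(long));
		memcpy(a, b, sizeof(long));
		memcpy(b, t, sizeof(long));
		a += sizeof(long);
		b += sizeof(long);
		es -= sizeof(long);
	}
	/* Leftover tail, a byte at a time. */
	while (es-- > 0) {
		c = *a;
		*a++ = *b;
		*b++ = c;
	}
}

/*
 * VLA variant: one memcpy() each way for the whole element.  Setup is
 * simpler, but es is not a compile-time constant, so whether the
 * builtin memcpy() expansion and the VLA itself stay out of memory is
 * entirely up to the compiler.
 */
static void
swap_vla(char *a, char *b, size_t es)
{
	unsigned char t[es];

	memcpy(t, a, es);
	memcpy(a, b, es);
	memcpy(b, t, es);
}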
I checked the speed of qsort().  It is O(n*log(n)) for both swaps and
comparisons.  However, I think most uses of it have small elements,
since for large data the elements should be keys or pointers to avoid
swapping the data.  The old version is almost optimal for swapping
pointers.  clang's sophisticated reblocking of byte swaps is really bad
for swapping pointers; on amd64:

XX 	movq	plen(%rip), %r10
XX 	testq	%r10, %r10
XX 	je	.LBB2_18

plen is 8, so don't branch.

XX # %bb.1:                               # %for.body.lr.ph.i
XX 	movq	q(%rip), %rcx
XX 	movq	p(%rip), %rdx
XX 	cmpq	$32, %r10
XX 	jb	.LBB2_12

plen is < 32, so branch.

XX ....
XX .LBB2_12:                              # %for.body.i8.preheader
XX 	leaq	-1(%r10), %rsi
XX 	movq	%r10, %rdi
XX 	andq	$3, %rdi
XX 	je	.LBB2_15

plen is 0 mod 4, so branch.

XX ...
XX .LBB2_15:                              # %for.body.i8.prol.loopexit
XX 	cmpq	$3, %rsi
XX 	jb	.LBB2_18

plen - 1 is not below 3, so don't branch.

XX # %bb.16:                              # %for.body.i8.preheader.new
XX 	xorl	%esi, %esi
XX 	.p2align	4, 0x90
XX .LBB2_17:                              # %for.body.i8
XX                                        # =>This Inner Loop Header: Depth=1
XX 	movzbl	(%rdx,%rsi), %edi
XX 	movzbl	(%rcx,%rsi), %eax
XX 	movb	%al, (%rdx,%rsi)
XX 	movb	%dil, (%rcx,%rsi)
XX 	movzbl	1(%rdx,%rsi), %edi
XX 	movzbl	1(%rcx,%rsi), %eax
XX 	movb	%al, 1(%rdx,%rsi)
XX 	movb	%dil, 1(%rcx,%rsi)
XX 	movzbl	2(%rdx,%rsi), %edi
XX 	movzbl	2(%rcx,%rsi), %eax
XX 	movb	%al, 2(%rdx,%rsi)
XX 	movb	%dil, 2(%rcx,%rsi)
XX 	movzbl	3(%rdx,%rsi), %edi
XX 	movzbl	3(%rcx,%rsi), %eax
XX 	movb	%al, 3(%rdx,%rsi)
XX 	movb	%dil, 3(%rcx,%rsi)
XX 	addq	$4, %rsi

End up swapping a byte at a time after all.  Swap 4 bytes per iteration.

XX 	cmpq	%rsi, %r10
XX 	jne	.LBB2_17

Iterate twice for the 8 byte pointer.

All this instead of the 2 loads and 2 stores that the old code might have
generated.  Since es is not always 8, the old code must generate more than
2 loads and 2 stores -- it will have to reblock using similar branches to
the above.  It only uses 2 block sizes (sizeof(long) and 1), but the
compiler might pessimize this to the same as the above.

clang actually generates enormous^2 code with the current version and
enormous^3 code with the previous version.  For the current version, it
generates 1742 lines, with 192 lines for just movups instructions.  For
the previous version, it generates 4823 lines with 576 movups
instructions.  gcc -O2 generates 619 lines for the current version and
1053 lines for the previous version.

So I don't mind the simplification.  It would take lots of
__predict_ugly() annotations to prevent unrolling that pessimizes the
usual case.

Bruce
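For concreteness, a byte-at-a-time swap loop of roughly the following
shape is the kind of source that clang compiles into the disassembly
walked through above.  This is a guess at the test harness, not the
actual benchmark source; only p, q and plen mirror the symbol names in
the listing.

#include <stddef.h>

unsigned char *p, *q;	/* point at the two elements being swapped */
size_t plen;		/* element size; 8 when swapping one pointer */

void
swap_bytes(void)
{
	unsigned char t;
	size_t i;

	/*
	 * clang -O2 vectorizes and unrolls this: for plen >= 32 it takes
	 * a vectorized path (elided above), otherwise the mod-4 prologue
	 * plus the 4-bytes-per-iteration byte loop shown above, which
	 * runs twice for plen == 8 instead of doing 2 loads and 2 stores.
	 */
	for (i = 0; i < plen; i++) {
		t = p[i];
		p[i] = q[i];
		q[i] = t;
	}
}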
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20180611193811.M1292>