Date:      Mon, 11 Jun 2018 21:16:51 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Bruce Evans <brde@optusnet.com.au>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r334928 - head/lib/libc/stdlib
Message-ID:  <20180611193811.M1292@besplex.bde.org>
In-Reply-To: <20180611092230.GB1939@kib.kiev.ua>
References:  <201806101754.w5AHsi4r061028@repo.freebsd.org> <20180611141715.W1182@besplex.bde.org> <20180611092230.GB1939@kib.kiev.ua>

On Mon, 11 Jun 2018, Konstantin Belousov wrote:

> On Mon, Jun 11, 2018 at 03:48:52PM +1000, Bruce Evans wrote:
>> Change TYPE to unsigned char[sizeof(old TYPE)] and use memcpy() to assign
>> the variables to fix this with small churn and without the pessimizations
>> in this commit.  This gives portable code back to K&R1978 + memcpy(), and
>> is as efficient as the above when the above works.  On x86, even gcc-3.3
>> produces a load and store for memcpy(p, q, sizeof(register_t)) when p and
>> q are void *.
>>
>> Change TYPE to unsigned char[n] and use a single memcpy() without a loop
>> to assign the variables to fix this with larger churn and with different
>> (smaller?) pessimizations than in this commit.  Setup and initialization
>> for this method is also simpler.  This uses the newfangled VLA feature,
>> and since n is variable for qsort(), builtin memcpy() with length n
>> doesn't work so well.
> Either results in the unacceptable stack use.

'unsigned char[sizeof(old TYPE)]' uses no memory, and thus no stack, provided
that memcpy() is optimized to register moves.
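
A minimal sketch of the first variant (the function name and exact types are
illustrative, not what qsort.c uses; n is assumed to be a multiple of
sizeof(long), as the old setup guaranteed for this path):

XX #include <string.h>
XX
XX /*
XX  * Swap n bytes, sizeof(long) bytes at a time, going through an
XX  * unsigned char buffer instead of punning the pointers to long *.
XX  * Each memcpy() has a constant size, so compilers are expected to
XX  * expand it to plain loads and stores and 'tmp' never reaches memory.
XX  */
XX static void
XX swapblocks(unsigned char *p, unsigned char *q, size_t n)
XX {
XX 	unsigned char tmp[sizeof(long)];	/* the old TYPE was long */
XX 	size_t i;
XX
XX 	for (i = 0; i < n; i += sizeof(tmp)) {
XX 		memcpy(tmp, p + i, sizeof(tmp));
XX 		memcpy(p + i, q + i, sizeof(tmp));
XX 		memcpy(q + i, tmp, sizeof(tmp));
XX 	}
XX }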

Similarly for 'unsigned char[n]', except that this might be harder to
optimize.  If it uses memory, then it wasn't optimized to register moves.
n must be split up into blocks that fit in registers, essentially the
reverse of what clang does now when it combines byte-sized moves into
larger blocks that fit in registers.
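
A minimal sketch of the VLA variant (again, the name is only illustrative):

XX #include <string.h>
XX
XX /*
XX  * Swap whole elements through a VLA.  There is no reblocking loop in
XX  * the source; the compiler (or the memcpy() implementation) has to do
XX  * the reblocking, and with a variable n the builtin memcpy() cannot be
XX  * expanded to a fixed sequence of register moves.
XX  */
XX static void
XX swapelem(void *p, void *q, size_t n)
XX {
XX 	unsigned char tmp[n];		/* C99 VLA; may end up on the stack */
XX
XX 	memcpy(tmp, p, n);
XX 	memcpy(p, q, n);
XX 	memcpy(q, tmp, n);
XX }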

Any source code may result in unacceptable stack usage if the compiler
optimizations aren't quite right.  The compiler should do something like
combining assignments of small objects into assignments of objects of size
n, then reblocking to fit these in registers.  When this is not done right,
parts that don't fit in registers might be spilled to the stack, and when
it is done wrong the spill size might be nearly n or even larger.

In my example, clang and gcc-4.2.1 get the optimization with the VLA wrong
by putting the VLA on the stack and not doing any reblocking; clang doesn't
use any stack when it generates large code to reblock the byte assignments
(except that when it uses AVX, signal handlers use more stack).

> I can limit the char[es] and memcpy to some value of es, but I do not
> see a point.  I will not object if somebody decides to do it.

The old code reblocked es into smaller objects and got the type punning
for this wrong.
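
From memory, the old reblocking was essentially this, modulo it being a
macro parameterized on TYPE; the casts are where the type punning goes
wrong:

XX #include <stddef.h>
XX
XX /*
XX  * Roughly the old swapcode: swap n bytes, sizeof(long) bytes at a
XX  * time, by punning the char pointers to long *.  The casts break the
XX  * alignment and effective-type (aliasing) rules, although they
XX  * usually work on x86.
XX  */
XX static void
XX swapcode_long(char *parmi, char *parmj, size_t n)
XX {
XX 	long *pi = (long *)parmi;	/* the problematic punning */
XX 	long *pj = (long *)parmj;
XX 	size_t i = n / sizeof(long);
XX
XX 	do {
XX 		long t = *pi;
XX
XX 		*pi++ = *pj;
XX 		*pj++ = t;
XX 	} while (--i > 0);
XX }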

I checked the speed of qsort().  It is O(n*log(n)) for both swaps and
comparisons.  However, I think most uses of it have small elements, since
for large data the elements should be keys or pointers to avoid swapping
the data.  The old version is almost optimal for swapping pointers.
clang's sophisticated reblocking of byte swaps is really bad for swapping
pointers; on amd64:

XX 	movq	plen(%rip), %r10
XX 	testq	%r10, %r10
XX 	je	.LBB2_18

plen is 8, so don't branch.

XX # %bb.1:                                # %for.body.lr.ph.i
XX 	movq	q(%rip), %rcx
XX 	movq	p(%rip), %rdx
XX 	cmpq	$32, %r10
XX 	jb	.LBB2_12

plen is < 32, so branch.

XX 	....
XX .LBB2_12:                               # %for.body.i8.preheader
XX 	leaq	-1(%r10), %rsi
XX 	movq	%r10, %rdi
XX 	andq	$3, %rdi
XX 	je	.LBB2_15

plen is 0 mod 4, so branch.

XX ...
XX .LBB2_15:                               # %for.body.i8.prol.loopexit
XX 	cmpq	$3, %rsi
XX 	jb	.LBB2_18

plen - 1 is not below 3, so don't branch.

XX # %bb.16:                               # %for.body.i8.preheader.new
XX 	xorl	%esi, %esi
XX 	.p2align	4, 0x90
XX .LBB2_17:                               # %for.body.i8
XX                                         # =>This Inner Loop Header: Depth=1
XX 	movzbl	(%rdx,%rsi), %edi
XX 	movzbl	(%rcx,%rsi), %eax
XX 	movb	%al, (%rdx,%rsi)
XX 	movb	%dil, (%rcx,%rsi)
XX 	movzbl	1(%rdx,%rsi), %edi
XX 	movzbl	1(%rcx,%rsi), %eax
XX 	movb	%al, 1(%rdx,%rsi)
XX 	movb	%dil, 1(%rcx,%rsi)
XX 	movzbl	2(%rdx,%rsi), %edi
XX 	movzbl	2(%rcx,%rsi), %eax
XX 	movb	%al, 2(%rdx,%rsi)
XX 	movb	%dil, 2(%rcx,%rsi)
XX 	movzbl	3(%rdx,%rsi), %edi
XX 	movzbl	3(%rcx,%rsi), %eax
XX 	movb	%al, 3(%rdx,%rsi)
XX 	movb	%dil, 3(%rcx,%rsi)
XX 	addq	$4, %rsi

So it ends up swapping a byte at a time after all, with the loop unrolled
to handle 4 bytes per iteration.

XX 	cmpq	%rsi, %r10
XX 	jne	.LBB2_17

This iterates twice for the 8-byte pointer.

All this instead of the 2 loads and 2 stores that the old code might have
generated.

Since es is not always 8, the old code must generate more than 2 loads
and 2 stores -- it will have to reblock using similar branches to the
above.  It only uses 2 block sizes (sizeof(long) and 1), but the compiler
might pessimize this to the same as the above.
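
In effect the old code chose between the two block sizes like this (a sketch
from memory; the long-sized path is spelled with memcpy() here, i.e. the
fixed form of the same structure, and the old code made the choice once per
qsort() call instead of per swap):

XX #include <stdint.h>
XX #include <string.h>
XX
XX /*
XX  * Use long-sized blocks when the element size and alignment allow it,
XX  * else fall back to byte-sized blocks.  When es == sizeof(long) (the
XX  * pointer case), the first loop runs exactly once and is the 2-load,
XX  * 2-store swap.
XX  */
XX static void
XX swapelems(char *a, char *b, size_t es)
XX {
XX 	size_t i;
XX
XX 	if (((uintptr_t)a | (uintptr_t)b | es) % sizeof(long) == 0) {
XX 		for (i = 0; i < es; i += sizeof(long)) {
XX 			long t;
XX
XX 			memcpy(&t, a + i, sizeof(t));
XX 			memcpy(a + i, b + i, sizeof(t));
XX 			memcpy(b + i, &t, sizeof(t));
XX 		}
XX 	} else {
XX 		for (i = 0; i < es; i++) {
XX 			char t = a[i];
XX
XX 			a[i] = b[i];
XX 			b[i] = t;
XX 		}
XX 	}
XX }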

clang actually generates enormous^2 code with the current version and
enormous^3 code with the previous version.  For the current version it
generates 1742 lines, 192 of them movups instructions; for the previous
version, 4823 lines with 576 movups instructions.  gcc -O2 generates 619
lines for the current version and 1053 lines for the previous version.

So I don't mind the simplification.  It would take lots of __predict_ugly()
annotations to prevent unrolling that pessimizes the usual case.

Bruce


