Date:      Tue, 31 Jan 2017 19:31:28 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Conrad Meyer <cem@freebsd.org>
Cc:        Bruce Evans <brde@optusnet.com.au>,  src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID:  <20170131175309.N1418@besplex.bde.org>
In-Reply-To: <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>
References:  <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>

On Mon, 30 Jan 2017, Conrad Meyer wrote:

> On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Conrad E. Meyer wrote:
>>
>>> Log:
>>>  calculate_crc32c: Add SSE4.2 implementation on x86
>>
>> This breaks building with gcc-4.2.1,
>
> gcc-4.2.1 is an ancient compiler.  Good riddance.

I prefer it.

>>> Added: head/sys/libkern/x86/crc32_sse42.c
>>>
>>> ==============================================================================
>>> --- /dev/null   00:00:00 1970   (empty, because file is newly added)
>>> +++ head/sys/libkern/x86/crc32_sse42.c  Tue Jan 31 03:26:32 2017
>>> (r313006)
>>> +
>>> +#include <nmmintrin.h>
>>
>> ...
>>
>> Inline asm is much less unportable than intrinsics.  kib used the correct
>> method of .byte's in asms to avoid depending on assembler support for newer
>> instructions.  .byte is still used for clflush on amd64 and i386.  It
>> used to be used for invpcid on amd64.  I can't find where it is or was
>> used for xsave stuff.
>
> Konstantin predicted this complaint in code review (phabricator).
> Unfortunately, Clang does not automatically unroll asms, even with the
> correct mnemonic.  Unrolling is essential to performance below the
> by-3 block size (768 bytes in this implementation).  Hand unrolling in
> C seems to generate less efficient assembly than the compiler's
> unrolling.

Unrolling is almost completely useless if the instruction takes 3 cycles
like you said it does.  The loop calculations run much faster than that,
so they can run in parallel.  However, clang generates horrible code for
inline asms compared with intrinsics.

Simple asms can be compiled even by gcc (apparently old binutils supports
SSE4):

X Index: crc32_sse42.c
X ===================================================================
X --- crc32_sse42.c	(revision 313007)
X +++ crc32_sse42.c	(working copy)
X @@ -38,8 +38,29 @@
X  #include <sys/systm.h>
X  #endif
X 
X -#include <nmmintrin.h>
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	__asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X 
X +static __inline uint32_t __unused
X +_mm_crc32_u32(uint32_t x, int32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +
X +#ifdef __amd64__
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, int64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +#endif
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78

I only tested this at compile time.  clang generates horrible code.
Starting with y in an array in memory, it always copies y to the
stack and does a memory access there.  There is nothing special about
these inline asms, so I think clang does the same pessimizations for
most asms in cpufunc.h.  jkim got annoyed by this for something like
rdtsc() and changed too much to reduce the problem for just one case.
The differences in the generated code look like:

intrinsics (old <) vs inline asm (new >):
28,29c29,33
< 	movl	%edi, %eax
< 	crc32b	(%rsi), %eax
---
> 	movzbl	(%rsi), %eax
> 	movb	%al, -10(%rbp)
> 	#APP
> 	crc32b	-10(%rbp), %edi
> 	#NO_APP

Similarly for crc32q, except the load doesn't do an extension.

157,181d192
< 	.p2align	4, 0x90
< .LBB0_19:                               # %while.body69
<                                         # =>This Inner Loop Header: Depth=1
< 	crc32b	(%rsi,%rdx), %eax
< 	crc32b	1(%rsi,%rdx), %eax
< 	crc32b	2(%rsi,%rdx), %eax
< 	crc32b	3(%rsi,%rdx), %eax
< 	crc32b	4(%rsi,%rdx), %eax
< 	crc32b	5(%rsi,%rdx), %eax
< 	crc32b	6(%rsi,%rdx), %eax
< 	crc32b	7(%rsi,%rdx), %eax

This is the only clang unrolling.

< 	addq	$8, %rdx
< 	testl	%edx, %edx
< 	jne	.LBB0_19
< # BB#20:                                # %while.end75.loopexit.unr-lcssa.loopexit
< 	addq	%rdx, %rsi
< .LBB0_21:                               # %while.end75.loopexit.unr-lcssa
< 	testl	%ecx, %ecx
< 	je	.LBB0_24
< # BB#22:                                # %while.body69.epil.preheader
184c195
< .LBB0_23:                               # %while.body69.epil
---
> .LBB0_18:                               # %while.body69
186c197,201
< 	crc32b	(%rsi), %eax
---
> 	movzbl	(%rsi), %edx
> 	movb	%dl, -9(%rbp)
> 	#APP
> 	crc32b	-9(%rbp), %eax
> 	#NO_APP

With inline asm, clang doesn't unroll this byte loop.

gcc generates nice-looking code:

G #APP
G 	crc32q (%r8),%rcx
G 	crc32q 8192(%r8),%rdi
G 	crc32q 16384(%r8),%r10
G #NO_APP
G ...
G #APP
G 	crc32q (%r8),%rcx
G 	crc32q 256(%r8),%rdi
G 	crc32q 512(%r8),%r10
G #NO_APP
G ...
G .L97:
G 	movl	%ecx, %r8d
G 	decl	%edx
G #APP
G 	crc32b -1(%rsi),%r8d
G #NO_APP
G 	mov	%r8d, %ecx
G .L96:
G 	incq	%rsi
G 	testl	%edx, %edx
G 	jne	.L97

It doesn't do the copying or the unrolling.
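
A possible workaround for the copying (just a sketch; I haven't looked at
what clang generates for it) is to drop the "m" alternative from the input
constraint, so that the only alternative left is a register and there is
nothing to bounce through the stack:

X static __inline uint32_t
X _mm_crc32_u8(uint32_t x, uint8_t y)
X {
X 	/* "r" only: y has to be in a register already. */
X 	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X 	return (x);
X }

This gives up direct memory source operands for the asm, but clang wasn't
using them anyway.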

> The left column below is block size.  The measurements are nanoseconds
> per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops.  These numbers
> do not vary more than +/- 1ns run to run on my idle Sandy Bridge
> laptop.  "asm" is using __asm__(), "intrins" using the _mm_crc32
> intrinsics that Clang can unroll, and multitable is the older
> lookup-table implementation (still used on other architectures).
>
> 0x000010: asm:0 intrins:0 multitable:0  (ns per buf)
> 0x000020: asm:7 intrins:9 multitable:78  (ns per buf)
> 0x000040: asm:10 intrins:7 multitable:50  (ns per buf)
> 0x000080: asm:15 intrins:9 multitable:91  (ns per buf)
> 0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
> 0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
> 0x000400: asm:61 intrins:62 multitable:684  (ns per buf)
>
> Both implementations are superior to the multitable approach, so it is
> unreasonable not to make one of them standard on x86 platforms.
>
> The unrolled intrinsics are consistently better than not unrolled on
> objects 0x40-0x200 bytes large.  At 0x400 bytes we pass the first
> unroll-by-3 threshold and it stops mattering as much.
>
> At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s.  At
> 0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s.  I
> think this justifies some minor ugliness.

If it matters, which I doubt, then the compiler shouldn't be trusted
for unrolling.  It can only do better than handwritten unrolling if
it has tables of the correct amount of unrolling for every instruction
for thousands of CPUs (since such tables are too hard to write manually).

clang seems to unroll only the byte loop.  Perhaps that is useless,
since if the byte loop is reached then things are already slow.  I get
this from noticing large differences only for the byte loop in the above
diff, and from static instruction counting: clang with intrinsics,
clang with asms and gcc with asms all generate the same number of crc32q's;
clang with asms and gcc with asms generate the same number of crc32b's;
so the only difference seems to be extra crc32b's from the unrolling.

You get differences of a few nsec per buf, but say that this is from
unrolling of _mm_crc32.  I don't see any unrolling for _mm_crc32 apart
from the byte loop.  The full byte loop with clang pessimizations is:

C 	.p2align	4, 0x90
C .LBB0_18:                               # %while.body69
C                                         # =>This Inner Loop Header: Depth=1
C 	movzbl	(%rsi), %edx
C 	movb	%dl, -9(%rbp)
C 	#APP
C 	crc32b	-9(%rbp), %eax
C 	#NO_APP
C 	incq	%rsi
C 	incl	%ecx
C 	jne	.LBB0_18

That has too many instructions, but many of them can be done in parallel.
gcc increments the counter at the beginning of the loop, but then has
to do an extra instruction (testl) to check the status just before the
branch.  clang's extra instructions to copy to the stack can't help.
I'm not sure if the overhead for this is mostly hidden by parallelism,
or if it creates serious extra dependencies.

The unrolling is mediocre at best:

< 	crc32b	(%rsi,%rdx), %eax
< 	crc32b	1(%rsi,%rdx), %eax
< 	crc32b	2(%rsi,%rdx), %eax
< 	crc32b	3(%rsi,%rdx), %eax
< 	crc32b	4(%rsi,%rdx), %eax
< 	crc32b	5(%rsi,%rdx), %eax
< 	crc32b	6(%rsi,%rdx), %eax
< 	crc32b	7(%rsi,%rdx), %eax

Why not a single crc32q for this?  I don't know exactly what the instruction
does, but there must be ways to calculate with more than a byte at a time.

Every instruction in this has a dependency on the result of the previous
one.  It might be better to do crc32b into 8 different registers and
combine the results.  Similar techniques in FP code give speedups of about
a factor of 2.

Inet checksum (in_cksum.c) might be more important than this, but is still
pessimized on amd64 and only "optimized" for 486-era CPUs on i386.  i386
uses asms, but these had bugs related to not using __volatile enough
and/or not declaring enough clobbers, and this was "fixed" for amd64 by not
using asms but regressing to a C MD version of in_cksum.c that is little
different from the C MI version.  My i386 version has some updates for
newer CPUs, but I never saw a significant difference from this.  The
"optimizations" have similar problems to the above.  The main loop doing
32-bit accesses might be OK, but setup and exit code with byte loops in
it has large overheads and is just as hard to unroll correctly; the code
doesn't try to unroll it manually and ends up quite slow.  NetBSD wrote
the entire in_cksum for i386 in asm.  This was faster in the i586 era,
but would be difficult to maintain.  Every CPU needs different scheduling
for best results.  OTOH, newer CPUs have better scheduling in hardware.
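
For comparison, the easy part of in_cksum is just this kind of loop (a
minimal sketch with a made-up name, assuming a 32-bit aligned buffer and a
multiple-of-4 length; the setup and exit code that it leaves out is exactly
where the MD versions waste their effort):

X static uint16_t
X cksum32(const uint32_t *p, int len)
X {
X 	uint64_t sum;
X 
X 	/* Main loop: 32-bit accesses; carries pile up in the high bits. */
X 	for (sum = 0; len >= 4; len -= 4)
X 		sum += *p++;
X 	/* Fold the carries back in to get the 16-bit ones' complement sum. */
X 	while (sum > 0xffff)
X 		sum = (sum & 0xffff) + (sum >> 16);
X 	return (~sum & 0xffff);
X }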

I looked at the source code.  Unrolling for the byte loop is just a
pessimization.  This loop is just for trailing bytes.  The main loop
should have used wider accesses, leaving < 8 bytes to handle at the
end on amd64, and < 4 bytes on i386.  Thus the 8 unrolled crc32b's in
the above (for amd64) are never reached, but it takes an extra branch
to not reach them.  The real byte loop at the end for the intrinsics
case is:

X 	.p2align	4, 0x90
X .LBB0_23:                               # %while.body69.epil
X                                         # =>This Inner Loop Header: Depth=1
X 	crc32b	(%rsi), %eax
X 	incq	%rsi
X 	incl	%ecx
X 	jne	.LBB0_23

This is the same as in the asm case except it is missing the pessimization.
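
Since at most 7 bytes remain there on amd64, the tail doesn't even need a
loop.  Something like this would do (a sketch with a made-up name, using the
standard _mm_crc32_u16/_mm_crc32_u32 intrinsics or crc32w/crc32l wrappers
like the ones above; untested):

X static __inline uint32_t
X crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
X {
X 	/* len < 8 here: one access per remaining power of 2. */
X 	if (len & 4) {
X 		crc = _mm_crc32_u32(crc, *(const uint32_t *)p);
X 		p += 4;
X 	}
X 	if (len & 2) {
X 		crc = _mm_crc32_u16(crc, *(const uint16_t *)p);
X 		p += 2;
X 	}
X 	if (len & 1)
X 		crc = _mm_crc32_u8(crc, *p);
X 	return (crc);
X }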

The other loops have lots of manual unrolling which is apparently too
hard for clang to understand well enough to unroll further.  The main step seems
to be just manually unrolling into 3 crc32[lq]'s at a time.  This 3
looks like an optimization for i386 -- i386 barely has 3 registers to
hold the results.  Or perhaps 3 is exactly right for the latency/throughput
of crc32[lq] on the author's hardware.  3 would go well with a latency of
3 cycles and a throughput of 3 per 3 cycles.  Then the extra instructions
for the asm would mess up the latency/throughput hard-coding.  Here is
one of the loops with this 3-way unrolling, for the asm case:

C 	.p2align	4, 0x90
C .LBB0_7:                                # %do.body
C                                         #   Parent Loop BB0_6 Depth=1
C                                         # =>  This Inner Loop Header: Depth=2
C 	movq	%rsi, %rbx
C 	movq	-16384(%rbx), %rsi
C 	movq	%rsi, -72(%rbp)
C 	#APP
C 	crc32q	-72(%rbp), %rax
C 	#NO_APP
C 	movq	-8192(%rbx), %rsi
C 	movq	%rsi, -64(%rbp)
C 	#APP
C 	crc32q	-64(%rbp), %rdi
C 	#NO_APP
C 	movq	(%rbx), %rsi
C 	movq	%rsi, -56(%rbp)
C 	#APP
C 	crc32q	-56(%rbp), %rcx
C 	#NO_APP
C 	leaq	8(%rbx), %rsi
C 	addq	$-16376, %rbx           # imm = 0xC008
C 	cmpq	%r8, %rbx
C 	jb	.LBB0_7

This has 2 extra instructions before every crc32q.  If 3 crc32q's in
parallel leave no spare resources (or if fewer than 3 can run in
parallel), then there is nothing to spare for loop control or for the
extra instructions.  The loss would be quite small, however: just
a couple of cycles extra if 3 crc32q's take 3 cycles with nothing to
spare.  L2 (and L1?) cache misses already cost more.
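
For reference, the C shape behind the 3-way unrolling is roughly the
following (a reconstructed sketch with a made-up name, not quoted from
crc32_sse42.c; the final merge of the 3 partial CRCs through the
implementation's shift tables is left out):

X static __inline void
X crc32c_3way_256(uint64_t *crc0, uint64_t *crc1, uint64_t *crc2,
X     const unsigned char *p)
X {
X 	const unsigned char *end = p + 256;
X 
X 	/* 3 independent dependency chains, one per 256-byte block. */
X 	do {
X 		*crc0 = _mm_crc32_u64(*crc0, *(const uint64_t *)p);
X 		*crc1 = _mm_crc32_u64(*crc1, *(const uint64_t *)(p + 256));
X 		*crc2 = _mm_crc32_u64(*crc2, *(const uint64_t *)(p + 512));
X 		p += 8;
X 	} while (p < end);
X }

3 x 256 = 768 bytes is where the by-3 block size above comes from; the
8192/16384 offsets in the gcc output are the same thing for the large-block
loop.  The 3 independent chains keep 3 crc32q's in flight at once, which is
exactly right for a latency of 3 cycles and a throughput of 1 per cycle.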

Bruce


