Date: Tue, 31 Jan 2017 19:31:28 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer <cem@freebsd.org>
Cc: Bruce Evans <brde@optusnet.com.au>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID: <20170131175309.N1418@besplex.bde.org>
In-Reply-To: <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>
References: <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>
On Mon, 30 Jan 2017, Conrad Meyer wrote:

> On Mon, Jan 30, 2017 at 9:26 PM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Conrad E. Meyer wrote:
>>
>>> Log:
>>>   calculate_crc32c: Add SSE4.2 implementation on x86
>>
>> This breaks building with gcc-4.2.1,
>
> gcc-4.2.1 is an ancient compiler.  Good riddance.

I prefer it.

>>> Added: head/sys/libkern/x86/crc32_sse42.c
>>>
>>> ==============================================================================
>>> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
>>> +++ head/sys/libkern/x86/crc32_sse42.c	Tue Jan 31 03:26:32 2017	(r313006)
>>> +
>>> +#include <nmmintrin.h>
>>
>> ...
>>
>> Inline asm is much less unportable than intrinsics.  kib used the correct
>> method of .byte's in asms to avoid depending on assembler support for newer
>> instructions.  .byte is still used for clflush on amd64 and i386.  It
>> used to be used for invpcid on amd64.  I can't find where it is or was
>> used for xsave stuff.
>
> Konstantin predicted this complaint in code review (phabricator).
> Unfortunately, Clang does not automatically unroll asms, even with the
> correct mnemonic.  Unrolling is essential to performance below the
> by-3 block size (768 bytes in this implementation).  Hand unrolling in
> C seems to generate less efficient assembly than the compiler's
> unrolling.

Unrolling is almost completely useless if the instruction takes 3 cycles
like you said it does.  The loop calculations run much faster than that,
so they can run in parallel.

However, clang generates horrible code for inline asms instead of
intrinsics.  Simple asms can be compiled even by gcc (apparently old
binutils supports SSE4):

X Index: crc32_sse42.c
X ===================================================================
X --- crc32_sse42.c	(revision 313007)
X +++ crc32_sse42.c	(working copy)
X @@ -38,8 +38,29 @@
X  #include <sys/systm.h>
X  #endif
X
X -#include <nmmintrin.h>
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	__asm("crc32b %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X
X +static __inline uint32_t __unused
X +_mm_crc32_u32(uint32_t x, int32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +
X +#ifdef __amd64__
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, int64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "rm" (y));
X +	return (x);
X +}
X +#endif
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78

I only tested this at compile time.

clang generates horrible code.  Starting with y in an array in memory,
it always copies y to the stack and does a memory access there.  There is
nothing special about these inline asms, so I think clang does the same
pessimizations for most asms in cpufunc.h.  jkim got annoyed by this for
something like rdtsc() and changed too much to reduce the problem for
just one case.

The differences in the generated code look like: intrinsics (old <) vs
inline asm (new >):

28,29c29,33
< 	movl	%edi, %eax
< 	crc32b	(%rsi), %eax
---
> 	movzbl	(%rsi), %eax
> 	movb	%al, -10(%rbp)
> 	#APP
> 	crc32b	-10(%rbp), %edi
> 	#NO_APP

Similarly for crc32q, except the load doesn't do an extension.
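An untested sketch of the same wrapper with the memory alternative dropped
from the input constraint (the _reg name here is just for illustration):
with only a register alternative, clang has to materialize y in a register
itself instead of bouncing it through the stack, at the cost of never being
able to fold a load into the instruction:

static __inline uint32_t
_mm_crc32_u8_reg(uint32_t x, uint8_t y)
{
	/* "q" asks for a byte-addressable register; no "m" alternative. */
	__asm("crc32b %1,%0" : "+r" (x) : "q" (y));
	return (x);
}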
157,181d192
< 	.p2align	4, 0x90
< .LBB0_19:                               # %while.body69
<                                         # =>This Inner Loop Header: Depth=1
< 	crc32b	(%rsi,%rdx), %eax
< 	crc32b	1(%rsi,%rdx), %eax
< 	crc32b	2(%rsi,%rdx), %eax
< 	crc32b	3(%rsi,%rdx), %eax
< 	crc32b	4(%rsi,%rdx), %eax
< 	crc32b	5(%rsi,%rdx), %eax
< 	crc32b	6(%rsi,%rdx), %eax
< 	crc32b	7(%rsi,%rdx), %eax

This is the only clang unrolling.

< 	addq	$8, %rdx
< 	testl	%edx, %edx
< 	jne	.LBB0_19
< # BB#20:                                # %while.end75.loopexit.unr-lcssa.loopexit
< 	addq	%rdx, %rsi
< .LBB0_21:                               # %while.end75.loopexit.unr-lcssa
< 	testl	%ecx, %ecx
< 	je	.LBB0_24
< # BB#22:                                # %while.body69.epil.preheader
184c195
< .LBB0_23:                               # %while.body69.epil
---
> .LBB0_18:                               # %while.body69
186c197,201
< 	crc32b	(%rsi), %eax
---
> 	movzbl	(%rsi), %edx
> 	movb	%dl, -9(%rbp)
> 	#APP
> 	crc32b	-9(%rbp), %eax
> 	#NO_APP

With inline asm, clang doesn't unroll this byte loop.

gcc generates nice-looking code:

G 	#APP
G 	crc32q (%r8),%rcx
G 	crc32q 8192(%r8),%rdi
G 	crc32q 16384(%r8),%r10
G 	#NO_APP
G ...
G 	#APP
G 	crc32q (%r8),%rcx
G 	crc32q 256(%r8),%rdi
G 	crc32q 512(%r8),%r10
G 	#NO_APP
G ...
G .L97:
G 	movl	%ecx, %r8d
G 	decl	%edx
G 	#APP
G 	crc32b -1(%rsi),%r8d
G 	#NO_APP
G 	mov	%r8d, %ecx
G .L96:
G 	incq	%rsi
G 	testl	%edx, %edx
G 	jne	.L97

It doesn't do the copying or the unrolling.

> The left column below is block size.  The measurements are nanoseconds
> per buf, per CLOCK_VIRTUAL, averaged over 10^5 loops.  These numbers
> do not vary more than +/- 1ns run to run on my idle Sandy Bridge
> laptop.  "asm" is using __asm__(), "intrins" using the _mm_crc32
> intrinsics that Clang can unroll, and multitable is the older
> lookup-table implementation (still used on other architectures).
>
> 0x000010: asm:0 intrins:0 multitable:0  (ns per buf)
> 0x000020: asm:7 intrins:9 multitable:78  (ns per buf)
> 0x000040: asm:10 intrins:7 multitable:50  (ns per buf)
> 0x000080: asm:15 intrins:9 multitable:91  (ns per buf)
> 0x000100: asm:25 intrins:17 multitable:178  (ns per buf)
> 0x000200: asm:55 intrins:38 multitable:347  (ns per buf)
> 0x000400: asm:61 intrins:62 multitable:684  (ns per buf)
>
> Both implementations are superior to the multitable approach, so it is
> unreasonable not to make one of them standard on x86 platforms.
>
> The unrolled intrinsics are consistently better than not unrolled on
> objects 0x40-0x200 bytes large.  At 0x400 bytes we pass the first
> unroll-by-3 threshold and it stops mattering as much.
>
> At 0x40 bytes, it is the difference between 6.4 GB/s and 9.1 GB/s.  At
> 0x200 bytes, it is the difference between 9.3 GB/s and 13.5 GB/s.  I
> think this justifies some minor ugliness.

If it matters, which I doubt, then the compiler shouldn't be trusted for
unrolling.  It can only do better than handwritten unrolling if it has
tables of the correct amount of unrolling for every instruction for
thousands of CPUs (since such tables are too hard to write manually).

clang seems to only unroll the byte loop.  Perhaps that is useless, since
if the byte loop is reached then things are already slow.  I get this by
only noticing large differences for the byte loop in the above diff, and
then static instruction counting.  All of clang with intrinsics, clang
with asms and gcc with asms generate the same number of crc32q's; clang
with asms and gcc with asms generate the same number of crc32b's; so the
only difference seems to be extra crc32b's for the unrolling.  You get
differences of a few nsec per buf, but you say that this is for
_mm_crc32.  I don't see any unrolling for _mm_crc32.
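For what it's worth, numbers in the quoted format can be produced with a
trivial userland harness along these lines.  This is only a sketch, not
the benchmark actually used above: the prototype of calculate_crc32c() is
assumed here, and the crc32 source would have to be compiled into the test
program separately.

#include <sys/types.h>

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumed prototype; the implementation is not included here. */
uint32_t calculate_crc32c(uint32_t crc, const unsigned char *buf,
    unsigned int len);

static volatile uint32_t sink;

static double
ns_per_buf(unsigned int len)
{
	struct timespec t0, t1;
	unsigned char *buf;
	uint32_t crc;
	int i, loops;

	loops = 100000;
	buf = malloc(len);
	memset(buf, 0x5a, len);
	crc = ~0u;
	clock_gettime(CLOCK_VIRTUAL, &t0);
	for (i = 0; i < loops; i++)
		crc = calculate_crc32c(crc, buf, len);
	clock_gettime(CLOCK_VIRTUAL, &t1);
	sink = crc;		/* keep the calls from being optimized away */
	free(buf);
	return (((t1.tv_sec - t0.tv_sec) * 1e9 +
	    (t1.tv_nsec - t0.tv_nsec)) / loops);
}

int
main(void)
{
	unsigned int len;

	for (len = 0x10; len <= 0x400; len *= 2)
		printf("0x%06x: %.0f ns per buf\n", len, ns_per_buf(len));
	return (0);
}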
The full byte loop with clang pessimizations is:

C 	.p2align	4, 0x90
C .LBB0_18:                               # %while.body69
C                                         # =>This Inner Loop Header: Depth=1
C 	movzbl	(%rsi), %edx
C 	movb	%dl, -9(%rbp)
C 	#APP
C 	crc32b	-9(%rbp), %eax
C 	#NO_APP
C 	incq	%rsi
C 	incl	%ecx
C 	jne	.LBB0_18

That has too many instructions, but many of them can be done in parallel.
gcc increments the counter at the beginning of the loop, but then has to
do an extra instruction (testl) to check the status just before the
branch.  clang's extra instructions to copy to the stack can't help.  I'm
not sure if the overhead for this is mostly hidden by parallelism, or if
it creates serious extra dependencies.

The unrolling is mediocre at best:

< 	crc32b	(%rsi,%rdx), %eax
< 	crc32b	1(%rsi,%rdx), %eax
< 	crc32b	2(%rsi,%rdx), %eax
< 	crc32b	3(%rsi,%rdx), %eax
< 	crc32b	4(%rsi,%rdx), %eax
< 	crc32b	5(%rsi,%rdx), %eax
< 	crc32b	6(%rsi,%rdx), %eax
< 	crc32b	7(%rsi,%rdx), %eax

Why not a single crc32q for this?  I don't know exactly what the
instruction does, but there must be ways to calculate with more than a
byte at a time.  Every instruction in this has a dependency on the result
of the previous one.  It might be better to do crc32b into 8 different
registers and combine the results.  Similar techniques in FP code give
speedups of about a factor of 2.

Inet checksum (in_cksum.c) might be more important than this, but is
still pessimized on amd64 and only "optimized" for 486-era CPUs on i386.
i386 uses asms, but these had bugs related to not using __volatile enough
and/or not declaring enough clobbers, and this was "fixed" for amd64 by
not using asms but regressing to a C MD version of in_cksum.c that is
little different from the C MI version.  My i386 version has some updates
for newer CPUs, but I never saw a significant difference from this.  The
"optimizations" have similar problems to the above.  The main loop doing
32-bit accesses might be OK, but the setup and exit code with byte loops
in it has large overheads and is just as hard to unroll correctly; the
code doesn't try to unroll it manually and ends up quite slow.  NetBSD
wrote the entire in_cksum for i386 in asm.  This was faster in the i586
era, but would be difficult to maintain.  Every CPU needs different
scheduling for best results.  OTOH, newer CPUs have better scheduling in
hardware.

I looked at the source code.  Unrolling for the byte loop is just a
pessimization.  This loop is just for trailing bytes.  The main loop
should have used wider accesses, leaving < 8 bytes to handle at the end
on amd64, and < 4 bytes on i386.  Thus the 8 unrolled crc32b's in the
above (for amd64) are never reached, but it takes an extra branch to not
reach them.

The real byte loop at the end for the intrinsics case is:

X 	.p2align	4, 0x90
X .LBB0_23:                               # %while.body69.epil
X                                         # =>This Inner Loop Header: Depth=1
X 	crc32b	(%rsi), %eax
X 	incq	%rsi
X 	incl	%ecx
X 	jne	.LBB0_23

This is the same as in the asm case except it is missing the
pessimization.

The other loops have lots of manual unrolling which is apparently too
hard for clang to understand and unroll further.  The main step seems to
be just manual unrolling into 3 crc32[lq]'s at a time.  This 3 looks like
an optimization for i386 -- i386 barely has 3 registers to hold the
results.  Or perhaps 3 is exactly right for the latency/throughput of
crc32[lq] on the author's hardware.  3 would go well with a latency of 3
cycles and a throughput of 3 per 3 cycles.  Then the extra instructions
for the asm would mess up the latency/throughput hard-coding.
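To put the "wider accesses" point in concrete terms, a single-stream
version with no unrolling at all reduces to something like the following
on amd64 (an untested illustration using the asm wrappers from the patch
above; the name crc32c_onestream is made up here).  The trailing byte
loop then runs at most 7 times, so unrolling it buys nothing; the
committed code roughly layers the aligning prologue and the 3-way
unrolled blocks on top of this skeleton:

static uint32_t
crc32c_onestream(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	/* Main loop: 8 bytes per crc32q (unaligned loads are fine on x86). */
	while (len >= 8) {
		crc = (uint32_t)_mm_crc32_u64(crc, *(const uint64_t *)p);
		p += 8;
		len -= 8;
	}
	/* At most 7 trailing bytes. */
	while (len > 0) {
		crc = _mm_crc32_u8(crc, *p);
		p++;
		len--;
	}
	return (crc);
}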
Here is one of the loops with this 3-way unrolling, for the asm case:

C 	.p2align	4, 0x90
C .LBB0_7:                                # %do.body
C                                         #   Parent Loop BB0_6 Depth=1
C                                         # =>  This Inner Loop Header: Depth=2
C 	movq	%rsi, %rbx
C 	movq	-16384(%rbx), %rsi
C 	movq	%rsi, -72(%rbp)
C 	#APP
C 	crc32q	-72(%rbp), %rax
C 	#NO_APP
C 	movq	-8192(%rbx), %rsi
C 	movq	%rsi, -64(%rbp)
C 	#APP
C 	crc32q	-64(%rbp), %rdi
C 	#NO_APP
C 	movq	(%rbx), %rsi
C 	movq	%rsi, -56(%rbp)
C 	#APP
C 	crc32q	-56(%rbp), %rcx
C 	#NO_APP
C 	leaq	8(%rbx), %rsi
C 	addq	$-16376, %rbx           # imm = 0xC008
C 	cmpq	%r8, %rbx
C 	jb	.LBB0_7

This has 2 extra instructions before every crc32q.  If 3 crc32q's in
parallel leave no spare resources (or if fewer than 3 can run in
parallel), then there is nothing to spare for loop control or for the
extra instructions.  The loss would be quite small, however: just a
couple of cycles extra if 3 crc32q's take 3 cycles with nothing to spare.
L2 (and L1?) cache misses already cost more.

Bruce