Date:        Fri, 3 Feb 2017 07:29:05 +1100 (EST)
From:        Bruce Evans <brde@optusnet.com.au>
To:          Konstantin Belousov <kostikbel@gmail.com>
Cc:          "Conrad E. Meyer" <cem@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:     Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID:  <20170203062806.A2690@besplex.bde.org>
In-Reply-To: <20170202184819.GP2092@kib.kiev.ua>
References:  <201701310326.v0V3QW30024375@repo.freebsd.org> <20170202184819.GP2092@kib.kiev.ua>
On Thu, 2 Feb 2017, Konstantin Belousov wrote:

> On Tue, Jan 31, 2017 at 03:26:32AM +0000, Conrad E. Meyer wrote:
>> +	compile-with	"${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \
>
> BTW, new gcc has -mcrc32 option, but clang 3.9.1 apparently does not.

I've almost finished fixing and optimizing this.  I didn't manage to fix
all the compiler pessimizations, but the result is within 5% of optimal
for buffers larger than a few K.

See the previous patch for -msse4 removal.  It might be good to add
-fno-unroll-functions to CFLAGS to prevent some clang pessimizations.
Or -Os might work even better.

X Index: libkern/x86/crc32_sse42.c
X ===================================================================
X --- libkern/x86/crc32_sse42.c	(revision 313090)
X +++ libkern/x86/crc32_sse42.c	(working copy)
X @@ -31,15 +31,41 @@
X   */
X  #ifdef USERSPACE_TESTING
X  #include <stdint.h>
X +#include <stdlib.h>
X  #else
X  #include <sys/param.h>
X +#include <sys/systm.h>
X  #include <sys/kernel.h>
X -#include <sys/libkern.h>
X -#include <sys/systm.h>
X  #endif
X
X -#include <nmmintrin.h>
X +static __inline uint32_t
X +_mm_crc32_u8(uint32_t x, uint8_t y)
X +{
X +	/*
X +	 * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y)
X +	 * significantly and "r" (y) a lot by copying y to a different
X +	 * local variable (on the stack or in a register), so only use
X +	 * the latter.  This costs a register and an instruction but
X +	 * not a uop.
X +	 */
X +	__asm("crc32b %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}

I haven't found a better fix.  Even gcc likes to pessimize this by
preferring the "r" constraint in "rm", so "r" should be disparaged.  It
does this to try to schedule loop control instructions in between the
memory reference and the operation using it, but it gets this scheduling
mostly wrong.  -Os stops this for gcc, but clang keeps producing the
large and slow explicit load (then a store to a different place in
memory to use there if the constraint allows "m") even with -Os.

X
X +static __inline uint32_t
X +_mm_crc32_u32(uint32_t x, uint32_t y)
X +{
X +	__asm("crc32l %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X +static __inline uint64_t
X +_mm_crc32_u64(uint64_t x, uint64_t y)
X +{
X +	__asm("crc32q %1,%0" : "+r" (x) : "r" (y));
X +	return (x);
X +}
X +
X  /* CRC-32C (iSCSI) polynomial in reversed bit order. */
X  #define POLY	0x82f63b78
X
X @@ -47,12 +73,14 @@
X   * Block sizes for three-way parallel crc computation.  LONG and SHORT must
X   * both be powers of two.
X   */
X -#define LONG	8192
X -#define SHORT	256
X +#define LONG	128
X +#define SHORT	64

These are now the smallest that work right on Haswell.  Practical values
should be larger in case the timing is too different for the manual
scheduling to work.

Small buffers are still slow, so LONG = SHORT = 1024 (with no LONG loop)
would be a good practical value, except for the annoying problem that
the 3-way optimization doesn't work so well for normal power-of-2 buffer
sizes.  LONG = SHORT = 1024 is perfect for buffer sizes of a multiple of
3K.  It does 4K as 3K + 1K, with the 3K part optimized and the 1K part
unoptimized and taking as long as the 3K part.  LONG = 1024 with
SHORT = 64 does the residual 1K at almost full speed except for another
residual of 192.
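For reference (context that the diff mostly elides, quoted roughly from
the file), the LONG inner loop that these sizes tune looks like this on
the amd64 side, with align = 8.  LONG bounds how many crc32q's each of
the three independent streams executes before the streams have to be
recombined:

	while (len >= LONG * 3) {
		crc1 = 0;
		crc2 = 0;
		end = next + LONG;
		do {
			/* Three independent crc streams, LONG bytes apart. */
			crc0 = _mm_crc32_u64(crc0, *(const uint64_t *)next);
			crc1 = _mm_crc32_u64(crc1,
			    *(const uint64_t *)(next + LONG));
			crc2 = _mm_crc32_u64(crc2,
			    *(const uint64_t *)(next + 2 * LONG));
			next += align;
		} while (next < end);
		/* ... combine crc0/crc1/crc2; see the hunks below ... */
		next += LONG * 2;
		len -= LONG * 3;
	}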
X
X  /* Tables for hardware crc that shift a crc by LONG and SHORT zeros. */
X  static uint32_t crc32c_long[4][256];
X +static uint32_t crc32c_2long[4][256];
X  static uint32_t crc32c_short[4][256];
X +static uint32_t crc32c_2short[4][256];
X
X  /*
X   * Multiply a matrix times a vector over the Galois field of two elements,
X @@ -171,7 +199,9 @@
X  crc32c_init_hw(void)
X  {
X  	crc32c_zeros(crc32c_long, LONG);
X +	crc32c_zeros(crc32c_2long, 2 * LONG);
X  	crc32c_zeros(crc32c_short, SHORT);
X +	crc32c_zeros(crc32c_2short, 2 * SHORT);
X  }
X  #ifdef _KERNEL
X  SYSINIT(crc32c_sse42, SI_SUB_LOCK, SI_ORDER_ANY, crc32c_init_hw, NULL);
X @@ -190,7 +220,11 @@
X  	const size_t align = 4;
X  #endif
X  	const unsigned char *next, *end;
X -	uint64_t crc0, crc1, crc2;	/* need to be 64 bits for crc32q */
X +#ifdef __amd64__
X +	uint64_t crc0, crc1, crc2;
X +#else
X +	uint32_t crc0, crc1, crc2;
X +#endif
X
X  	next = buf;
X  	crc0 = crc;

The type system doesn't work well here.  The crc32* instruction returns
a 32-bit value in a register of unclear size.  I think it is zero
extended to full width, but my asms aren't written quite right to say
this and neither are the intrinsics.  The 64-bit intrinsic returns a
64-bit value which just gets in the way.  The above has to declare the
type as uint64_t for amd64, else pessimizations result.  That gives
smaller pessimizations, mainly with gcc, for expressions like
(crc0 >> 24) where the 64-bit declaration doesn't make it clear that the
top bits are zero, so extra zeroing operations may be necessary.  I have
barely tested the 32-bit case, but the 64-bit type is clearly wasted
then (but easier to optimize away since no crc intrinsic returns it).
LONG and SHORT should be about half as large in the 32-bit case, since
they should scale with the number of crc32 operations in the inner loop.
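Something like the following (an untested sketch, not part of the patch;
the _u64_32 name is made up) would let the asm itself say that the top
bits are zero, so the accumulators could stay uint32_t even on amd64:

	static __inline uint32_t
	_mm_crc32_u64_32(uint32_t x, uint64_t y)
	{
		/*
		 * Sketch: crc32q only reads the low 32 bits of its
		 * destination and zeroes bits 63:32 on output, and a
		 * 32-bit write already zero-extends on amd64, so a
		 * 32-bit "+r" output is consistent.  "%q0" makes
		 * gcc/clang print the 64-bit name of the register
		 * holding operand 0, as crc32q requires.
		 */
		__asm("crc32q %1,%q0" : "+r" (x) : "r" (y));
		return (x);
	}

Callers would have to change too, since the intrinsic-compatible name
returns uint64_t.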
X @@ -202,6 +236,7 @@
X  		len--;
X  	}
X
X +#if LONG > SHORT
X  	/*
X  	 * Compute the crc on sets of LONG*3 bytes, executing three independent
X  	 * crc instructions, each on LONG bytes -- this is optimized for the
X @@ -209,6 +244,7 @@
X  	 * have a throughput of one crc per cycle, but a latency of three
X  	 * cycles.
X  	 */
X +	crc = 0;
X  	while (len >= LONG * 3) {
X  		crc1 = 0;
X  		crc2 = 0;
X @@ -229,16 +265,64 @@
X  #endif
X  			next += align;
X  		} while (next < end);
X -		crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
X -		crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
X +		/*-
X +		 * Update the crc.  Try to do it in parallel with the inner
X +		 * loop.  'crc' is used to accumulate crc0 and crc1
X +		 * produced by the inner loop so that the next iteration
X +		 * of the loop doesn't depend on anything except crc2.
X +		 *
X +		 * The full expression for the update is:
X +		 *     crc = S*S*S*crc + S*S*crc0 + S*crc1
X +		 * where the terms are polynomials modulo the CRC polynomial.
X +		 * We regroup this subtly as:
X +		 *     crc = S*S * (S * crc + crc0) + S*crc1.
X +		 * This has an extra dependency which reduces possible
X +		 * parallelism for the expression, but it turns out to be
X +		 * best to intentionally delay evaluation of this expression
X +		 * so that it competes less with the inner loop.
X +		 *
X +		 * We also intentionally reduce parallelism by feeding back
X +		 * crc2 to the inner loop as crc0 instead of accumulating
X +		 * it in crc.  This synchronizes the loop with crc update.
X +		 * CPU and/or compiler schedulers produced bad order without
X +		 * this.
X +		 *
X +		 * Shifts take about 12 cycles each, so 3 here with 2
X +		 * parallelizable take about 24 cycles and the crc update
X +		 * takes slightly longer.  8 dependent crc32 instructions
X +		 * can run in 24 cycles, so the 3-way blocking is worse
X +		 * than useless for sizes less than 8 * <word size> = 64
X +		 * on amd64.  In practice, SHORT = 32 confirms these
X +		 * timing calculations by giving a small improvement
X +		 * starting at size 96.  Then the inner loop takes about
X +		 * 12 cycles and the crc update about 24, but these are
X +		 * partly in parallel so the total time is less than the
X +		 * 36 cycles that 12 dependent crc32 instructions would
X +		 * take.
X +		 *
X +		 * To have a chance of completely hiding the overhead for
X +		 * the crc update, the inner loop must take considerably
X +		 * longer than 24 cycles.  LONG = 64 makes the inner loop
X +		 * take about 24 cycles, so is not quite large enough.
X +		 * LONG = 128 works OK.  Unhideable overheads are about
X +		 * 12 cycles per inner loop.  All assuming timing like
X +		 * Haswell.
X +		 */
X +		crc = crc32c_shift(crc32c_long, crc) ^ crc0;
X +		crc1 = crc32c_shift(crc32c_long, crc1);
X +		crc = crc32c_shift(crc32c_2long, crc) ^ crc1;
X +		crc0 = crc2;
X  		next += LONG * 2;
X  		len -= LONG * 3;
X  	}
X +	crc0 ^= crc;
X +#endif /* LONG > SHORT */

See the comment.  Optimizing this was similar to optimizing polynomial
expressions in libm.  Higher powers of S are best avoided, since the
parallelism for combining results is limited.

X
X  	/*
X  	 * Do the same thing, but now on SHORT*3 blocks for the remaining data
X  	 * less than a LONG*3 block
X  	 */
X +	crc = 0;
X  	while (len >= SHORT * 3) {
X  		crc1 = 0;
X  		crc2 = 0;
X @@ -259,11 +343,14 @@
X  #endif
X  			next += align;
X  		} while (next < end);
X -		crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
X -		crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
X +		crc = crc32c_shift(crc32c_short, crc) ^ crc0;
X +		crc1 = crc32c_shift(crc32c_short, crc1);
X +		crc = crc32c_shift(crc32c_2short, crc) ^ crc1;
X +		crc0 = crc2;

A simpler method that almost works is:

	crc0 = crc32c_shift(crc32c_2short, crc0);
	crc1 = crc32c_shift(crc32c_short, crc1);
	/* crc2 unchanged, and crc not used. */

These updates run in parallel, so the inner loop only has to wait half
as long to restart.  With the old SHORT = 256, the timing was approx.
96 cycles for the inner loop, then 24 cycles for the CRC update (33%
overhead).  This simpler change reduces the overhead to 17%.  With the
new LONG = 128, the inner loop takes about 48 cycles and the update
about 24, mostly in parallel so that they have zero overhead except for
the last iteration.  Compiler "optimizations" of rearranging this code
tend to break the parallelism.

X  		next += SHORT * 2;
X  		len -= SHORT * 3;
X  	}
X +	crc0 ^= crc;
X
X  	/* Compute the crc on the remaining bytes at native word size. */
X  	end = next + (len - (len & (align - 1)));
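For anyone who wants to time variations like the above from userland, a
minimal harness along these lines should do (a sketch, not part of the
patch; it assumes the sse42_crc32c()/crc32c_init_hw() names in
crc32_sse42.c and pulls the file in via USERSPACE_TESTING, so no -msse4
is needed with the asm wrappers above):

	/* harness.c: cc -O2 -o crctest harness.c && ./crctest */
	#define USERSPACE_TESTING
	#include "crc32_sse42.c"

	#include <stdio.h>
	#include <string.h>
	#include <time.h>

	int
	main(void)
	{
		static unsigned char buf[3 * 1024];	/* multiple of 3K: best case */
		uint32_t crc;
		clock_t t0, t1;
		double secs;
		int i, n;

		memset(buf, 0x5a, sizeof(buf));
		crc32c_init_hw();		/* fill the shift tables */
		crc = 0;
		n = 100000;
		t0 = clock();
		for (i = 0; i < n; i++)
			crc = sse42_crc32c(crc, buf, sizeof(buf));
		t1 = clock();
		secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
		printf("crc 0x%08x: %.1f MB/s\n", crc,
		    (double)sizeof(buf) * n / secs / 1e6);
		return (0);
	}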
Bruce