Date: Tue, 31 Jan 2017 09:15:13 -0800
From: Conrad Meyer <cem@freebsd.org>
To: Bruce Evans <brde@optusnet.com.au>
Cc: src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID: <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
In-Reply-To: <20170201005009.E2504@besplex.bde.org>
References: <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org>

On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
> Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

> I think there should be no alignment on entry -- just assume the buffer is
> aligned in the usual case, and only run 4% slower when it is misaligned.

Please write such a patch and demonstrate the improvement.

> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms.  768 is quite
> large, and the exit code is quite slow.  It reduces 8 or 4 bytes at a
> time without any dependency reduction, and then 1 byte at a time.

Yes, this is the important loop to unroll for small inputs.  Somehow
with the unrolling, it is only ~19% slower than the by-3 algorithm on
my system -- not 66%.  Clang 3.9.1 unrolls both of these trailing
loops; here is the first:

0x0000000000401b88 <+584>:   cmp    $0x38,%rbx
0x0000000000401b8c <+588>:   jae    0x401b93 <sse42_crc32c+595>
0x0000000000401b8e <+590>:   mov    %rsi,%rdx
0x0000000000401b91 <+593>:   jmp    0x401be1 <sse42_crc32c+673>
0x0000000000401b93 <+595>:   lea    -0x1(%rdi),%rbx
0x0000000000401b97 <+599>:   sub    %rdx,%rbx
0x0000000000401b9a <+602>:   mov    %rsi,%rdx
0x0000000000401b9d <+605>:   nopl   (%rax)
0x0000000000401ba0 <+608>:   crc32q (%rdx),%rax
0x0000000000401ba6 <+614>:   crc32q 0x8(%rdx),%rax
0x0000000000401bad <+621>:   crc32q 0x10(%rdx),%rax
0x0000000000401bb4 <+628>:   crc32q 0x18(%rdx),%rax
0x0000000000401bbb <+635>:   crc32q 0x20(%rdx),%rax
0x0000000000401bc2 <+642>:   crc32q 0x28(%rdx),%rax
0x0000000000401bc9 <+649>:   crc32q 0x30(%rdx),%rax
0x0000000000401bd0 <+656>:   crc32q 0x38(%rdx),%rax
0x0000000000401bd7 <+663>:   add    $0x40,%rdx
0x0000000000401bdb <+667>:   add    $0x8,%rbx
0x0000000000401bdf <+671>:   jne    0x401ba0 <sse42_crc32c+608>

> I don't understand the algorithm for joining crcs -- why doesn't it work
> to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that.  You're welcome to
do so and demonstrate an improvement.  It does add more lookup table
bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
it adds any benefit over the 3x256 table.

> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.

0x000400: asm:68 intrins:62 multitable:684 (ns per buf)
0x000800: asm:132 intrins:133 (ns per buf)
0x002000: asm:449 intrins:446 (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x020000: asm:5618 intrins:5609 (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

> Compiler optimizations are more likely to help there.  So I looked
> more closely at the last 2 loops.  clang indeed only unrolls the last
> one,

Not in 3.9.1.

> only for the unreachable case with more than 8 bytes on amd64.

How is it unreachable?

Best,
Conrad
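
For reference, the shape of the trailing ("exit") loop discussed in this
message, 8 bytes at a time and then 1 byte at a time, can be sketched in
portable C.  This is only a minimal illustration under stated
assumptions, not the committed sse42_crc32c code: crc32c_u8 and
crc32c_u64 are hypothetical software stand-ins for the SSE4.2
crc32b/crc32q instructions, and crc32c_sketch is an assumed name.  Each
step consumes the crc produced by the previous step, which is the serial
dependency that the tail does nothing to break up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical software stand-in for the crc32b instruction: fold one
 * byte into a reflected CRC-32C state (polynomial 0x82F63B78).
 */
static uint32_t
crc32c_u8(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return (crc);
}

/*
 * Hypothetical software stand-in for crc32q: fold an 8-byte word,
 * least significant byte first (matching a little-endian load).
 */
static uint32_t
crc32c_u64(uint32_t crc, uint64_t w)
{
	for (int i = 0; i < 8; i++)
		crc = crc32c_u8(crc, (uint8_t)(w >> (8 * i)));
	return (crc);
}

/*
 * Tail loop: 8 bytes per step, then 1 byte per step.  Every step
 * depends on the crc from the step before it.
 */
static uint32_t
crc32c_tail(uint32_t crc, const unsigned char *p, size_t len)
{
	while (len >= 8) {
		uint64_t w = 0;

		/* Assemble the word explicitly so host endianness is irrelevant. */
		for (int i = 0; i < 8; i++)
			w |= (uint64_t)p[i] << (8 * i);
		crc = crc32c_u64(crc, w);
		p += 8;
		len -= 8;
	}
	while (len > 0) {
		crc = crc32c_u8(crc, *p++);
		len--;
	}
	return (crc);
}

/* Full CRC-32C of a buffer, with the usual pre- and post-inversion. */
uint32_t
crc32c_sketch(const void *buf, size_t len)
{
	return (crc32c_tail(0xFFFFFFFFu, buf, len) ^ 0xFFFFFFFFu);
}
```

Calling crc32c_sketch("123456789", 9) yields the standard CRC-32C check
value 0xE3069283, taking one 8-byte step and one 1-byte step.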