Date: Wed, 1 Feb 2017 11:44:58 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer <cem@freebsd.org>
Cc: Bruce Evans <brde@optusnet.com.au>, src-committers <src-committers@freebsd.org>,
 svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID: <20170201101029.R1617@besplex.bde.org>
In-Reply-To: <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
References: <201701310326.v0V3QW30024375@repo.freebsd.org>
 <20170131153411.G1061@besplex.bde.org>
 <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>
 <20170131175309.N1418@besplex.bde.org>
 <20170201005009.E2504@besplex.bde.org>
 <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>> Unrolling (or not) may be helpful or harmful for entry and exit code.
>
> Helpful, per my earlier benchmarks.
>
>> I
>> think there should be no alignment on entry -- just assume the buffer is
>> aligned in the usual case, and only run 4% slower when it is misaligned.
>
> Please write such a patch and demonstrate the improvement.

I now understand the algorithm. The division into 3 has to keep
sequentiality (since CRC input is a bit string), so it doesn't work to
separate the memory accesses of the 3 crc32's by 8 bytes as in my simple
test -- they have to be separated by large amounts. Then recombining
requires multiplying the polynomials associated with the CRCs of the
higher 2 blocks by X^N and reducing again, where N is large and related
to the block size. This is done using a large table for each N needed,
and to keep things reasonably simple, only 2 N's are used.

>> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
>> bytes or up to 3 times that like simpler algorithms. 768 is quite
>> large, and the exit code is quite slow. It reduces 8 or 4 bytes at a
>> time without any dependency reduction, and then 1 byte at a time.
>
> Yes, this is the important loop to unroll for small inputs. Somehow

Not like clang does it. Unrolling is useless without the 3-way blocking.

> with the unrolling, it is only ~19% slower than the by-3 algorithm on
> my system -- not 66%. Clang 3.9.1 unrolls both of these trailing
> loops; here is the first:

Maybe 3.9.1 pessimizes the 3-way loop by unrolling it. This would be
fairly easy to do. Just replicate the 3 crc32q's a few times, say 8, and
do them in a bad order (3 blocks of 8 dependent ones instead of 8 blocks
of 3 independent ones). With enough replication, the code would be too
large for the hardware to reorder.
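
To make the 3-way blocking and the recombination step described above concrete, here is a minimal portable sketch. It is not the code from the commit: `crc_step`, `crc_serial`, `crc_shift` and `crc_3way` are hypothetical names, the crc32 instruction is replaced by a bitwise software step, and the multiplication by X^N is done by feeding zero bytes rather than by table lookup. It only illustrates why three independent accumulators over widely separated blocks can be merged into the CRC of the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u	/* reflected Castagnoli polynomial */

/* Software stand-in for one crc32b step (assumption: bitwise, not SSE4.2). */
static uint32_t
crc_step(uint32_t crc, unsigned char byte)
{
	crc ^= byte;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
	return (crc);
}

/* Plain serial CRC: every step depends on the previous one. */
static uint32_t
crc_serial(uint32_t crc, const unsigned char *p, size_t len)
{
	for (size_t i = 0; i < len; i++)
		crc = crc_step(crc, p[i]);
	return (crc);
}

/*
 * Multiply the CRC register by X^(8 * nbytes) and reduce, done here by
 * feeding zero bytes.  Hypothetical helper: slow but easy to check; a
 * table per block size would replace this loop in a fast version.
 */
static uint32_t
crc_shift(uint32_t crc, size_t nbytes)
{
	while (nbytes-- > 0)
		crc = crc_step(crc, 0);
	return (crc);
}

/*
 * CRC of 3 * blk bytes using three accumulators.  Each loop trip does
 * 3 steps that do not depend on each other (the "good order").
 */
static uint32_t
crc_3way(uint32_t crc, const unsigned char *p, size_t blk)
{
	uint32_t c0 = crc, c1 = 0, c2 = 0;

	assert(p != NULL || blk == 0);
	for (size_t i = 0; i < blk; i++) {
		c0 = crc_step(c0, p[i]);
		c1 = crc_step(c1, p[i + blk]);
		c2 = crc_step(c2, p[i + 2 * blk]);
	}
	/* Shift the earlier blocks past the later ones, then XOR. */
	return (crc_shift(c0, 2 * blk) ^ crc_shift(c1, blk) ^ c2);
}
```

For any buffer of 3 * blk bytes, `crc_3way(init, buf, blk)` should equal `crc_serial(init, buf, 3 * blk)`, because the CRC step is linear over GF(2). With an initial value of 0xFFFFFFFF and a final inversion, `crc_serial` over the 9 bytes "123456789" gives the standard CRC-32C check value 0xE3069283.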
Inline asm has another advantage here -- volatile on it prevents
reordering and might prevent unrolling. Maybe 3.9.1 unpessimizes the
inline asms. But I suspect not getting the 3 times speedup is for
another reason.

>
> 0x0000000000401b88 <+584>: cmp $0x38,%rbx
> 0x0000000000401b8c <+588>: jae 0x401b93 <sse42_crc32c+595>
> 0x0000000000401b8e <+590>: mov %rsi,%rdx
> 0x0000000000401b91 <+593>: jmp 0x401be1 <sse42_crc32c+673>
> 0x0000000000401b93 <+595>: lea -0x1(%rdi),%rbx
> 0x0000000000401b97 <+599>: sub %rdx,%rbx
> 0x0000000000401b9a <+602>: mov %rsi,%rdx
> 0x0000000000401b9d <+605>: nopl (%rax)
> 0x0000000000401ba0 <+608>: crc32q (%rdx),%rax
> 0x0000000000401ba6 <+614>: crc32q 0x8(%rdx),%rax
> 0x0000000000401bad <+621>: crc32q 0x10(%rdx),%rax
> 0x0000000000401bb4 <+628>: crc32q 0x18(%rdx),%rax
> 0x0000000000401bbb <+635>: crc32q 0x20(%rdx),%rax
> 0x0000000000401bc2 <+642>: crc32q 0x28(%rdx),%rax
> 0x0000000000401bc9 <+649>: crc32q 0x30(%rdx),%rax
> 0x0000000000401bd0 <+656>: crc32q 0x38(%rdx),%rax
> 0x0000000000401bd7 <+663>: add $0x40,%rdx
> 0x0000000000401bdb <+667>: add $0x8,%rbx
> 0x0000000000401bdf <+671>: jne 0x401ba0 <sse42_crc32c+608>

No, this unrolling is useless. The crc32q's are dependent on each other,
so they take 3 cycles each. There are spare resources to run about 12
instructions during that time. Loop control only takes 3.

>> I
>> don't understand the algorithm for joining crcs -- why doesn't it work
>> to reduce to 12 or 24 bytes in the main loop?
>
> It would, but I haven't implemented or tested that. You're welcome to
> do so and demonstrate an improvement. It does add more lookup table
> bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
> it adds any benefit over the 3x256 table.
>
>> Your benchmarks mainly give results for the <= 768 bytes where most of
>> the manual optimizations don't apply.

Actually, they test only the large buffer case.
They used buffer sizes of 1M and 1k and didn't do the entry and exit code
that usually dominates for small buffers.

I re-tested with the correct blocking. This was about 10% slower (0.34 ->
0.37 seconds for 10GB), except that for clang without intrinsics it was
20% slower (0.43 -> 0.51 seconds).

> 0x000400: asm:68 intrins:62 multitable:684 (ns per buf)

I don't see any signs of this in my test:
- a single crc32q in a (C) loop doesn't benefit from unrolling or lose to
  the extra clang instructions without intrinsics. clang-3.9.0 unrolls
  this 8-way in the simpler environment of my test program, but this
  makes no difference.
- similarly for a single crc32b in a loop, except when I forgot to change
  the type of the crc accumulator from uint64_t to uint32_t, gcc was 1
  cycle slower in the loop (4 instead of 3). gcc generates an extra
  instruction to zero-extend the crc, and this is more expensive than
  usual since it gives another dependency. clang optimizes this away.

> 0x000800: asm:132 intrins:133 (ns per buf)
> 0x002000: asm:449 intrins:446 (ns per buf)
> 0x008000: asm:1501 intrins:1497 (ns per buf)
> 0x020000: asm:5618 intrins:5609 (ns per buf)

Now it is mostly in the 3-way optimized case and the differences are in
the noise.

> (All routines are in a separate compilation unit with no full-program
> optimization, as they are in the kernel.)
>
>> Compiler optimizations are more
>> likely to help there. So I looked more closely at the last 2 loops.
>> clang indeed only unrolls the last one,
>
> Not in 3.9.1.

3.9.1 seems to only extend the useless unrolling.

>> only for the unreachable case
>> with more than 8 bytes on amd64.
>
> How is it unreachable?

Because the loop doing 8-byte words at a time reduces the count below 8.
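
The point about the final byte loop can be sketched as follows. This is an illustration only, not the committed code: `step8`, `step64` and `crc_tail` are hypothetical names, and the crc32b/crc32q instructions are replaced by a bitwise software step. The shape is the one discussed above: a loop over 8-byte words followed by a loop over single bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u	/* reflected Castagnoli polynomial */

/* Software stand-in for crc32b: fold one byte into the CRC. */
static uint32_t
step8(uint32_t crc, unsigned char byte)
{
	crc ^= byte;
	for (int k = 0; k < 8; k++)
		crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
	return (crc);
}

/* Software stand-in for crc32q: fold the 8 bytes at p, in memory order. */
static uint32_t
step64(uint32_t crc, const unsigned char *p)
{
	for (int i = 0; i < 8; i++)
		crc = step8(crc, p[i]);
	return (crc);
}

/*
 * Exit code shape: 8 bytes at a time, then 1 byte at a time.
 * *byte_iters reports how many trips the byte loop makes.
 */
static uint32_t
crc_tail(uint32_t crc, const unsigned char *p, size_t len, size_t *byte_iters)
{
	while (len >= 8) {
		crc = step64(crc, p);
		p += 8;
		len -= 8;
	}
	/* The word loop only exits with fewer than 8 bytes left. */
	assert(len < 8);
	*byte_iters = len;
	while (len > 0) {
		crc = step8(crc, *p++);
		len--;
	}
	return (crc);
}
```

Whatever the input length, the byte loop runs `len % 8` times, so an unrolled copy of it that handles 8 or more bytes per trip is never entered when the word loop precedes it.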
Bruce

From owner-svn-src-head@freebsd.org Wed Feb 1 01:25:32 2017
Return-Path: <owner-svn-src-head@freebsd.org>
Delivered-To: svn-src-head@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 363DFCCB437;
 Wed, 1 Feb 2017 01:25:32 +0000 (UTC) (envelope-from jmd@FreeBSD.org)
Received: from repo.freebsd.org (repo.freebsd.org [IPv6:2610:1c1:1:6068::e6a:0])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 05C89275;
 Wed, 1 Feb 2017 01:25:31 +0000 (UTC) (envelope-from jmd@FreeBSD.org)
Received: from repo.freebsd.org ([127.0.1.37])
 by repo.freebsd.org (8.15.2/8.15.2) with ESMTP id v111PVTZ087790;
 Wed, 1 Feb 2017 01:25:31 GMT (envelope-from jmd@FreeBSD.org)
Received: (from jmd@localhost)
 by repo.freebsd.org (8.15.2/8.15.2/Submit) id v111PVR7087789;
 Wed, 1 Feb 2017 01:25:31 GMT (envelope-from jmd@FreeBSD.org)
Message-Id: <201702010125.v111PVR7087789@repo.freebsd.org>
X-Authentication-Warning: repo.freebsd.org: jmd set sender to jmd@FreeBSD.org using -f
From: Johannes M Dieterich <jmd@FreeBSD.org>
Date: Wed, 1 Feb 2017 01:25:31 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r313033 - head/share/misc
X-SVN-Group: head
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-BeenThere: svn-src-head@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: SVN commit messages for the src tree for head/-current <svn-src-head.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-head/>
List-Post: <mailto:svn-src-head@freebsd.org>
List-Help: <mailto:svn-src-head-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-head>,
 <mailto:svn-src-head-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 01 Feb 2017 01:25:32 -0000

Author: jmd (ports committer)
Date: Wed Feb 1 01:25:30 2017
New Revision: 313033
URL: https://svnweb.freebsd.org/changeset/base/313033

Log:
  Add myself (jmd) to committers-ports.dot. Document rene and swills as my
  mentors.

  Reviewed by: rene (mentor)
  Approved by: rene (mentor)
  Differential Revision: https://reviews.freebsd.org/D9393

Modified:
  head/share/misc/committers-ports.dot

Modified: head/share/misc/committers-ports.dot
==============================================================================
--- head/share/misc/committers-ports.dot Wed Feb 1 00:10:29 2017 (r313032)
+++ head/share/misc/committers-ports.dot Wed Feb 1 01:25:30 2017 (r313033)
@@ -125,6 +125,7 @@ jgh [label="Jason Helfman\njgh@FreeBSD.o
 jhale [label="Jason E. Hale\njhale@FreeBSD.org\n2012/09/10"]
 jkim [label="Jung-uk Kim\njkim@FreeBSD.org\n2007/09/12"]
 jlaffaye [label="Julien Laffaye\njlaffaye@FreeBSD.org\n2011/06/06"]
+jmd [label="Johannes M. Dieterich\njmd@FreeBSD.org\n2017/01/09"]
 jmelo [label="Jean Milanez Melo\njmelo@FreeBSD.org\n2006/03/31"]
 joerg [label="Joerg Wunsch\njoerg@FreeBSD.org\n1994/08/22"]
 johans [label="Johan Selst\njohans@FreeBSD.org\n2006/04/01"]
@@ -562,6 +563,7 @@ rene -> bar
 rene -> cmt
 rene -> crees
 rene -> jgh
+rene -> jmd
 rene -> ler
 rene -> olivierd
 
@@ -594,6 +596,7 @@ stas -> araujo
 steve -> netchild
 
 swills -> feld
+swills -> jmd
 swills -> jrm
 swills -> milki
 swills -> pclin