From owner-svn-src-head@freebsd.org Tue Jan 31 18:48:17 2017
From: Conrad Meyer <cem@freebsd.org>
Date: Tue, 31 Jan 2017 09:15:13 -0800
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
To: Bruce Evans
Cc: src-committers, svn-src-all@freebsd.org, svn-src-head@freebsd.org
In-Reply-To: <20170201005009.E2504@besplex.bde.org>
List-Id: SVN commit messages for the src tree for head/-current

On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
>
> Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

> I think there should be no alignment on entry -- just assume the
> buffer is aligned in the usual case, and only run 4% slower when it is
> misaligned.

Please write such a patch and demonstrate the improvement.

> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms.
> 768 is quite large, and the exit code is quite slow. It reduces 8 or
> 4 bytes at a time without any dependency reduction, and then 1 byte at
> a time.

Yes, this is the important loop to unroll for small inputs. Somehow with
the unrolling, it is only ~19% slower than the by-3 algorithm on my
system -- not 66%. Clang 3.9.1 unrolls both of these trailing loops;
here is the first:

   0x0000000000401b88 <+584>:  cmp    $0x38,%rbx
   0x0000000000401b8c <+588>:  jae    0x401b93
   0x0000000000401b8e <+590>:  mov    %rsi,%rdx
   0x0000000000401b91 <+593>:  jmp    0x401be1
   0x0000000000401b93 <+595>:  lea    -0x1(%rdi),%rbx
   0x0000000000401b97 <+599>:  sub    %rdx,%rbx
   0x0000000000401b9a <+602>:  mov    %rsi,%rdx
   0x0000000000401b9d <+605>:  nopl   (%rax)
   0x0000000000401ba0 <+608>:  crc32q (%rdx),%rax
   0x0000000000401ba6 <+614>:  crc32q 0x8(%rdx),%rax
   0x0000000000401bad <+621>:  crc32q 0x10(%rdx),%rax
   0x0000000000401bb4 <+628>:  crc32q 0x18(%rdx),%rax
   0x0000000000401bbb <+635>:  crc32q 0x20(%rdx),%rax
   0x0000000000401bc2 <+642>:  crc32q 0x28(%rdx),%rax
   0x0000000000401bc9 <+649>:  crc32q 0x30(%rdx),%rax
   0x0000000000401bd0 <+656>:  crc32q 0x38(%rdx),%rax
   0x0000000000401bd7 <+663>:  add    $0x40,%rdx
   0x0000000000401bdb <+667>:  add    $0x8,%rbx
   0x0000000000401bdf <+671>:  jne    0x401ba0

> I don't understand the algorithm for joining crcs -- why doesn't it
> work to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that. You're welcome to do
so and demonstrate an improvement. It does add more lookup table bloat,
but perhaps we could just remove the 3x8k table -- I'm not sure it adds
any benefit over the 3x256 table.

> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.
0x000400: asm:68   intrins:62   multitable:684 (ns per buf)
0x000800: asm:132  intrins:133  (ns per buf)
0x002000: asm:449  intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x020000: asm:5618 intrins:5609 (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

> Compiler optimizations are more likely to help there. So I looked more
> closely at the last 2 loops.
>
> clang indeed only unrolls the last one,

Not in 3.9.1.

> only for the unreachable case with more than 8 bytes on amd64.

How is it unreachable?

Best,
Conrad