From owner-svn-src-head@freebsd.org Tue Jan 31 18:48:17 2017
From: Conrad Meyer <cem@freebsd.org>
Date: Tue, 31 Jan 2017 09:15:13 -0800
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
To: Bruce Evans
Cc: src-committers, svn-src-all@freebsd.org, svn-src-head@freebsd.org
In-Reply-To: <20170201005009.E2504@besplex.bde.org>
List-Id: SVN commit messages for the src tree for head/-current

On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote:
> On Tue, 31 Jan 2017, Bruce Evans wrote:
>
> Unrolling (or not) may be helpful or harmful for entry and exit code.

Helpful, per my earlier benchmarks.

> I think there should be no alignment on entry -- just assume the
> buffer is aligned in the usual case, and only run 4% slower when it is
> misaligned.

Please write such a patch and demonstrate the improvement.

> The exit code handles up to SHORT * 3 = 768 bytes, not up to 4 or 8
> bytes or up to 3 times that like simpler algorithms.
> 768 is quite large, and the exit code is quite slow. It reduces 8 or
> 4 bytes at a time without any dependency reduction, and then 1 byte at
> a time.

Yes, this is the important loop to unroll for small inputs. Somehow with
the unrolling, it is only ~19% slower than the by-3 algorithm on my
system -- not 66%. Clang 3.9.1 unrolls both of these trailing loops;
here is the first:

   0x0000000000401b88 <+584>:  cmp    $0x38,%rbx
   0x0000000000401b8c <+588>:  jae    0x401b93
   0x0000000000401b8e <+590>:  mov    %rsi,%rdx
   0x0000000000401b91 <+593>:  jmp    0x401be1
   0x0000000000401b93 <+595>:  lea    -0x1(%rdi),%rbx
   0x0000000000401b97 <+599>:  sub    %rdx,%rbx
   0x0000000000401b9a <+602>:  mov    %rsi,%rdx
   0x0000000000401b9d <+605>:  nopl   (%rax)
   0x0000000000401ba0 <+608>:  crc32q (%rdx),%rax
   0x0000000000401ba6 <+614>:  crc32q 0x8(%rdx),%rax
   0x0000000000401bad <+621>:  crc32q 0x10(%rdx),%rax
   0x0000000000401bb4 <+628>:  crc32q 0x18(%rdx),%rax
   0x0000000000401bbb <+635>:  crc32q 0x20(%rdx),%rax
   0x0000000000401bc2 <+642>:  crc32q 0x28(%rdx),%rax
   0x0000000000401bc9 <+649>:  crc32q 0x30(%rdx),%rax
   0x0000000000401bd0 <+656>:  crc32q 0x38(%rdx),%rax
   0x0000000000401bd7 <+663>:  add    $0x40,%rdx
   0x0000000000401bdb <+667>:  add    $0x8,%rbx
   0x0000000000401bdf <+671>:  jne    0x401ba0

> I don't understand the algorithm for joining crcs -- why doesn't it
> work to reduce to 12 or 24 bytes in the main loop?

It would, but I haven't implemented or tested that. You're welcome to do
so and demonstrate an improvement. It does add more lookup table bloat,
but perhaps we could just remove the 3x8k table -- I'm not sure it adds
any benefit over the 3x256 table.

> Your benchmarks mainly give results for the <= 768 bytes where most of
> the manual optimizations don't apply.
0x000400: asm:68   intrins:62   multitable:684 (ns per buf)
0x000800: asm:132  intrins:133  (ns per buf)
0x002000: asm:449  intrins:446  (ns per buf)
0x008000: asm:1501 intrins:1497 (ns per buf)
0x020000: asm:5618 intrins:5609 (ns per buf)

(All routines are in a separate compilation unit with no full-program
optimization, as they are in the kernel.)

> Compiler optimizations are more likely to help there. So I looked more
> closely at the last 2 loops.
>
> clang indeed only unrolls the last one,

Not in 3.9.1.

> only for the unreachable case with more than 8 bytes on amd64.

How is it unreachable?

Best,
Conrad