From owner-svn-src-all@freebsd.org  Wed Feb  1 03:48:02 2017
Return-Path: <owner-svn-src-all@freebsd.org>
Delivered-To: svn-src-all@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1DB61CC9886;
 Wed,  1 Feb 2017 03:48:02 +0000 (UTC)
 (envelope-from cse.cem@gmail.com)
Received: from mail-wm0-f66.google.com (mail-wm0-f66.google.com [74.125.82.66])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id C1B44E71;
 Wed,  1 Feb 2017 03:48:01 +0000 (UTC)
 (envelope-from cse.cem@gmail.com)
Received: by mail-wm0-f66.google.com with SMTP id u63so2759024wmu.2;
 Tue, 31 Jan 2017 19:48:01 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:reply-to:in-reply-to:references
 :from:date:message-id:subject:to:cc:content-transfer-encoding;
 bh=WWiyHxt5HVthivQqQ/DK2Et9fAgy86UG3XdpnRvUvt0=;
 b=JjNYEpMdJsxAraZZWJY6hhtGpfkXS5p9rmDm7iMINUSai1TAvj+8wSzNIfhojBnD6p
 UibOjcAz8X7G1q7sdb9ePsgCTc25kRcQKNJrBxWJCbvkIj3+rhRFcqMpIKDOv7aiMJc5
 1jEqb4yi7RYT90+fvmhqcmS+B2PxhqRoA2tptVIp2q7DxnM1/XeTB6PoLDHy/OfaG9Zt
 R/YdFbZV4t8qMXJjocJNMk6pDgWJQi9B+p0zIHKJO2z5pffc0hHXKqO6DbfT6ZkFTcXK
 n5/rBHTBHbVzaCJJXLfisaKVv1Lwcwdd7HZYYdCXn6gMxcnXTGNStRy/Cnjw0+cHM3Hz
 CUog==
X-Gm-Message-State: AIkVDXLtgPTXxy0ZLun9fhFnvlGm8Ck3jQGwHhG43EoIeUu8xPNmMC/DO2e/+A30R5XV6w==
X-Received: by 10.223.165.1 with SMTP id i1mr517215wrb.82.1485920873992;
 Tue, 31 Jan 2017 19:47:53 -0800 (PST)
Received: from mail-wm0-f53.google.com (mail-wm0-f53.google.com.
 [74.125.82.53])
 by smtp.gmail.com with ESMTPSA id k70sm27194589wmc.3.2017.01.31.19.47.53
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Tue, 31 Jan 2017 19:47:53 -0800 (PST)
Received: by mail-wm0-f53.google.com with SMTP id v77so18495256wmv.0;
 Tue, 31 Jan 2017 19:47:53 -0800 (PST)
X-Received: by 10.223.174.183 with SMTP id y52mr592705wrc.112.1485920873590;
 Tue, 31 Jan 2017 19:47:53 -0800 (PST)
MIME-Version: 1.0
Reply-To: cem@freebsd.org
Received: by 10.194.22.42 with HTTP; Tue, 31 Jan 2017 19:47:53 -0800 (PST)
In-Reply-To: <20170201123838.X1974@besplex.bde.org>
References: <201701310326.v0V3QW30024375@repo.freebsd.org>
 <20170131153411.G1061@besplex.bde.org>
 <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com>
 <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org>
 <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
 <20170201123838.X1974@besplex.bde.org>
From: Conrad Meyer <cem@freebsd.org>
Date: Tue, 31 Jan 2017 19:47:53 -0800
X-Gmail-Original-Message-ID: <CAG6CVpX0Y1PSODjnr2aY+ybM3rKfr-KF3qAMXGpG669fYG7WXQ@mail.gmail.com>
Message-ID: <CAG6CVpX0Y1PSODjnr2aY+ybM3rKfr-KF3qAMXGpG669fYG7WXQ@mail.gmail.com>
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern
 sys/libkern/x86 sys/sys tests/sys/kern
To: Bruce Evans <brde@optusnet.com.au>
Cc: src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, 
 svn-src-head@freebsd.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all/>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 01 Feb 2017 03:48:02 -0000

On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans <brde@optusnet.com.au> wrote:
> Another reply to this...
>
> On Tue, 31 Jan 2017, Conrad Meyer wrote:
>
>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
>>>
>>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>>> I think there should be no alignment on entry -- just assume the buffer is
>>> aligned in the usual case, and only run 4% slower when it is misaligned.
>>
>>
>> Please write such a patch and demonstrate the improvement.
>
>
> It is easy to demonstrate.  I just put #if 0 around the early alignment
> code.  The results seem too good to be true, so maybe I missed some
> later dependency on alignment of the addresses:
> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
>   alignment and 1.02 seconds without alignment.
> This actually makes sense: 128 bytes can be done with 16 8-byte unaligned
> crc32q's.  The alignment code makes it do 15 * 8-byte and (5 + 3) * 1-byte.
> 7 more 3-cycle instructions and overhead too is far more than the cost
> of letting the CPU do read-combining.
> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for
>   10g).

I believe it, especially for newer amd64.  I seem to recall that older
x86 machines had a higher misalignment penalty, but it was largely
reduced in (?)Nehalem.  Why don't you go ahead and commit that change?
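
The operation counts quoted above (16 steps without the alignment
prologue versus 15 + 5 + 3 = 23 with it, for a 128-byte buffer
misaligned by 3) can be reproduced with a small userland sketch.  This
is not the kernel code: sw_crc32c_u8() and sw_crc32c_u64() are
hypothetical bitwise stand-ins for the crc32b/crc32q instructions, the
two driver functions are made up for illustration, and the usual
~0 initial value and final inversion are folded in only so the result
can be checked against the standard CRC32C test vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Result of one run: the CRC and how many crc32b/crc32q-like steps it took. */
struct crc_run {
	uint32_t crc;
	unsigned ops;
};

/* Bitwise CRC32C step over one byte (reflected polynomial 0x82F63B78).
 * Hypothetical software stand-in for the crc32b instruction. */
static uint32_t
sw_crc32c_u8(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
	return (crc);
}

/* Stand-in for crc32q: consume 8 bytes, least significant byte first. */
static uint32_t
sw_crc32c_u64(uint32_t crc, uint64_t v)
{
	for (int i = 0; i < 8; i++)
		crc = sw_crc32c_u8(crc, (uint8_t)(v >> (8 * i)));
	return (crc);
}

/* Assemble a little-endian 64-bit value from any address, aligned or not. */
static uint64_t
load_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return (v);
}

/* Variant 1: byte steps until p is 8-byte aligned, then 8-byte steps. */
static struct crc_run
crc_with_prologue(const unsigned char *p, size_t len)
{
	struct crc_run r = { 0xFFFFFFFFu, 0 };

	assert(p != NULL || len == 0);
	for (; len > 0 && ((uintptr_t)p & 7) != 0; p++, len--, r.ops++)
		r.crc = sw_crc32c_u8(r.crc, *p);
	for (; len >= 8; p += 8, len -= 8, r.ops++)
		r.crc = sw_crc32c_u64(r.crc, load_le64(p));
	for (; len > 0; p++, len--, r.ops++)
		r.crc = sw_crc32c_u8(r.crc, *p);
	r.crc ^= 0xFFFFFFFFu;
	return (r);
}

/* Variant 2: no prologue; 8-byte steps start at p whatever its alignment. */
static struct crc_run
crc_no_prologue(const unsigned char *p, size_t len)
{
	struct crc_run r = { 0xFFFFFFFFu, 0 };

	assert(p != NULL || len == 0);
	for (; len >= 8; p += 8, len -= 8, r.ops++)
		r.crc = sw_crc32c_u64(r.crc, load_le64(p));
	for (; len > 0; p++, len--, r.ops++)
		r.crc = sw_crc32c_u8(r.crc, *p);
	r.crc ^= 0xFFFFFFFFu;
	return (r);
}
```

Calling both variants on 128 bytes starting 3 bytes past an 8-byte
boundary should give the same CRC, with ops of 23 and 16 respectively.
The sketch only counts steps; it says nothing about how the hardware
handles the unaligned loads themselves.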

>> perhaps we could just remove the 3x8k table -- I'm not sure
>> it adds any benefit over the 3x256 table.
>
>
> Good idea, but the big table is useful.  Ifdefing out the LONG case reduces
> the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the
> setup below.  Ifdefing out the SHORT case only reduces to ~0.39 seconds.

Interesting.

> I hoped that an even shorter SHORT case would work.  I think it now handles
> 768 bytes (3 * SHORT) in the inner loop.

Right.

> That is 32 sets of 3 crc32q's,
> and I would have thought that update at the end would take about as long
> as 1 iteration (3%), but it apparently takes 33%.

The update at the end may be faster with PCLMULQDQ.  There are
versions of this algorithm that use that in place of the lookup-table
combine (for example, Linux has a permissively licensed implementation
here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S
).

Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most
of the time in kernel mode.  It could be used opportunistically if the
thread is already in FPU-save mode or if the input is "big enough" to
make it worth it.
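
A minimal sketch of what that opportunistic choice might look like,
purely for illustration.  Every name here is hypothetical (enum
crc_combine, pick_combine(), PCLMUL_MIN_LEN), the 4096-byte cutoff is
a placeholder rather than a measured value, and the caller is assumed
to already know whether the CPU has PCLMULQDQ and whether the current
thread has its FPU state saved.

```c
#include <stdbool.h>
#include <stddef.h>

/* Which combine step to use at the end of the 3-way crc32q loop. */
enum crc_combine {
	COMBINE_TABLE,		/* lookup-table combine, no FPU state needed */
	COMBINE_PCLMUL		/* PCLMULQDQ combine, needs FPU state */
};

/* Hypothetical cutoff for "big enough"; it would have to be benchmarked. */
#define	PCLMUL_MIN_LEN	4096

static enum crc_combine
pick_combine(size_t len, bool have_pclmul, bool fpu_already_saved)
{
	/* Without the instruction there is nothing to choose. */
	if (!have_pclmul)
		return (COMBINE_TABLE);
	/* FPU state is already saved, so using it costs nothing extra. */
	if (fpu_already_saved)
		return (COMBINE_PCLMUL);
	/* Otherwise pay for the FPU save only when the input is large. */
	if (len >= PCLMUL_MIN_LEN)
		return (COMBINE_PCLMUL);
	return (COMBINE_TABLE);
}
```

The order of the checks is the point: the length threshold only
matters when the FPU save would be an extra cost.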

>>> Your benchmarks mainly give results for the <= 768 bytes where most of
>>> the manual optimizations don't apply.
>>
>>
>> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
>> 0x000800: asm:132 intrins:133  (ns per buf)
>> 0x002000: asm:449 intrins:446  (ns per buf)
>> 0x008000: asm:1501 intrins:1497  (ns per buf)
>> 0x020000: asm:5618 intrins:5609  (ns per buf)
>>
>> (All routines are in a separate compilation unit with no full-program
>> optimization, as they are in the kernel.)
>
>
> These seem slow.  I modified my program to test the actual kernel code,
> and get for 10gB on freefall's Xeon (main times in seconds):

Freefall has a Haswell Xeon @ 3.3GHz.  My laptop is a Sandy Bridge
Core i5 @ 2.6 GHz.  That may help explain the difference.
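
As a rough sanity check on that explanation, the ns-per-buffer figures
can be converted into the "seconds per 10g" form used in the quoted
timings.  The sketch below assumes that "10g" means 10^10 bytes, that
throughput scales linearly with the nominal clock (ignoring turbo and
memory effects), and it uses only the 0x020000 (128 KiB) row; the
helper names are made up for illustration.

```c
/* Seconds to checksum total_bytes, given a measured time per buffer. */
static double
seconds_for_total(double ns_per_buf, double buf_bytes, double total_bytes)
{
	return ((total_bytes / buf_bytes) * ns_per_buf * 1e-9);
}

/* Naive rescaling of a time from one nominal clock speed to another. */
static double
scale_by_clock(double seconds, double from_ghz, double to_ghz)
{
	return (seconds * from_ghz / to_ghz);
}
```

With 5618 ns per 0x020000-byte buffer, seconds_for_total(5618.0,
131072.0, 1e10) comes to about 0.43 seconds on the 2.6 GHz machine,
and scale_by_clock() of that from 2.6 to 3.3 GHz gives about 0.34
seconds.  That is in the same range as the ~0.35 seconds quoted above
for large buffers, if that figure is from the 3.3 GHz machine.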

Best,
Conrad