From owner-svn-src-head@FreeBSD.ORG  Mon Apr 13 17:36:54 2015
Message-ID: <552BFEB2.8040407@rice.edu>
Date: Mon, 13 Apr 2015 12:36:50 -0500
From: Alan Cox <alc@rice.edu>
User-Agent: Mozilla/5.0 (X11; FreeBSD i386;
 rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r280279 - head/sys/sys
References: <201503201027.t2KAR6Ze053047@svn.freebsd.org>
 <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org>
 <17035816.lxyzYKiOWV@ralph.baldwin.cx>
In-Reply-To: <17035816.lxyzYKiOWV@ralph.baldwin.cx>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: quoted-printable
Cc: Konstantin Belousov <kostikbel@gmail.com>, svn-src-head@freebsd.org,
 svn-src-all@freebsd.org, src-committers@freebsd.org

On 03/30/2015 10:50, John Baldwin wrote:
> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>
>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>> Author: jhb
>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>> New Revision: 280279
>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>
>>>>>>> Log:
>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>   adding namespace pollution.
>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>     to software implementations.
>>>>>> Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
>>>>>> on machines back to SandyBridge, but not on Nehalems.
>>>>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>
>>>>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
>>>>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
>>>>>> of the message.
>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>>>> uses POPCOUNT for this function from ACPI-CA:
>>>>>
>>>>> static UINT8
>>>>> AcpiRsCountSetBits (
>>>>>     UINT16                  BitField)
>>>>> {
>>>>>     UINT8                   BitsSet;
>>>>>
>>>>>
>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>
>>>>>
>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>     {
>>>>>         /* Zero the least significant bit that is set */
>>>>>
>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>     }
>>>>>
>>>>>     return (BitsSet);
>>>>> }
>>>>>
>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>> to boot in older qemu because the kernel paniced with an illegal instruction
>>>>> fault in this function.)
>> Does it do the same for the similar home made popcount in pmap?:
> Yes:
>
> ffffffff807658d4:       f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
> ffffffff807658db:       80
> ffffffff807658dc:       74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
> ffffffff807658de:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
> ffffffff807658e2:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
> ffffffff807658e8:       48 8b 50 20             mov    0x20(%rax),%rdx
> ffffffff807658ec:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
> ffffffff807658f0:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
> ffffffff807658f6:       01 ca                   add    %ecx,%edx
> ffffffff807658f8:       48 8b 48 28             mov    0x28(%rax),%rcx
> ffffffff807658fc:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
> ffffffff80765900:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
> ffffffff80765906:       eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
> ffffffff80765908:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> ffffffff8076590f:       00
> ffffffff80765910:       f3 48 0f b8 c9          popcnt %rcx,%rcx
> ffffffff80765915:       f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
> ffffffff8076591b:       01 ca                   add    %ecx,%edx
> ffffffff8076591d:       f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
> ffffffff80765923:       01 d1                   add    %edx,%ecx
>
> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>
> 742             /* Count the number of blocks we're about to allocate */
> 743             bitmap = scan->u.bmu_bitmap & mask;
> 744             for (nblks = 0; bitmap != 0; nblks++)
> 745                     bitmap &= bitmap - 1;
>
>> Always using new API would lose the micro-optimizations given by the runtime
>> decision for default CFLAGS (used by distributions for portability).  To
>> keep them, it seems best to keep the inline asm but replace
>> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
>> be used to work around slowness in the software (that is actually
>> hardware) case.
> I'm not sure if bitcount64() is strictly better than the loop in this case
> even though it is O(1) given the claimed nature of the values in the comment.
>


I checked.  Even with zeroes being more common than ones, bitcount64()
is faster than the simple loop.  Using bitcount64(), reserve_pv_entries()
takes on average 4265 cycles during "buildworld" on my test machine.  In
contrast, with the simple loop, it takes on average 4507 cycles.  Even
though bitcount64() is a lot larger than the simple loop, we do the three
bit-count operations many times in a loop, so the extra i-cache misses
are made up for by the repeated execution of the faster code.
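
For reference, a minimal sketch of the two software approaches being
compared.  The function names are made up for illustration, and the
second one is the textbook branch-free (SWAR) form, which is an
assumption here and not necessarily what __bitcount64() in sys/types.h
looks like:

```c
#include <stdint.h>

/*
 * Simple loop: clears the lowest set bit each time round, so the
 * iteration count equals the number of set bits.
 * (Hypothetical name, for illustration only.)
 */
static int
bitcount64_loop(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}

/*
 * Branch-free SWAR count: the same fixed sequence of operations
 * whatever the input.  (Hypothetical name; textbook form.)
 */
static int
bitcount64_swar(uint64_t x)
{
	/* 2-bit partial sums. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	/* 4-bit partial sums. */
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	/* 8-bit partial sums. */
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* Add all eight bytes together into the top byte. */
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}
```

The loop is cheap when few bits are set but branches on its input; the
SWAR form is larger but does a constant amount of work per word.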

However, in the popcnt case, we are spilling the bit map to memory in
order to popcnt it.  That's rather silly:

    3570:       48 8b 48 18             mov    0x18(%rax),%rcx
    3574:       f6 04 25 00 00 00 00    testb  $0x80,0x0
    357b:       80
    357c:       74 42                   je     35c0 <pmap_demote_pde_locked+0x2f0>
    357e:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
    3582:       31 c9                   xor    %ecx,%ecx
    3584:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
    358a:       48 8b 50 20             mov    0x20(%rax),%rdx
    358e:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
    3592:       31 d2                   xor    %edx,%edx
    3594:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
    359a:       01 ca                   add    %ecx,%edx
    359c:       48 8b 48 28             mov    0x28(%rax),%rcx
    35a0:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
    35a4:       31 c9                   xor    %ecx,%ecx
    35a6:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
    35ac:       01 d1                   add    %edx,%ecx
    35ae:       e9 12 01 00 00          jmpq   36c5 <pmap_demote_pde_locked+0x3f5>

Caveat: I'm still using clang 3.5.  Maybe the newer clang doesn't spill?
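
One possible way to keep the operand in a register, sketched for user
space.  This assumes the spill comes from an inline-asm input constraint
that permits a memory operand (such as "rm"); that is an assumption, not
something verified against the pmap source.  popcnt64_reg() is a
hypothetical name, and __builtin_cpu_supports() stands in for the
kernel's own CPU feature test:

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the pmap function.  The input uses an "r"
 * constraint only, so the compiler has to pass it in a register.
 */
static inline uint64_t
popcnt64_reg(uint64_t elem)
{
	uint64_t result;

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	if (__builtin_cpu_supports("popcnt")) {
		/*
		 * Zero the destination first and mark it "+r" so the
		 * zeroing is kept.  This mirrors the xor-the-destination
		 * workaround mentioned for HSD146.
		 */
		result = 0;
		__asm__("popcntq %1,%0" : "+r" (result) : "r" (elem) : "cc");
		return (result);
	}
#endif
	/* Software fallback: clear the lowest set bit until none remain. */
	for (result = 0; elem != 0; result++)
		elem &= elem - 1;
	return (result);
}
```

Whether this actually avoids the store/reload depends on the compiler
version, so the generated code still needs to be checked.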