From owner-svn-src-all@FreeBSD.ORG  Mon Apr 13 19:06:47 2015
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id D56C5B69;
 Mon, 13 Apr 2015 19:06:47 +0000 (UTC)
Received: from mail-ie0-x231.google.com (mail-ie0-x231.google.com
 [IPv6:2607:f8b0:4001:c03::231])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9590F80F;
 Mon, 13 Apr 2015 19:06:47 +0000 (UTC)
Received: by iebmp1 with SMTP id mp1so73331826ieb.0;
 Mon, 13 Apr 2015 12:06:46 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=SqvHlCeNAZoZHEQkuduzUIXpAz+7GZTCXlvqKuXyjmk=;
 b=yEDxRinQjltKPNx7xHSgjylBeooWPn1YHr1pngnFb2wSEKc+9Wc4u7XKd+cjoNAExs
 i7VQweX8XCHztz2zdq1FIX2i5YaIeRx4AXLwM5mf7/xqjKg5q71bYRCAEQfbbs/IypH3
 jeUWxbl0Mph/Y7lkVhiytsMfPMXCer0Mv1WHqm7o5lmvAPPzF+sXDCh36SwEGC0awhrI
 C40XEwnDnI7IabgrFsEg0EhZ9nRK2KfdZsfPCdTdme2pRKX+sNKkv5OAinRkVtcRNTl9
 w+ulnVbLo6e7zv86t4r2GVx+6FsOathgDl9/wGQhzgDRXB5P2bHKnLaTUIaYr0pQ6CQT
 9bCA==
MIME-Version: 1.0
X-Received: by 10.43.163.129 with SMTP id mo1mr20686133icc.61.1428952006216;
 Mon, 13 Apr 2015 12:06:46 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.36.17.194 with HTTP; Mon, 13 Apr 2015 12:06:46 -0700 (PDT)
In-Reply-To: <552BFEB2.8040407@rice.edu>
References: <201503201027.t2KAR6Ze053047@svn.freebsd.org>
 <550DA656.5060004@FreeBSD.org>
 <20150322080015.O955@besplex.bde.org>
 <17035816.lxyzYKiOWV@ralph.baldwin.cx> <552BFEB2.8040407@rice.edu>
Date: Mon, 13 Apr 2015 12:06:46 -0700
X-Google-Sender-Auth: IKfXdRveN1SnY95Jw2NKmMKU1g0
Message-ID: <CAJ-VmokhJoq293dxv8Ugnn+a=0r=WyVmxjQbNs_NwstHaGoYFA@mail.gmail.com>
Subject: Re: svn commit: r280279 - head/sys/sys
From: Adrian Chadd <adrian@freebsd.org>
To: Alan Cox <alc@rice.edu>
Content-Type: text/plain; charset=UTF-8
Cc: "src-committers@freebsd.org" <src-committers@freebsd.org>,
 John Baldwin <jhb@freebsd.org>,
 "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>,
 "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>,
 Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kostikbel@gmail.com>
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for &quot;
 user&quot; and &quot; projects&quot; \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all/>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
 <mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 13 Apr 2015 19:06:48 -0000

Hi,

These CPUs are supposed to have loop unwinder / streaming hardware. Is
it not unwinding/streaming this loop for us?


-a


On 13 April 2015 at 10:36, Alan Cox <alc@rice.edu> wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>>   adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>>     to software implementations.
>>>>>>> Are you aware of the Haswell erratum HSD146?  I see the described behaviour
>>>>>>> on machines back to SandyBridge, but not on Nehalems.
>>>>>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround by
>>>>>>> xoring the dst register.  I have a patch for amd64 pmap, see the end
>>>>>>> of the message.
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>>>>> uses POPCNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>>> to boot in older qemu because the kernel panicked with an illegal instruction
>>>>>> fault in this function.)
>>> Does it do the same for the similar home made popcount in pmap?:
>> Yes:
>>
>> ffffffff807658d4:       f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db:       80
>> ffffffff807658dc:       74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
>> ffffffff807658de:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8:       48 8b 50 20             mov    0x20(%rax),%rdx
>> ffffffff807658ec:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6:       01 ca                   add    %ecx,%edx
>> ffffffff807658f8:       48 8b 48 28             mov    0x28(%rax),%rcx
>> ffffffff807658fc:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
>> ffffffff80765900:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
>> ffffffff80765906:       eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
>> ffffffff80765908:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f:       00
>> ffffffff80765910:       f3 48 0f b8 c9          popcnt %rcx,%rcx
>> ffffffff80765915:       f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
>> ffffffff8076591b:       01 ca                   add    %ecx,%edx
>> ffffffff8076591d:       f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
>> ffffffff80765923:       01 d1                   add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>>              /* Count the number of blocks we're about to allocate */
>>              bitmap = scan->u.bmu_bitmap & mask;
>>              for (nblks = 0; bitmap != 0; nblks++)
>>                      bitmap &= bitmap - 1;
>>
>>> Always using the new API would lose the micro-optimizations given by the runtime
>>> decision for default CFLAGS (used by distributions for portability).  To
>>> keep them, it seems best to keep the inline asm but replace
>>> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcnt can then
>>> be used to work around slowness in the software (that is actually
>>> hardware) case.
>> I'm not sure if bitcount64() is strictly better than the loop in this case,
>> even though it is O(1), given the claimed nature of the values in the comment.
>>
>
>
> I checked.  Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop.  Using bitcount64, reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine.  In
> contrast, with the simple loop, it takes on average 4507 cycles.  Even
> though bitcount64 is a lot larger than the simple loop, we do the 3 bit
> count operations many times in a loop, so the extra i-cache misses are
> being made up for by the repeated execution of the faster code.
>
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.  That's rather silly:
>
>     3570:       48 8b 48 18             mov    0x18(%rax),%rcx
>     3574:       f6 04 25 00 00 00 00    testb  $0x80,0x0
>     357b:       80
>     357c:       74 42                   je     35c0 <pmap_demote_pde_locked+0x2f0>
>     357e:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
>     3582:       31 c9                   xor    %ecx,%ecx
>     3584:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
>     358a:       48 8b 50 20             mov    0x20(%rax),%rdx
>     358e:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
>     3592:       31 d2                   xor    %edx,%edx
>     3594:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
>     359a:       01 ca                   add    %ecx,%edx
>     359c:       48 8b 48 28             mov    0x28(%rax),%rcx
>     35a0:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
>     35a4:       31 c9                   xor    %ecx,%ecx
>     35a6:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
>     35ac:       01 d1                   add    %edx,%ecx
>     35ae:       e9 12 01 00 00          jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
>
> Caveat: I'm still using clang 3.5.  Maybe the newer clang doesn't spill?
>
>
>
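
For reference, here is a minimal sketch of the three ways of counting bits
that the quoted thread compares. The function names are hypothetical, and the
branch-free version is the common SWAR technique, which may not match the
exact code behind __bitcount64() in head/sys/sys. The builtin only becomes a
single POPCNT instruction when the compiler is told the target supports it
(for example through CPUTYPE); otherwise it falls back to software.

```c
#include <stdint.h>

/*
 * Clear-lowest-set-bit loop, as in the quoted ACPI-CA, pmap and blist code.
 * It runs one iteration per set bit, so the trip count depends on the data.
 */
static inline int
loop_bitcount64(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;	/* zero the least significant set bit */
	return (n);
}

/*
 * Branch-free SWAR count: a fixed number of operations for any input.
 * Hypothetical stand-in for a software bitcount64(); assumption, not the
 * in-tree implementation.
 */
static inline int
swar_bitcount64(uint64_t x)
{
	x = x - ((x >> 1) & 0x5555555555555555ULL);		/* 2-bit sums */
	x = (x & 0x3333333333333333ULL) +
	    ((x >> 2) & 0x3333333333333333ULL);			/* 4-bit sums */
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;		/* 8-bit sums */
	return ((int)((x * 0x0101010101010101ULL) >> 56));	/* add bytes */
}

/* GCC/clang builtin; emits POPCNT only if the target enables it. */
static inline int
builtin_bitcount64(uint64_t x)
{
	return (__builtin_popcountll(x));
}
```

All three return the same value for any input; they differ only in how the
cost scales with the number of set bits and in code size, which is the
trade-off the cycle counts above are measuring.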