Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 30 Mar 2015 11:50:57 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Konstantin Belousov <kostikbel@gmail.com>, svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r280279 - head/sys/sys
Message-ID:  <17035816.lxyzYKiOWV@ralph.baldwin.cx>
In-Reply-To: <20150322080015.O955@besplex.bde.org>
References:  <201503201027.t2KAR6Ze053047@svn.freebsd.org> <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
> On Sat, 21 Mar 2015, John Baldwin wrote:
> 
> > On 3/21/15 12:35 PM, Konstantin Belousov wrote:
> >> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
> >>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
> >>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
> >>>>> Author: jhb
> >>>>> Date: Fri Mar 20 10:27:06 2015
> >>>>> New Revision: 280279
> >>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
> >>>>>
> >>>>> Log:
> >>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
> >>>>>   and create a "hidden" API that can be used in other system headers without
> >>>>>   adding namespace pollution.
> >>>>>   - If the POPCNT instruction is enabled at compile time, use
> >>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
> >>>>>     to software implementations.
> >>>> Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
> >>>> on machines back to SandyBridge, but not on Nehalems.
> >>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
> >>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
> >>>> delayed until previous non-dependent instructions have executed.
> >>>>
> >>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
> >>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
> >>>> of the message.
> >>>
> >>> No, I was not aware, but I think it's hard to fix this anywhere but the
> >>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
> >>> uses POPCOUNT for this function from ACPI-CA:
> >>>
> >>> static UINT8
> >>> AcpiRsCountSetBits (
> >>>     UINT16                  BitField)
> >>> {
> >>>     UINT8                   BitsSet;
> >>>
> >>>
> >>>     ACPI_FUNCTION_ENTRY ();
> >>>
> >>>
> >>>     for (BitsSet = 0; BitField; BitsSet++)
> >>>     {
> >>>         /* Zero the least significant bit that is set */
> >>>
> >>>         BitField &= (UINT16) (BitField - 1);
> >>>     }
> >>>
> >>>     return (BitsSet);
> >>> }
> >>>
> >>> (I ran into this accidentally because a kernel built on my system failed
> >>> to boot in older qemu because the kernel paniced with an illegal instruction
> >>> fault in this function.)
> 
> Does it do the same for the similar home made popcount in pmap?:

Yes:

ffffffff807658d4:       f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
ffffffff807658db:       80 
ffffffff807658dc:       74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
ffffffff807658de:       48 89 4d b8             mov    %rcx,-0x48(%rbp)
ffffffff807658e2:       f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
ffffffff807658e8:       48 8b 50 20             mov    0x20(%rax),%rdx
ffffffff807658ec:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
ffffffff807658f0:       f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
ffffffff807658f6:       01 ca                   add    %ecx,%edx
ffffffff807658f8:       48 8b 48 28             mov    0x28(%rax),%rcx
ffffffff807658fc:       48 89 4d a8             mov    %rcx,-0x58(%rbp)
ffffffff80765900:       f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
ffffffff80765906:       eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
ffffffff80765908:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
ffffffff8076590f:       00 
ffffffff80765910:       f3 48 0f b8 c9          popcnt %rcx,%rcx
ffffffff80765915:       f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
ffffffff8076591b:       01 ca                   add    %ecx,%edx
ffffffff8076591d:       f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
ffffffff80765923:       01 d1                   add    %edx,%ecx

It also uses popcnt for this in blist_fill() and blist_meta_fill():

742             /* Count the number of blocks we're about to allocate */
743             bitmap = scan->u.bmu_bitmap & mask;
744             for (nblks = 0; bitmap != 0; nblks++)
745                     bitmap &= bitmap - 1;

> Always using new API would lose the micro-optimizations given by the runtime
> decision for default CFLAGS (used by distributions for portability).  To
> keep them, it seems best to keep the inline asm but replace
> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
> be used to work around slowness in the software (that is actually
> hardware) case.

I'm not sure if bitcount64() is strictly better than the loop in this case
even though it is O(1) given the claimed nature of the values in the comment.

-- 
John Baldwin



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?17035816.lxyzYKiOWV>