Date: Mon, 30 Mar 2015 11:50:57 -0400 From: John Baldwin <jhb@freebsd.org> To: Bruce Evans <brde@optusnet.com.au> Cc: Konstantin Belousov <kostikbel@gmail.com>, svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org Subject: Re: svn commit: r280279 - head/sys/sys Message-ID: <17035816.lxyzYKiOWV@ralph.baldwin.cx> In-Reply-To: <20150322080015.O955@besplex.bde.org> References: <201503201027.t2KAR6Ze053047@svn.freebsd.org> <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote: > On Sat, 21 Mar 2015, John Baldwin wrote: > > > On 3/21/15 12:35 PM, Konstantin Belousov wrote: > >> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote: > >>> On 3/20/15 9:02 AM, Konstantin Belousov wrote: > >>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote: > >>>>> Author: jhb > >>>>> Date: Fri Mar 20 10:27:06 2015 > >>>>> New Revision: 280279 > >>>>> URL: https://svnweb.freebsd.org/changeset/base/280279 > >>>>> > >>>>> Log: > >>>>> Expand the bitcount* API to support 64-bit integers, plain ints and longs > >>>>> and create a "hidden" API that can be used in other system headers without > >>>>> adding namespace pollution. > >>>>> - If the POPCNT instruction is enabled at compile time, use > >>>>> __builtin_popcount*() to implement __bitcount*(), otherwise fall back > >>>>> to software implementations. > >>>> Are you aware of the Haswell errata HSD146 ? I see the described behaviour > >>>> on machines back to SandyBridge, but not on Nehalems. > >>>> HSD146. POPCNT Instruction May Take Longer to Execute Than Expected > >>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be > >>>> delayed until previous non-dependent instructions have executed. > >>>> > >>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by > >>>> xoring the dst register. I have some patch for amd64 pmap, see the end > >>>> of the message. > >>> > >>> No, I was not aware, but I think it's hard to fix this anywhere but the > >>> compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang > >>> uses POPCOUNT for this function from ACPI-CA: > >>> > >>> static UINT8 > >>> AcpiRsCountSetBits ( > >>> UINT16 BitField) > >>> { > >>> UINT8 BitsSet; > >>> > >>> > >>> ACPI_FUNCTION_ENTRY (); > >>> > >>> > >>> for (BitsSet = 0; BitField; BitsSet++) > >>> { > >>> /* Zero the least significant bit that is set */ > >>> > >>> BitField &= (UINT16) (BitField - 1); > >>> } > >>> > >>> return (BitsSet); > >>> } > >>> > >>> (I ran into this accidentally because a kernel built on my system failed > >>> to boot in older qemu because the kernel paniced with an illegal instruction > >>> fault in this function.) > > Does it do the same for the similar home made popcount in pmap?: Yes: ffffffff807658d4: f6 04 25 46 e2 d6 80 testb $0x80,0xffffffff80d6e246 ffffffff807658db: 80 ffffffff807658dc: 74 32 je ffffffff80765910 <pmap_demote_pde_locked+0x4d0> ffffffff807658de: 48 89 4d b8 mov %rcx,-0x48(%rbp) ffffffff807658e2: f3 48 0f b8 4d b8 popcnt -0x48(%rbp),%rcx ffffffff807658e8: 48 8b 50 20 mov 0x20(%rax),%rdx ffffffff807658ec: 48 89 55 b0 mov %rdx,-0x50(%rbp) ffffffff807658f0: f3 48 0f b8 55 b0 popcnt -0x50(%rbp),%rdx ffffffff807658f6: 01 ca add %ecx,%edx ffffffff807658f8: 48 8b 48 28 mov 0x28(%rax),%rcx ffffffff807658fc: 48 89 4d a8 mov %rcx,-0x58(%rbp) ffffffff80765900: f3 48 0f b8 4d a8 popcnt -0x58(%rbp),%rcx ffffffff80765906: eb 1b jmp ffffffff80765923 <pmap_demote_pde_locked+0x4e3> ffffffff80765908: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) ffffffff8076590f: 00 ffffffff80765910: f3 48 0f b8 c9 popcnt %rcx,%rcx ffffffff80765915: f3 48 0f b8 50 20 popcnt 0x20(%rax),%rdx ffffffff8076591b: 01 ca add %ecx,%edx ffffffff8076591d: f3 48 0f b8 48 28 popcnt 0x28(%rax),%rcx ffffffff80765923: 01 d1 add %edx,%ecx It also uses popcnt for this in blist_fill() and blist_meta_fill(): 742 /* Count the number of blocks we're about to allocate */ 743 bitmap = scan->u.bmu_bitmap & mask; 744 for (nblks = 0; bitmap != 0; nblks++) 745 bitmap &= bitmap - 1; > Always using new API would lose the micro-optimizations given by the runtime > decision for default CFLAGS (used by distributions for portability). To > keep them, it seems best to keep the inline asm but replace > popcnt_pc_map_elem(elem) by __bitcount64(elem). -mno-popcount can then > be used to work around slowness in the software (that is actually > hardware) case. I'm not sure if bitcount64() is strictly better than the loop in this case even though it is O(1) given the claimed nature of the values in the comment. -- John Baldwin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?17035816.lxyzYKiOWV>