From: John Baldwin
To: Bruce Evans
Cc: Konstantin Belousov, svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r280279 - head/sys/sys
Date: Mon, 30 Mar 2015 11:50:57 -0400
Message-ID: <17035816.lxyzYKiOWV@ralph.baldwin.cx>
In-Reply-To: <20150322080015.O955@besplex.bde.org>
References: <201503201027.t2KAR6Ze053047@svn.freebsd.org> <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org>
List-Id: SVN commit messages for the src tree for head/-current

On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
> On Sat, 21 Mar 2015, John Baldwin wrote:
>
> > On 3/21/15
> > 12:35 PM, Konstantin Belousov wrote:
> >> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
> >>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
> >>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
> >>>>> Author: jhb
> >>>>> Date: Fri Mar 20 10:27:06 2015
> >>>>> New Revision: 280279
> >>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
> >>>>>
> >>>>> Log:
> >>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
> >>>>>   and create a "hidden" API that can be used in other system headers without
> >>>>>   adding namespace pollution.
> >>>>>   - If the POPCNT instruction is enabled at compile time, use
> >>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
> >>>>>     to software implementations.
> >>>> Are you aware of the Haswell errata HSD146 ?  I see the described behaviour
> >>>> on machines back to SandyBridge, but not on Nehalems.
> >>>> HSD146.   POPCNT Instruction May Take Longer to Execute Than Expected
> >>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
> >>>> delayed until previous non-dependent instructions have executed.
> >>>>
> >>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
> >>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
> >>>> of the message.
> >>>
> >>> No, I was not aware, but I think it's hard to fix this anywhere but the
> >>> compiler.
> >>> I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
> >>> uses POPCOUNT for this function from ACPI-CA:
> >>>
> >>> static UINT8
> >>> AcpiRsCountSetBits (
> >>>     UINT16                  BitField)
> >>> {
> >>>     UINT8                   BitsSet;
> >>>
> >>>
> >>>     ACPI_FUNCTION_ENTRY ();
> >>>
> >>>
> >>>     for (BitsSet = 0; BitField; BitsSet++)
> >>>     {
> >>>         /* Zero the least significant bit that is set */
> >>>
> >>>         BitField &= (UINT16) (BitField - 1);
> >>>     }
> >>>
> >>>     return (BitsSet);
> >>> }
> >>>
> >>> (I ran into this accidentally because a kernel built on my system failed
> >>> to boot in older qemu because the kernel paniced with an illegal instruction
> >>> fault in this function.)
>
> Does it do the same for the similar home made popcount in pmap?:

Yes:

ffffffff807658d4:  f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
ffffffff807658db:  80
ffffffff807658dc:  74 32                   je     ffffffff80765910
ffffffff807658de:  48 89 4d b8             mov    %rcx,-0x48(%rbp)
ffffffff807658e2:  f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
ffffffff807658e8:  48 8b 50 20             mov    0x20(%rax),%rdx
ffffffff807658ec:  48 89 55 b0             mov    %rdx,-0x50(%rbp)
ffffffff807658f0:  f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
ffffffff807658f6:  01 ca                   add    %ecx,%edx
ffffffff807658f8:  48 8b 48 28             mov    0x28(%rax),%rcx
ffffffff807658fc:  48 89 4d a8             mov    %rcx,-0x58(%rbp)
ffffffff80765900:  f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
ffffffff80765906:  eb 1b                   jmp    ffffffff80765923
ffffffff80765908:  0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
ffffffff8076590f:  00
ffffffff80765910:  f3 48 0f b8 c9          popcnt %rcx,%rcx
ffffffff80765915:  f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
ffffffff8076591b:  01 ca                   add    %ecx,%edx
ffffffff8076591d:  f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
ffffffff80765923:  01 d1                   add    %edx,%ecx

It also uses popcnt for this in blist_fill() and blist_meta_fill():

	/* Count the number of blocks we're about to allocate */
	bitmap = scan->u.bmu_bitmap & mask;
	for (nblks = 0; bitmap != 0; nblks++)
		bitmap &= bitmap - 1;

> Always using new API would lose the
> micro-optimizations given by the runtime
> decision for default CFLAGS (used by distributions for portability).  To
> keep them, it seems best to keep the inline asm but replace
> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
> be used to work around slowness in the software (that is actually
> hardware) case.

I'm not sure if bitcount64() is strictly better than the loop in this case
even though it is O(1) given the claimed nature of the values in the comment.

-- 
John Baldwin
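
For reference, the trade-off in the last paragraph can be sketched in a few lines of C. This is only an illustration: popcount_loop(), popcount_swar() and popcount_checked() are hypothetical names, the SWAR routine is a textbook method rather than the actual __bitcount64() from the commit, and __builtin_popcountll() assumes GCC or clang. The loop runs once per set bit, so on sparse bitmaps it may not be slower than the constant-time version.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Loop form, as in the blist and ACPI-CA excerpts quoted above:
 * one iteration per set bit, so cheap when few bits are set.
 */
static int
popcount_loop(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;	/* clear the least significant set bit */
	return (n);
}

/*
 * Constant-time SWAR form.  Hypothetical name; a textbook method, not
 * necessarily what __bitcount64() does.
 */
static int
popcount_swar(uint64_t x)
{
	/* Sum adjacent bits into 2-bit fields, then 4-bit, then bytes. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* The multiply accumulates all byte sums into the top byte. */
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}

/*
 * Cross-check both against the compiler builtin (GCC/clang assumed).
 * The builtin is emitted as a POPCNT instruction only when the
 * compile-time target allows it.
 */
static int
popcount_checked(uint64_t x)
{
	int n = popcount_loop(x);

	assert(n == popcount_swar(x));
	assert(n == __builtin_popcountll((unsigned long long)x));
	return (n);
}
```

All three agree on every input; they differ only in cost. The loop's cost follows the number of set bits, the SWAR form is a fixed handful of operations, and the builtin depends on what the compiler is allowed to emit for the target.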