Date: Mon, 13 Apr 2015 12:06:46 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: Alan Cox <alc@rice.edu>
Cc: "src-committers@freebsd.org" <src-committers@freebsd.org>, John Baldwin <jhb@freebsd.org>, "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>, "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>, Bruce Evans <brde@optusnet.com.au>, Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: svn commit: r280279 - head/sys/sys
Message-ID: <CAJ-VmokhJoq293dxv8Ugnn%2Ba=0r=WyVmxjQbNs_NwstHaGoYFA@mail.gmail.com>
In-Reply-To: <552BFEB2.8040407@rice.edu>
References: <201503201027.t2KAR6Ze053047@svn.freebsd.org> <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org> <17035816.lxyzYKiOWV@ralph.baldwin.cx> <552BFEB2.8040407@rice.edu>
Hi,

These CPUs are supposed to have loop unwinder / streaming hardware. Is
it not unwinding/streaming this loop for us?

-a

On 13 April 2015 at 10:36, Alan Cox <alc@rice.edu> wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>>   adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>>     to software implementations.
>>>>>>> Are you aware of the Haswell errata HSD146 ? I see the described behaviour
>>>>>>> on machines back to SandyBridge, but not on Nehalems.
>>>>>>> HSD146. POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
>>>>>>> xoring the dst register. I have some patch for amd64 pmap, see the end
>>>>>>> of the message.
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>>> compiler.
>>>>>> I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>>>>> uses POPCNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>>> to boot in older qemu because the kernel panicked with an illegal instruction
>>>>>> fault in this function.)
>>> Does it do the same for the similar home made popcount in pmap?:
>> Yes:
>>
>> ffffffff807658d4:  f6 04 25 46 e2 d6 80   testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db:  80
>> ffffffff807658dc:  74 32                  je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
>> ffffffff807658de:  48 89 4d b8            mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2:  f3 48 0f b8 4d b8      popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8:  48 8b 50 20            mov    0x20(%rax),%rdx
>> ffffffff807658ec:  48 89 55 b0            mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0:  f3 48 0f b8 55 b0      popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6:  01 ca                  add    %ecx,%edx
>> ffffffff807658f8:  48 8b 48 28            mov    0x28(%rax),%rcx
>> ffffffff807658fc:  48 89 4d a8            mov    %rcx,-0x58(%rbp)
>> ffffffff80765900:  f3 48 0f b8 4d a8      popcnt -0x58(%rbp),%rcx
>> ffffffff80765906:  eb 1b                  jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
>> ffffffff80765908:  0f 1f 84 00 00 00 00   nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f:  00
>> ffffffff80765910:  f3 48 0f b8 c9         popcnt %rcx,%rcx
>> ffffffff80765915:  f3 48 0f b8 50 20      popcnt 0x20(%rax),%rdx
>> ffffffff8076591b:  01 ca                  add    %ecx,%edx
>> ffffffff8076591d:  f3 48 0f b8 48 28      popcnt 0x28(%rax),%rcx
>> ffffffff80765923:  01 d1                  add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>> 742     /* Count the number of blocks we're about to allocate */
>> 743     bitmap = scan->u.bmu_bitmap & mask;
>> 744     for (nblks = 0; bitmap != 0; nblks++)
>> 745             bitmap &= bitmap - 1;
>>
>>> Always using new API would lose the micro-optimizations given by the runtime
>>> decision for default CFLAGS (used by distributions for portability). To
>>> keep them, it seems best to keep the inline asm but replace
>>> popcnt_pc_map_elem(elem) by __bitcount64(elem). -mno-popcnt can then
>>> be used to work around slowness in the software (that is actually
>>> hardware) case.
>> I'm not sure if bitcount64() is strictly better than the loop in this case
>> even though it is O(1) given the claimed nature of the values in the comment.
>>
>
>
> I checked. Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop. Using bitcount64, reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine. In
> contrast, with the simple loop, it takes on average 4507 cycles. Even
> though bitcount64 is a lot larger than the simple loop, we do the 3 bit
> count operations many times in a loop, so the extra i-cache misses are
> being made up for by the repeated execution of the faster code.
>
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it.
> That's rather silly:
>
>     3570:  48 8b 48 18            mov    0x18(%rax),%rcx
>     3574:  f6 04 25 00 00 00 00   testb  $0x80,0x0
>     357b:  80
>     357c:  74 42                  je     35c0 <pmap_demote_pde_locked+0x2f0>
>     357e:  48 89 4d b8            mov    %rcx,-0x48(%rbp)
>     3582:  31 c9                  xor    %ecx,%ecx
>     3584:  f3 48 0f b8 4d b8      popcnt -0x48(%rbp),%rcx
>     358a:  48 8b 50 20            mov    0x20(%rax),%rdx
>     358e:  48 89 55 b0            mov    %rdx,-0x50(%rbp)
>     3592:  31 d2                  xor    %edx,%edx
>     3594:  f3 48 0f b8 55 b0      popcnt -0x50(%rbp),%rdx
>     359a:  01 ca                  add    %ecx,%edx
>     359c:  48 8b 48 28            mov    0x28(%rax),%rcx
>     35a0:  48 89 4d a8            mov    %rcx,-0x58(%rbp)
>     35a4:  31 c9                  xor    %ecx,%ecx
>     35a6:  f3 48 0f b8 4d a8      popcnt -0x58(%rbp),%rcx
>     35ac:  01 d1                  add    %edx,%ecx
>     35ae:  e9 12 01 00 00         jmpq   36c5 <pmap_demote_pde_locked+0x3f5>
>
> Caveat: I'm still using clang 3.5. Maybe the newer clang doesn't spill?
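
For reference, the two software approaches compared in the quoted text can be sketched in portable C as below. This is only an illustration: the names popcount_loop(), popcount_swar() and popcount_agree() are made up for this sketch, and the branch-free version is the usual SWAR bit trick, which may not match the in-tree bitcount64() line for line. The loop form is the same idiom as AcpiRsCountSetBits() and the blist code quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Loop form (hypothetical name): each pass clears the lowest set bit,
 * so the loop body runs once per set bit in x.
 */
static inline int
popcount_loop(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}

/*
 * Branch-free SWAR form (hypothetical name): sums bits in 2-bit, then
 * 4-bit, then 8-bit fields, and adds the eight byte counts with one
 * multiply.  Fixed cost regardless of how many bits are set.
 */
static inline int
popcount_swar(uint64_t x)
{
	x -= (x >> 1) & 0x5555555555555555ULL;
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}

/* Sanity helper (hypothetical): both forms must agree on any input. */
static inline void
popcount_agree(uint64_t x)
{
	assert(popcount_loop(x) == popcount_swar(x));
}
```

Both functions compile without POPCNT enabled. With a CPUTYPE that enables POPCNT, the quoted disassembly shows clang replacing the loop idiom with the instruction, so a build like that would need to be checked separately.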