From owner-svn-src-all@FreeBSD.ORG Mon Apr 13 19:06:47 2015
Sender: adrian.chadd@gmail.com
In-Reply-To: <552BFEB2.8040407@rice.edu>
References: <201503201027.t2KAR6Ze053047@svn.freebsd.org> <550DA656.5060004@FreeBSD.org> <20150322080015.O955@besplex.bde.org> <17035816.lxyzYKiOWV@ralph.baldwin.cx> <552BFEB2.8040407@rice.edu>
Date: Mon, 13 Apr 2015 12:06:46 -0700
Subject: Re: svn commit: r280279 - head/sys/sys
From: 
Adrian Chadd
To: Alan Cox
Content-Type: text/plain; charset=UTF-8
Cc: "src-committers@freebsd.org", John Baldwin, "svn-src-all@freebsd.org", "svn-src-head@freebsd.org", Bruce Evans, Konstantin Belousov
List-Id: "SVN commit messages for the entire src tree (except for "user" and "projects")"

Hi,

These CPUs are supposed to have loop unwinder / streaming hardware. Is
it not unwinding/streaming this loop for us?

-a

On 13 April 2015 at 10:36, Alan Cox wrote:
> On 03/30/2015 10:50, John Baldwin wrote:
>> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>>
>>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>>> Author: jhb
>>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>>> New Revision: 280279
>>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>>
>>>>>>>> Log:
>>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>>   adding namespace pollution.
>>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>>     to software implementations.
>>>>>>> Are you aware of the Haswell errata HSD146 ? I see the described behaviour
>>>>>>> on machines back to SandyBridge, but not on Nehalems.
>>>>>>> HSD146.
>>>>>>> POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>>
>>>>>>> Jilles noted that gcc head and 4.9.2 already provides a workaround by
>>>>>>> xoring the dst register. I have some patch for amd64 pmap, see the end
>>>>>>> of the message.
>>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>>> compiler. I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>>>>> uses POPCOUNT for this function from ACPI-CA:
>>>>>>
>>>>>> static UINT8
>>>>>> AcpiRsCountSetBits (
>>>>>>     UINT16                  BitField)
>>>>>> {
>>>>>>     UINT8                   BitsSet;
>>>>>>
>>>>>>
>>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>>
>>>>>>
>>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>>     {
>>>>>>         /* Zero the least significant bit that is set */
>>>>>>
>>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>>     }
>>>>>>
>>>>>>     return (BitsSet);
>>>>>> }
>>>>>>
>>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>>> to boot in older qemu because the kernel paniced with an illegal instruction
>>>>>> fault in this function.)
>>> Does it do the same for the similar home made popcount in pmap?:
>> Yes:
>>
>> ffffffff807658d4:  f6 04 25 46 e2 d6 80   testb  $0x80,0xffffffff80d6e246
>> ffffffff807658db:  80
>> ffffffff807658dc:  74 32                  je     ffffffff80765910
>> ffffffff807658de:  48 89 4d b8            mov    %rcx,-0x48(%rbp)
>> ffffffff807658e2:  f3 48 0f b8 4d b8      popcnt -0x48(%rbp),%rcx
>> ffffffff807658e8:  48 8b 50 20            mov    0x20(%rax),%rdx
>> ffffffff807658ec:  48 89 55 b0            mov    %rdx,-0x50(%rbp)
>> ffffffff807658f0:  f3 48 0f b8 55 b0      popcnt -0x50(%rbp),%rdx
>> ffffffff807658f6:  01 ca                  add    %ecx,%edx
>> ffffffff807658f8:  48 8b 48 28            mov    0x28(%rax),%rcx
>> ffffffff807658fc:  48 89 4d a8            mov    %rcx,-0x58(%rbp)
>> ffffffff80765900:  f3 48 0f b8 4d a8      popcnt -0x58(%rbp),%rcx
>> ffffffff80765906:  eb 1b                  jmp    ffffffff80765923
>> ffffffff80765908:  0f 1f 84 00 00 00 00   nopl   0x0(%rax,%rax,1)
>> ffffffff8076590f:  00
>> ffffffff80765910:  f3 48 0f b8 c9         popcnt %rcx,%rcx
>> ffffffff80765915:  f3 48 0f b8 50 20      popcnt 0x20(%rax),%rdx
>> ffffffff8076591b:  01 ca                  add    %ecx,%edx
>> ffffffff8076591d:  f3 48 0f b8 48 28      popcnt 0x28(%rax),%rcx
>> ffffffff80765923:  01 d1                  add    %edx,%ecx
>>
>> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>>
>>     /* Count the number of blocks we're about to allocate */
>>     bitmap = scan->u.bmu_bitmap & mask;
>>     for (nblks = 0; bitmap != 0; nblks++)
>>         bitmap &= bitmap - 1;
>>
>>> Always using new API would lose the micro-optimizations given by the runtime
>>> decision for default CFLAGS (used by distributions for portability). To
>>> keep them, it seems best to keep the inline asm but replace
>>> popcnt_pc_map_elem(elem) by __bitcount64(elem). -mno-popcount can then
>>> be used to work around slowness in the software (that is actually
>>> hardware) case.
>> I'm not sure if bitcount64() is strictly better than the loop in this case
>> even though it is O(1) given the claimed nature of the values in the comment.
>>
>
>
> I checked.
> Even with zeroes being more common than ones, bitcount64()
> is faster than the simple loop. Using bitcount64, reserve_pv_entries()
> takes on average 4265 cycles during "buildworld" on my test machine. In
> contrast, with the simple loop, it takes on average 4507 cycles. Even
> though bitcount64 is a lot larger than the simple loop, we do the 3 bit
> count operations many times in a loop, so the extra i-cache misses are
> being made up for by the repeated execution of the faster code.
>
> However, in the popcnt case, we are spilling the bit map to memory in
> order to popcnt it. That's rather silly:
>
> 3570:  48 8b 48 18            mov    0x18(%rax),%rcx
> 3574:  f6 04 25 00 00 00 00   testb  $0x80,0x0
> 357b:  80
> 357c:  74 42                  je     35c0
>
> 357e:  48 89 4d b8            mov    %rcx,-0x48(%rbp)
> 3582:  31 c9                  xor    %ecx,%ecx
> 3584:  f3 48 0f b8 4d b8      popcnt -0x48(%rbp),%rcx
> 358a:  48 8b 50 20            mov    0x20(%rax),%rdx
> 358e:  48 89 55 b0            mov    %rdx,-0x50(%rbp)
> 3592:  31 d2                  xor    %edx,%edx
> 3594:  f3 48 0f b8 55 b0      popcnt -0x50(%rbp),%rdx
> 359a:  01 ca                  add    %ecx,%edx
> 359c:  48 8b 48 28            mov    0x28(%rax),%rcx
> 35a0:  48 89 4d a8            mov    %rcx,-0x58(%rbp)
> 35a4:  31 c9                  xor    %ecx,%ecx
> 35a6:  f3 48 0f b8 4d a8      popcnt -0x58(%rbp),%rcx
> 35ac:  01 d1                  add    %edx,%ecx
> 35ae:  e9 12 01 00 00         jmpq   36c5
>
>
> Caveat: I'm still using clang 3.5. Maybe the newer clang doesn't spill?
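
For reference, the two software approaches compared in the quoted
measurements can be sketched in portable C. This is only an illustration
under stated assumptions: popcount_loop(), popcount_swar() and
popcount_selfcheck() are hypothetical names, and popcount_swar() is the
well-known branch-free (SWAR) bit-counting trick, which may not match the
exact code behind bitcount64() in the tree. The __builtin_popcountll()
check is compiled only where the compiler defines __GNUC__ (gcc, clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simple loop: clears the lowest set bit each pass, so one pass per set bit. */
static int
popcount_loop(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}

/* Branch-free SWAR count: the same fixed work for every input value. */
static int
popcount_swar(uint64_t x)
{
	/* Sum adjacent bits into 2-bit fields, then 4-bit, then 8-bit fields. */
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	/* The multiply adds all eight byte counts into the top byte. */
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}

/* Check that the implementations agree on a few sample values. */
static void
popcount_selfcheck(void)
{
	static const uint64_t samples[] = {
		0, 1, 0x80, 0xffff, 0x8000000000000000ULL,
		0x0123456789abcdefULL, 0xffffffffffffffffULL
	};
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		assert(popcount_loop(samples[i]) == popcount_swar(samples[i]));
#if defined(__GNUC__)
		/* Compiler builtin; may become POPCNT when that is enabled. */
		assert(__builtin_popcountll(samples[i]) ==
		    popcount_swar(samples[i]));
#endif
	}
}
```

The loop's cost grows with the number of set bits, while the SWAR version
costs the same for every input but is larger. Which one wins in practice
depends on bit density and instruction-cache behaviour, which is what the
cycle counts quoted above were measuring.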