From owner-svn-src-head@freebsd.org Thu May 10 17:31:56 2018 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 58F00FD1623; Thu, 10 May 2018 17:31:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 98C516FA1E; Thu, 10 May 2018 17:31:55 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from [192.168.0.102] (c110-21-101-228.carlnfd1.nsw.optusnet.com.au [110.21.101.228]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 2C273424814; Fri, 11 May 2018 03:31:47 +1000 (AEST) Date: Fri, 11 May 2018 03:31:46 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r333461 - head/sys/amd64/amd64 In-Reply-To: <201805101501.w4AF1iI0039082@repo.freebsd.org> Message-ID: <20180511012309.V3949@besplex.bde.org> References: <201805101501.w4AF1iI0039082@repo.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=cIaQihWN c=1 sm=1 tr=0 a=PalzARQSbocsUSjMRkwAPg==:117 a=PalzARQSbocsUSjMRkwAPg==:17 a=kj9zAlcOel0A:10 a=VEp1mD_9WxY-5FT3oJsA:9 a=dqOK8clpalB_RD_S:21 a=AIzsK_yzptf1LuMC:21 a=CjuIK1q_8ugA:10 X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 May 2018 17:31:56 -0000 On Thu, 10 May 2018, Konstantin Belousov wrote: > Log: > Make fpusave() and fpurestore() on amd64 ifuncs. > > From now on, linking amd64 kernel requires either lld or newer ld.bfd. This breaks building with gcc: XX cc1: warnings being treated as errors XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes] XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls] XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes] XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls] XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here After building fpu.o with clang, linking doesn't reqire a newer ld, but booting does -- after linking with an older ld, the boot panics with a page fault near fork(). I used the current as and ld with old gcc until early this year. This is a bit fragile and regressed with lld. The regression was minor -- lld expanded the bss in the data section, giving a few MB of bloat. Now the bloat relative to old ld is only 29K in the text section. Debugging and presumably profiling is also broken: XX db> x/ia fpusususpend,20 XX fpususpend: pushq %rbp XX fpususpend+0x1: movq %rsp,%rbp XX fpususpend+0x4: pushq %rbx XX fpususpend+0x5: pushq %rax XX fpususpend+0x6: movl %cr0,%ebx XX fpususpend+0x9: clts XX fpususpend+0xb: call 0xffffffff80299960 XX fpususpend+0x10: movl %ebx,%cr0 XX fpususpend+0x13: addq $0x8,%rsp XX fpususpend+0x17: popq %rbx XX fpususpend+0x18: popq %rbp XX fpususpend+0x19: ret XX db> x/ia 0xffffffff80299960 XX 0xffffffff80299960: jmpl *0x56769a(%rip) ddb still doesn't understand pc-relative offsets. Decoding this with "p/x 0xffffffff80299960+6" gives brwsection+0x1000. At this address is a pointer to fpusave_xsave since the hardware supports xsave. This looks like a small or null pessimization. The branch in the old version is probably faster. The following simple test on Haswell shows that all reasonable versions have almost the same speed when cached: XX #include XX XX #ifndef FUNC XX #define FUNC cfoo XX #endif XX XX int FUNC(void); XX XX int asmifunctargfoo(void); XX int (*ifuncaddr)(void) = asmifunctargfoo; XX int xsave = 1; XX XX int XX cfoo(void) XX { XX return (xsave != 0 ? 2 : 1); XX } XX XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret"); XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret"); XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)"); XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret"); /* for xsave != 0 */ XX XX int XX main(void) XX { XX volatile int i, res;; XX XX res = 0; XX for (i = 0; i < 1000000000; i++) XX res += FUNC(); XX return (0); XX } Compile this with gcc -o foo foo.c -O -FUNC={cfoo,asmbranchfoo,asmnobranchfoo, asmifuncfoo). cfoo, asmnobranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at 4.08 Ghz) on Haswell. Only asmnobranchfoo is slower -- it takes about 1.78 seconds. That is a whole quarter of a cycle longer (7.25 cycles). The uncached case is hard to test. This loop runs in parallel with itself as far as possible, so the main limit is the number of dependencies and latency for branch prediction might not matter provided it is less than the loop time of 7 cycles. In normal use, probably nothing stays cached because it is rarely called. This also means that optimizations are in the noise. I think the direct branch can be statically predicted better than the indirect branch, and it is easy to arrange using excessive __predict_* optimizations that it is right on some hardware. Bruce