From owner-svn-src-head@freebsd.org Thu May 10 19:38:29 2018 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1C543FD46FD; Thu, 10 May 2018 19:38:29 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7657A6A51F; Thu, 10 May 2018 19:38:28 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id w4AJcG2J003766 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Thu, 10 May 2018 22:38:19 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua w4AJcG2J003766 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id w4AJcGb8003765; Thu, 10 May 2018 22:38:16 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 10 May 2018 22:38:16 +0300 From: Konstantin Belousov To: Bruce Evans Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r333461 - head/sys/amd64/amd64 Message-ID: <20180510193816.GF6887@kib.kiev.ua> References: <201805101501.w4AF1iI0039082@repo.freebsd.org> <20180511012309.V3949@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180511012309.V3949@besplex.bde.org> User-Agent: Mutt/1.9.5 (2018-04-13) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.25 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 10 May 2018 19:38:29 -0000 On Fri, May 11, 2018 at 03:31:46AM +1000, Bruce Evans wrote: > On Thu, 10 May 2018, Konstantin Belousov wrote: > > > Log: > > Make fpusave() and fpurestore() on amd64 ifuncs. > > > > From now on, linking amd64 kernel requires either lld or newer ld.bfd. > > This breaks building with gcc: > > XX cc1: warnings being treated as errors > XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes] > XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls] > XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here > XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes] > XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls] > XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here Yes. Nothing unexpected. > > After building fpu.o with clang, linking doesn't reqire a newer ld, but > booting does -- after linking with an older ld, the boot panics with a > page fault near fork(). Yes. I noted this in the commit message. emaste will commit the check shortly which would prevent using inadequate linker for building kernel. > > I used the current as and ld with old gcc until early this year. This is > a bit fragile and regressed with lld. The regression was minor -- lld > expanded the bss in the data section, giving a few MB of bloat. Now the > bloat relative to old ld is only 29K in the text section. > > Debugging and presumably profiling is also broken: > > XX db> x/ia fpusususpend,20 > XX fpususpend: pushq %rbp > XX fpususpend+0x1: movq %rsp,%rbp > XX fpususpend+0x4: pushq %rbx > XX fpususpend+0x5: pushq %rax > XX fpususpend+0x6: movl %cr0,%ebx > XX fpususpend+0x9: clts > XX fpususpend+0xb: call 0xffffffff80299960 > XX fpususpend+0x10: movl %ebx,%cr0 > XX fpususpend+0x13: addq $0x8,%rsp > XX fpususpend+0x17: popq %rbx > XX fpususpend+0x18: popq %rbp > XX fpususpend+0x19: ret > XX db> x/ia 0xffffffff80299960 > XX 0xffffffff80299960: jmpl *0x56769a(%rip) > > ddb still doesn't understand pc-relative offsets. Decoding this with > "p/x 0xffffffff80299960+6" gives brwsection+0x1000. At this address > is a pointer to fpusave_xsave since the hardware supports xsave. > > This looks like a small or null pessimization. The branch in the old > version is probably faster. The following simple test on Haswell shows > that all reasonable versions have almost the same speed when cached: Yes, I already noted and mjg noted that ifuncs are directed through PLT. I remember that it was not the case when I did it the first time, but then both compiler and linker were different. I tried a quick experiment with adding -fvisibility=hidden to the kernel compilation, but it still leaves ifunc relocations on PLT (and PLT correctly consists only of ifuncs). I might look some more on it later. Anyway, there are more uses, and SMAP plus pmap applications of ifunc are more clear. I selected amd64/fpu.c for the initial commit to check toolchain in wild since this case is simplest and easy to diagnose if failing, I hope. > > XX #include > XX > XX #ifndef FUNC > XX #define FUNC cfoo > XX #endif > XX > XX int FUNC(void); > XX > XX int asmifunctargfoo(void); > XX int (*ifuncaddr)(void) = asmifunctargfoo; > XX int xsave = 1; > XX > XX int > XX cfoo(void) > XX { > XX return (xsave != 0 ? 2 : 1); > XX } > XX > XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret"); > XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret"); > XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)"); > XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret"); /* for xsave != 0 */ > XX > XX int > XX main(void) > XX { > XX volatile int i, res;; > XX > XX res = 0; > XX for (i = 0; i < 1000000000; i++) > XX res += FUNC(); > XX return (0); > XX } > > Compile this with gcc -o foo foo.c -O -FUNC={cfoo,asmbranchfoo,asmnobranchfoo, > asmifuncfoo). > > cfoo, asmnobranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at > 4.08 Ghz) on Haswell. Only asmnobranchfoo is slower -- it takes about > 1.78 seconds. That is a whole quarter of a cycle longer (7.25 cycles). > > The uncached case is hard to test. This loop runs in parallel with itself > as far as possible, so the main limit is the number of dependencies and > latency for branch prediction might not matter provided it is less than > the loop time of 7 cycles. In normal use, probably nothing stays cached > because it is rarely called. This also means that optimizations are in > the noise. > > I think the direct branch can be statically predicted better than the > indirect branch, and it is easy to arrange using excessive __predict_* > optimizations that it is right on some hardware. > > Bruce