Date: Thu, 10 May 2018 22:38:16 +0300
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r333461 - head/sys/amd64/amd64
Message-ID: <20180510193816.GF6887@kib.kiev.ua>
In-Reply-To: <20180511012309.V3949@besplex.bde.org>
References: <201805101501.w4AF1iI0039082@repo.freebsd.org> <20180511012309.V3949@besplex.bde.org>
On Fri, May 11, 2018 at 03:31:46AM +1000, Bruce Evans wrote:
> On Thu, 10 May 2018, Konstantin Belousov wrote:
> 
> > Log:
> >   Make fpusave() and fpurestore() on amd64 ifuncs.
> > 
> >   From now on, linking amd64 kernel requires either lld or newer ld.bfd.
> 
> This breaks building with gcc:
> 
> XX cc1: warnings being treated as errors
> XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes]
> XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls]
> XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here
> XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes]
> XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls]
> XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here
Yes.  Nothing unexpected.

> After building fpu.o with clang, linking doesn't require a newer ld, but
> booting does -- after linking with an older ld, the boot panics with a
> page fault near fork().
Yes.  I noted this in the commit message.  emaste will shortly commit a
check that prevents using an inadequate linker to build the kernel.

> I used the current as and ld with old gcc until early this year.  This is
> a bit fragile and regressed with lld.  The regression was minor -- lld
> expanded the bss in the data section, giving a few MB of bloat.  Now the
> bloat relative to old ld is only 29K in the text section.
> 
> Debugging and presumably profiling is also broken:
> 
> XX db> x/ia fpususpend,20
> XX fpususpend:      pushq   %rbp
> XX fpususpend+0x1:  movq    %rsp,%rbp
> XX fpususpend+0x4:  pushq   %rbx
> XX fpususpend+0x5:  pushq   %rax
> XX fpususpend+0x6:  movl    %cr0,%ebx
> XX fpususpend+0x9:  clts
> XX fpususpend+0xb:  call    0xffffffff80299960
> XX fpususpend+0x10: movl    %ebx,%cr0
> XX fpususpend+0x13: addq    $0x8,%rsp
> XX fpususpend+0x17: popq    %rbx
> XX fpususpend+0x18: popq    %rbp
> XX fpususpend+0x19: ret
> XX db> x/ia 0xffffffff80299960
> XX 0xffffffff80299960: jmpl  *0x56769a(%rip)
> 
> ddb still doesn't understand pc-relative offsets.  Decoding this with
> "p/x 0xffffffff80299960+6" gives brwsection+0x1000.  At this address
> is a pointer to fpusave_xsave since the hardware supports xsave.
> 
> This looks like a small or null pessimization.  The branch in the old
> version is probably faster.  The following simple test on Haswell shows
> that all reasonable versions have almost the same speed when cached:
Yes, I already noted, and mjg noted as well, that ifunc calls are directed
through the PLT.  I remember that this was not the case when I did it the
first time, but both the compiler and the linker were different then.

I tried a quick experiment with adding -fvisibility=hidden to the kernel
compilation, but it still leaves the ifunc relocations in the PLT (and the
PLT correctly consists only of ifuncs).  I might look at it some more
later.  Anyway, there are more uses to come, and the SMAP and pmap
applications of ifuncs are clearer.

I selected amd64/fpu.c for the initial commit to check the toolchain in
the wild, since this case is the simplest and, I hope, easy to diagnose
if it fails.
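For reference, here is a minimal userland sketch of the ifunc pattern under
discussion, assuming GCC/Clang's __attribute__((ifunc("resolver"))) syntax on
x86-64 and a linker that handles the resulting R_X86_64_IRELATIVE relocations
(lld or a newer ld.bfd, per the commit message).  The names do_save, impl_new,
impl_old and the SSE4.2 check are illustrative stand-ins, not the actual
fpu.c code:

	#include <stdio.h>

	static int impl_new(int x) { return (x * 2); }	/* stand-in for fpusave_xsave */
	static int impl_old(int x) { return (x + x); }	/* stand-in for fpusave_fxsave */

	/*
	 * The resolver runs once, at relocation time; the pointer it returns
	 * is what later calls to do_save() are redirected to, via the same
	 * kind of indirection that the ddb dump above shows as a PLT jmp.
	 */
	static int (*resolve_do_save(void))(int)
	{
		__builtin_cpu_init();
		return (__builtin_cpu_supports("sse4.2") ? impl_new : impl_old);
	}

	int do_save(int x) __attribute__((ifunc("resolve_do_save")));

	int
	main(void)
	{
		printf("%d\n", do_save(21));
		return (0);
	}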
> 
> XX #include <sys/cdefs.h>
> XX 
> XX #ifndef FUNC
> XX #define FUNC cfoo
> XX #endif
> XX 
> XX int FUNC(void);
> XX 
> XX int asmifunctargfoo(void);
> XX int (*ifuncaddr)(void) = asmifunctargfoo;
> XX int xsave = 1;
> XX 
> XX int
> XX cfoo(void)
> XX {
> XX 	return (xsave != 0 ? 2 : 1);
> XX }
> XX 
> XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret");
> XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret");
> XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)");
> XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret");	/* for xsave != 0 */
> XX 
> XX int
> XX main(void)
> XX {
> XX 	volatile int i, res;
> XX 
> XX 	res = 0;
> XX 	for (i = 0; i < 1000000000; i++)
> XX 		res += FUNC();
> XX 	return (0);
> XX }
> 
> Compile this with gcc -o foo foo.c -O -DFUNC={cfoo,asmbranchfoo,asmnobranchfoo,asmifuncfoo}.
> 
> cfoo, asmbranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at
> 4.08 GHz) on Haswell.  Only asmnobranchfoo is slower -- it takes about
> 1.78 seconds.  That is a whole quarter of a cycle longer (7.25 cycles).
> 
> The uncached case is hard to test.  This loop runs in parallel with itself
> as far as possible, so the main limit is the number of dependencies, and
> branch-prediction latency might not matter provided it is less than the
> loop time of 7 cycles.  In normal use, probably nothing stays cached
> because the function is rarely called.  This also means that any
> optimizations are in the noise.
> 
> I think the direct branch can be statically predicted better than the
> indirect branch, and with excessive __predict_* optimizations it is easy
> to arrange that it is predicted correctly on some hardware.
> 
> Bruce
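As a point of comparison for Bruce's last paragraph, here is a hedged sketch
of the non-ifunc alternative: keep a direct branch on a boot-time flag and
bias it with __predict_true() (FreeBSD's wrapper around __builtin_expect() in
<sys/cdefs.h>).  All names below are illustrative stand-ins, not the actual
fpu.c code:

	#include <string.h>

	#ifndef __predict_true		/* FreeBSD defines this in <sys/cdefs.h> */
	#define	__predict_true(exp)	__builtin_expect((exp) != 0, 1)
	#endif

	static int use_xsave = 1;	/* decided once at boot in the real kernel */

	static void save_xsave(void *a)  { memset(a, 1, 64); }	/* stand-in for xsave */
	static void save_fxsave(void *a) { memset(a, 2, 64); }	/* stand-in for fxsave */

	static inline void
	save_fpu_state(void *addr)
	{
		/* Direct, statically predictable branch: no PLT stub, no indirect jump. */
		if (__predict_true(use_xsave != 0))
			save_xsave(addr);
		else
			save_fxsave(addr);
	}

	int
	main(void)
	{
		char buf[64];

		save_fpu_state(buf);
		return (buf[0] == 1 ? 0 : 1);
	}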