Date: Fri, 11 May 2018 03:31:46 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kib@freebsd.org>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r333461 - head/sys/amd64/amd64
Message-ID: <20180511012309.V3949@besplex.bde.org>
In-Reply-To: <201805101501.w4AF1iI0039082@repo.freebsd.org>
References: <201805101501.w4AF1iI0039082@repo.freebsd.org>

On Thu, 10 May 2018, Konstantin Belousov wrote:

> Log:
>   Make fpusave() and fpurestore() on amd64 ifuncs.
>
>   From now on, linking amd64 kernel requires either lld or newer ld.bfd.

This breaks building with gcc:

XX cc1: warnings being treated as errors
XX ../../../amd64/amd64/fpu.c:195: warning: 'ifunc' attribute directive ignored [-Wattributes]
XX ../../../amd64/amd64/fpu.c:195: warning: redundant redeclaration of 'fpusave' [-Wredundant-decls]
XX ./machine/fpu.h:64: warning: previous declaration of 'fpusave' was here
XX ../../../amd64/amd64/fpu.c:202: warning: 'ifunc' attribute directive ignored [-Wattributes]
XX ../../../amd64/amd64/fpu.c:202: warning: redundant redeclaration of 'fpurestore' [-Wredundant-decls]
XX ./machine/fpu.h:62: warning: previous declaration of 'fpurestore' was here

After building fpu.o with clang, linking doesn't require a newer ld, but
booting does -- after linking with an older ld, the boot panics with a
page fault near fork().

I used the current as and ld with old gcc until early this year.  This is
a bit fragile and regressed with lld.  The regression was minor -- lld
expanded the bss in the data section, giving a few MB of bloat.  Now the
bloat relative to old ld is only 29K in the text section.

Debugging, and presumably profiling, is also broken:

XX db> x/ia fpususpend,20
XX fpususpend:            pushq   %rbp
XX fpususpend+0x1:        movq    %rsp,%rbp
XX fpususpend+0x4:        pushq   %rbx
XX fpususpend+0x5:        pushq   %rax
XX fpususpend+0x6:        movl    %cr0,%ebx
XX fpususpend+0x9:        clts
XX fpususpend+0xb:        call    0xffffffff80299960
XX fpususpend+0x10:       movl    %ebx,%cr0
XX fpususpend+0x13:       addq    $0x8,%rsp
XX fpususpend+0x17:       popq    %rbx
XX fpususpend+0x18:       popq    %rbp
XX fpususpend+0x19:       ret
XX db> x/ia 0xffffffff80299960
XX 0xffffffff80299960:    jmpl    *0x56769a(%rip)

ddb still doesn't understand pc-relative offsets.  Decoding the operand by
hand ("p/x 0xffffffff80299960+6+0x56769a") gives brwsection+0x1000.  At
this address is a pointer to fpusave_xsave, since the hardware supports
xsave.

This looks like a small or null pessimization.  The branch in the old
version is probably faster.  The following simple test on Haswell shows
that all reasonable versions have almost the same speed when cached:

XX #include <sys/cdefs.h>
XX
XX #ifndef FUNC
XX #define FUNC cfoo
XX #endif
XX
XX int FUNC(void);
XX
XX int asmifunctargfoo(void);
XX int (*ifuncaddr)(void) = asmifunctargfoo;
XX int xsave = 1;
XX
XX int
XX cfoo(void)
XX {
XX 	return (xsave != 0 ? 2 : 1);
XX }
XX
XX __asm(".text; asmbranchfoo: cmpl $0,xsave(%rip); je 1f; movl $2,%eax; ret; 1: movl $1,%eax; ret");
XX __asm(".text; asmnobranchfoo: cmpl $0,xsave(%rip); setne %al; movzbl %al,%eax; ret");
XX __asm(".text; asmifuncfoo: jmp *ifuncaddr(%rip)");
XX __asm(".text; asmifunctargfoo: movl $2,%eax; ret");	/* for xsave != 0 */
XX
XX int
XX main(void)
XX {
XX 	volatile int i, res;
XX
XX 	res = 0;
XX 	for (i = 0; i < 1000000000; i++)
XX 		res += FUNC();
XX 	return (0);
XX }

Compile this with
gcc -o foo foo.c -O -DFUNC={cfoo,asmbranchfoo,asmnobranchfoo,asmifuncfoo}.

cfoo, asmnobranchfoo and asmifuncfoo take 1.71-1.72 seconds (~7 cycles at
4.08 GHz) on Haswell.  Only asmbranchfoo is slower -- it takes about 1.78
seconds.  That is a whole quarter of a cycle longer (7.25 cycles).

The uncached case is hard to test.  This loop runs in parallel with itself
as far as possible, so the main limit is the number of dependencies, and
branch-prediction latency might not matter provided it is less than the
loop time of 7 cycles.  In normal use, probably nothing stays cached
because fpusave() is rarely called.
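
For reference, the extension that old gcc ignores above is the gcc/clang
"ifunc" function attribute: the public symbol is bound, once, to whatever a
resolver function returns when the symbol is resolved, instead of testing
use_xsave on every call.  A minimal user-level sketch of the mechanism (my
illustration only -- not the kernel's own wrapper macros, and
fpusave_fxsave here is an assumed name for the legacy path):

XX /*
XX  * Sketch of the GNU ifunc extension.  The resolver runs when the
XX  * linker/loader resolves "fpusave"; every later call jumps straight
XX  * to the address it returned.
XX  */
XX extern int use_xsave;			/* set while probing CPU features */
XX
XX static void fpusave_xsave(void *addr)  { (void)addr; /* xsave path */ }
XX static void fpusave_fxsave(void *addr) { (void)addr; /* fxsave path */ }
XX
XX static void (*fpusave_resolver(void))(void *)
XX {
XX 	return (use_xsave ? fpusave_xsave : fpusave_fxsave);
XX }
XX
XX void fpusave(void *addr) __attribute__((ifunc("fpusave_resolver")));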
This also means that such optimizations are in the noise.  I think the
direct branch can be statically predicted better than the indirect branch,
and with excessive __predict_*() annotations it is easy to arrange for the
static prediction to be right on some hardware (sketch below).

Bruce
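
PS: by "excessive __predict_*() optimizations" I mean something like the
following sketch (illustration only; the split into fpusave_xsave() and
fpusave_fxsave() and the direction of the hint are my assumptions, and
whether the static hint actually wins depends on the hardware):

XX #include <sys/cdefs.h>	/* __predict_false() wraps __builtin_expect() */
XX
XX extern int use_xsave;
XX void fpusave_xsave(void *addr);
XX void fpusave_fxsave(void *addr);
XX
XX void
XX fpusave(void *addr)
XX {
XX 	/*
XX 	 * Hint that the legacy fxsave path is unlikely, so the common
XX 	 * xsave path becomes the straight-line fall-through; classic
XX 	 * static predictors guess forward conditional branches as
XX 	 * not taken.
XX 	 */
XX 	if (__predict_false(!use_xsave))
XX 		fpusave_fxsave(addr);
XX 	else
XX 		fpusave_xsave(addr);
XX }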